Systems Engineer, Metrics and Alerting at Cloudflare

About Us

At Cloudflare, we are on a mission to help build a better Internet. Today the company runs one of the world’s largest networks that powers millions of websites and other Internet properties for customers ranging from individual bloggers to SMBs to Fortune 500 companies. Cloudflare protects and accelerates any Internet application online without adding hardware, installing software, or changing a line of code. Internet properties powered by Cloudflare all have web traffic routed through its intelligent global network, which gets smarter with every request. As a result, they see significant improvement in performance and a decrease in spam and other attacks. Cloudflare was named to Entrepreneur Magazine’s Top Company Cultures list and ranked among the World’s Most Innovative Companies by Fast Company.

At Cloudflare, we’re not looking for people who wait for a polished roadmap; we’re looking for the builders who see the cracks in the Internet that everyone else has simply learned to live with. We value candidates who have the instinct to spot a "normalized" problem and the AI-native curiosity to create a solution using the latest tools. Our culture is built on iteration, leveraging AI to ship faster today to make it better tomorrow, while ensuring that every improvement, no matter how small, is shared across the team to lift everyone up. If you’re the type of person who values curiosity over bureaucracy and who sees AI as a partner in solving tough problems to keep the Internet moving forward, you’ll fit right in.

**Available Locations: London or Lisbon**

About the Department

Production Engineering is responsible for the world’s most reliable, observable, performant, and safe network ecosystem. Our customers rely on our products and systems to safely modify, troubleshoot, and release products without external impact.

Our external customers rely on us to provide seamless and predictable incident, traffic, and policy management, resulting in the fastest and safest network services in the world.

We are accountable for the overall performance of internal and external facing services, guiding our product teams to optimal configurations and maximum efficiency. From the moment that a packet enters the Cloudflare ecosystem, we know exactly what its expected purpose and behaviour are, and we are capable of determining and exposing anomalous behaviour.

The Cloudflare network makes it possible to solve challenges at massive scale and efficiency which would be impossible for almost any other organization.

About the Team

This role is for the internal Observability Team, which is responsible for the observability platform and stack that keep our engineering teams productive. This includes, but is not limited to, metrics, alerting, error tracking, logging, and tracing.

In this role, you can expect to:

Design, deliver, and operate software and a platform that progresses Cloudflare's Observability competency
Solve scaling bottlenecks in critical services in our Metrics & Alerting pipeline
Work on highly distributed and scalable systems
Participate in the constant cycle of knowledge sharing and mentoring
Participate in the global on-call rotation for the services your team owns
Research and introduce cutting-edge technologies
Contribute to open-source

We are a small, well-funded, growing team focused on building an extraordinary company. This is a software engineering/systems engineering role and a superb opportunity to join a high-performing team, support Cloudflare’s mission, and help build a better Internet.

You may be a good fit for our team if you have:

A Software Engineering background and proficiency in high-level programming languages (e.g., Go)
Proficiency in data structures and databases such as TSDBs, columnar stores, or related technologies
Proficiency in distributed Linux environments
Proficiency in designing high-scale distributed systems
Proficiency in Prometheus, Alertmanager, Thanos
Experience working in a fast, high-growth environment
Experience working in a 24/7/365 service environment
Exquisite written and verbal communication skills
Familiarity with internetworking, networking protocols at Layers 2-7 of the OSI model, and BGP
Strong bias for action

Bonus points if you have:

Experience with high-bandwidth transit Internetworking and routing
Passion for code simplicity and performance

What Makes Cloudflare Special?

We’re not just a highly ambitious, large-scale technology company. We’re a highly ambitious, large-scale technology company with a soul. Fundamental to our mission to help build a better Internet is protecting the free and open Internet.

**Project Galileo**: Since 2014, we've equipped more than 2,400 journalism and civil society organizations in 111 countries with powerful tools, at no cost, to defend themselves against attacks that would otherwise censor their work. This is the same technology already used by Cloudflare’s enterprise customers.

**Athenian Project**: In 2017, we created the Athenian Project to ensure that state and local governments have the highest level of protection and reliability for free, so that their constituents have access to election information and voter registration. Since the project launched, we've provided services to more than 425 local government election websites in 33 states.

**1.1.1.1**: We released 1.1.1.1 to help fix the foundation of the Internet by building a faster, more secure and privacy-centric public DNS resolver. This is available publicly for everyone to use - it is the first consumer-focused service Cloudflare has ever released. Here’s the deal - we never, ever store client IP addresses. We will continue to abide by our privacy commitment and ensure that no user data is sold to advertisers or used to target consumers.

Sound like something you’d like to be a part of? We’d love to hear from you!

Please note that applicants who progress to the offer stage of the interview process may be asked to attend an in-person interview within one of the Cloudflare Offices or Cloudflare Hubs. More details about this will be available at that stage of the interview process.

This position may require access to information protected under U.S. export control laws, including the U.S. Export Administration Regulations. Please note that any offer of employment may be conditioned on your authorization to receive software or technology controlled under these U.S. export laws without sponsorship for an export license.

Cloudflare is proud to be an equal opportunity employer. We are committed to providing equal employment opportunity for all people and place great value in both diversity and inclusiveness. All qualified applicants will be considered for employment without regard to their, or any other person's, perceived or actual race, color, religion, sex, gender, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship, age, physical or mental disability, medical condition, family care status, or any other basis protected by law. We are an AA/Veterans/Disabled Employer.

Cloudflare provides reasonable accommodations to qualified individuals with disabilities. Please tell us if you require a reasonable accommodation to apply for a job. Examples of reasonable accommodations include, but are not limited to, changing the application process, providing documents in an alternate format, using a sign language interpreter, or using specialized equipment. If you require a reasonable accommodation to apply for a job, please contact us via e-mail at hr@cloudflare.com or via mail at 101 Townsend St., San Francisco, CA 94107.
