Site Reliability Engineer 3

MongoDB · Enterprise · New York, NY · PTO · Site Reliability Engineering

MongoDB is seeking a Site Reliability Engineer to design and build the global infrastructure for its flagship MongoDB Atlas platform. The role focuses on keeping services low-latency, resilient, and scalable across multiple cloud providers, with a strong emphasis on automation and infrastructure-as-code. The SRE team is integrated with other engineering teams and plays a critical role in maintaining the reliability and performance of MongoDB's services, which support AI-powered applications.

What you'd actually do

  1. Design and build the infrastructure for a global cloud service that comprises hundreds of thousands of MongoDB clusters, processes a billion metrics per day, and replicates tens of billions of database writes to our backup service
  2. Design, implement, and troubleshoot the automation and monitoring of services that seamlessly span the globe, including several cloud providers
  3. Become an expert in infrastructure performance, helping us optimize from the application level all the way through the firmware
  4. Build for resilience. Our goal is that nobody’s pager goes off, ever. Are we there yet? No. Are we really close? Very. While we work on that, participate in a weekly on-call rotation
  5. Improve our infrastructure capabilities, optimizing for cost, simplicity, and maintainability

Skills

Required

  • 3+ years of experience running a mission-critical service at scale in a Linux environment
  • Firm grasp of at least one modern programming language, beyond basic scripting
  • Familiarity with web and network protocols and standards (HTTP, TLS, DNS, etc.)
  • Bachelor’s degree in Computer Science or equivalent experience
  • Experience writing automation tools and eagerness to "automate all the things"

Nice to have

  • Experience building large applications from scratch, complete with CI/CD infrastructure
  • Experience in networking, security, hardware, or OS performance tuning
  • Experience with at least one of the major cloud providers (Amazon Web Services, Google Cloud Platform, Microsoft Azure)
  • Experience managing Kubernetes clusters or some other container orchestration infrastructure
  • Experience with observability of large-scale distributed systems

What the JD emphasized

  • mission-critical service at scale
  • automation and monitoring
  • infrastructure performance
  • resilience
  • optimize from the application level all the way through the firmware