Site Reliability Engineer 3

MongoDB · Enterprise · New York, NY · PTO · Site Reliability Engineering

Site Reliability Engineer responsible for designing and building the global infrastructure for MongoDB Atlas, with a focus on low latency, scalability, and resilience. The role involves automating, monitoring, and optimizing infrastructure performance across multiple cloud providers, and includes a weekly on-call rotation.

What you'd actually do

  1. Design and build the infrastructure for a global cloud service that comprises hundreds of thousands of MongoDB clusters, processes a billion metrics per day, and replicates tens of billions of database writes to our backup service
  2. Design, implement, and troubleshoot the automation and monitoring of services that seamlessly span the globe, including several cloud providers
  3. Become an expert in infrastructure performance, helping us optimize from the application level all the way through the firmware
  4. Build for resilience. Our goal is that nobody’s pager goes off, ever. Are we there yet? No. Are we really close? Very. While we work on that, participate in a weekly on-call rotation
  5. Improve our infrastructure capabilities, optimizing for cost, simplicity, and maintainability

Skills

Required

  • 3+ years of experience running a mission-critical service at scale in a Linux environment
  • Firm grasp of at least one modern programming language, beyond basic scripting
  • Familiarity with web and network protocols and standards (HTTP, TLS, DNS, etc.)
  • Bachelor’s degree in Computer Science or equivalent experience
  • Experience writing automation tools and eagerness to "automate all the things"

Nice to have

  • Experience building large applications from scratch, complete with CI/CD infrastructure
  • Experience in networking, security, hardware, or OS performance tuning
  • Experience with at least one of the major cloud providers (Amazon Web Services, Google Cloud, Microsoft Azure)
  • Experience managing Kubernetes clusters or other container orchestration infrastructure
  • Experience with observability of large-scale distributed systems

What the JD emphasized

  • mission-critical service at scale
  • firm grasp of at least one modern programming language, beyond basic scripting
  • eagerness to "automate all the things"