Site Reliability Engineer (Senior or Staff), Storage Layer Services (SLS)

MongoDB · Enterprise · New York, NY · PTO · Site Reliability Engineering

Site Reliability Engineer (Senior or Staff) on MongoDB's Storage Layer Services (SLS) team, focused on re-architecting the next-generation cloud storage architecture and ensuring its reliability, durability, and operational safety. The role involves partnering with storage service teams to define SLOs, shape capacity plans, and optimize infrastructure performance.

What you'd actually do

  1. Work on our multi-tenant distributed storage systems, balancing long-term strategic infrastructure goals with immediate engineering needs
  2. Build for reliability, making services and infrastructure available, resilient, fault-tolerant, and self-healing
  3. Identify and configure key metrics to detect incidents and quantify service health, availability, and performance
  4. Participate in a 24/7 on-call rotation to resolve issues involving the storage infrastructure
  5. Become an expert in infrastructure performance, helping us optimize from the application level all the way to the kernel

Skills

Required

  • 6+ years of experience developing software and operating distributed systems
  • Proficiency in Python, Go, or a similar language
  • Experience operating or supporting stateful storage or database systems at scale
  • Experience using and extending containerization technologies, particularly Kubernetes
  • Expertise in cloud infrastructure platforms, including AWS, Google Cloud Platform (GCP), or Azure
  • Understanding of Linux operating system internals and networking concepts

Nice to have

  • Leading major architectural shifts
  • Managing and scaling infrastructure across multi-cloud environments
  • Designing secure, multi-tenant runtime environments at scale