Site Reliability Engineer (Senior or Staff), Storage Layer Services (SLS)

MongoDB · Enterprise · Dublin, Ireland · PTO Site Reliability Engineering

Site Reliability Engineer (Senior or Staff) role on MongoDB's Storage Layer Services (SLS) team, focused on re-architecting the storage layer for MongoDB Atlas and ensuring its reliability, durability, and operational safety. The role involves building performant, multi-tenant distributed storage services, defining SLOs, shaping capacity plans, and optimizing infrastructure performance.

What you'd actually do

  1. Work on our multi-tenant distributed storage systems, balancing long-term strategic infrastructure goals with immediate engineering needs
  2. Build for reliability, making services and infrastructure available, resilient, fault-tolerant, and self-healing
  3. Identify and configure key metrics to detect incidents and quantify service health, availability, and performance
  4. Participate in a 24/7 on-call rotation to resolve issues involving the storage infrastructure
  5. Become an expert in infrastructure performance, helping us optimize from the application level all the way to the kernel

Skills

Required

  • Python, Go, or similar language
  • Operating distributed systems
  • Kubernetes
  • AWS, GCP, or Azure
  • Linux operating system internals
  • Networking concepts (TCP/IP, DNS, TLS, routing)

Nice to have

  • Stateful storage or database systems at scale
  • Leading major architectural shifts
  • Managing and scaling infrastructure across multi-cloud environments

What the JD emphasized

  • 6+ years of experience working on software development and operating distributed systems
  • Experience operating or supporting stateful storage or database systems at scale
  • Understanding of Linux operating system internals and networking concepts
  • Leading major architectural shifts
  • Managing and scaling infrastructure across multi-cloud environments