Senior Site Reliability Engineer, Fleet Management

MongoDB · Enterprise · New York, NY +2 · Remote · PTO · Site Reliability Engineering

This Senior Site Reliability Engineer role focuses on fleet management, specifically the Kubernetes runtime environment and its core components. The role involves contributing to scalable and secure infrastructure, providing internal support, and participating in on-call rotations. Experience with Go, Python, Kubernetes, Linux internals, and distributed systems is required.

What you'd actually do

  1. Contribute to developing and maintaining a scalable and secure runtime environment on top of Kubernetes that supports product needs across MongoDB
  2. Provide internal support for our Kubernetes ecosystem, partnering with engineering teams to help them solve domain-specific problems
  3. Participate in a 24/7 on-call rotation to resolve critical issues

Skills

Required

  • Go
  • Python
  • Kubernetes
  • Linux operating system internals
  • networking concepts
  • distributed systems
  • software development
  • operational ownership
  • debugging production issues

Nice to have

  • Helm
  • Kustomize
  • Gatekeeper
  • Kyverno
  • CRDs/Operators
  • CRI
  • CSI
  • AWS
  • GCP
  • Azure
  • Terraform
  • Crossplane
  • ACK

What the JD emphasized

  • 6+ years of experience in software development and operating distributed systems
  • strong commitment to code quality and testing practices
  • deep experience using and extending containerization technologies, preferably Kubernetes
  • solid understanding of Linux operating system internals and networking concepts
  • strong operational ownership, including a track record of debugging complex production issues and driving them to resolution
  • preference for automation over manual processes ("allergic to ops work")
  • small team of software engineers with a strong bias toward building software solutions to eliminate toil