Staff Infrastructure Engineer

Replit · Enterprise · Foster City, CA · Hybrid · Engineering

Staff Infrastructure Engineer role focused on ensuring the reliability, scalability, and performance of Replit's platform infrastructure. Responsibilities include driving automation, optimizing cloud deployments (Kubernetes, GCP), elevating developer experience, building shared tooling, debugging complex systems, and mentoring the engineering team. Requires strong programming skills (Python/Go), distributed systems knowledge, experience with Kubernetes, IaC, and observability solutions.

What you'd actually do

  1. Architect, build, and improve automation to eliminate toil and operational work. Design and maintain CI/CD pipelines and infrastructure automation using tools like Terraform or Pulumi. Create self-healing systems that can automatically respond to common failure scenarios.
  2. Collaborate with core infrastructure and product teams to tune and optimize the performance of our cloud deployments (Kubernetes, Docker, GCP). Identify and resolve performance bottlenecks, implement capacity planning strategies, and reduce latency across global regions.
  3. Design and implement improvements to our build, test, and deployment systems to make software delivery faster, safer, and more reliable for all engineers.
  4. Partner directly with service owners across Replit to understand their pain points and collaborate on implementing build/test/deploy enhancements within their specific services.
  5. Create and maintain centralized tooling and automation that improves the entire engineering lifecycle, from local development to production monitoring.

Skills

Required

  • Python
  • Go
  • Kubernetes
  • Terraform
  • GCP
  • Distributed Systems
  • Infrastructure as Code
  • Monitoring
  • Observability
  • Debugging
  • Performance Tuning
  • Incident Management
  • Site Reliability Engineering

Nice to have

  • Prometheus
  • Grafana
  • Datadog
  • High throughput systems
  • Low latency systems
  • Rapid-growth environments
  • Technical writing
  • Training material creation

What the JD emphasized

  • 8-10 years of experience in Infrastructure Engineering or similar roles (DevOps, Systems Engineering, Site Reliability Engineering)
  • Strong programming skills in languages like Python or Go
  • Deep understanding of distributed systems. You’ve designed, built, scaled, and maintained production services and know how to compose a service-oriented architecture.
  • Experience with container orchestration platforms (Kubernetes) and cloud-native technologies.
  • Proven track record of implementing and maintaining monitoring/observability solutions, with strong skills in debugging and performance tuning.
  • Strong incident management skills with experience leading incident response and demonstrated critical thinking under pressure.
  • Experience with infrastructure as code (e.g., Terraform) and configuration management tools.