Senior Infrastructure Engineer

Replit · Enterprise · Foster City, CA · Hybrid · Engineering

Senior Infrastructure Engineer role focused on ensuring the reliability, scalability, and performance of Replit's platform infrastructure, including automation, CI/CD, cloud deployments, and developer experience improvements. Requires strong SRE/DevOps experience, programming skills (Python/Go), distributed systems knowledge, and experience with Kubernetes, Terraform, and observability tools.

What you'd actually do

  1. Build and improve automation to eliminate toil and operational work. Maintain CI/CD pipelines and infrastructure automation using tools like Terraform or Pulumi. Create self-healing systems that can automatically respond to common failure scenarios.
  2. Collaborate with core infrastructure and product teams to tune and optimize the performance of our cloud deployments (Kubernetes, Docker, GCP). Identify and resolve performance bottlenecks and implement capacity planning strategies.
  3. Design and implement improvements to our build, test, and deployment systems to make software delivery faster, safer, and more reliable for all engineers.
  4. Partner with service owners across Replit to understand their pain points, and collaborate on build/test/deploy enhancements within their specific services.
  5. Create and maintain centralized tooling and automation that improves the engineering lifecycle, from local development to production monitoring.

Skills

Required

  • Site Reliability Engineering
  • DevOps
  • Systems Engineering
  • Infrastructure Engineering
  • Python
  • Go
  • Distributed Systems
  • Service-Oriented Architecture
  • Kubernetes
  • Container Orchestration
  • Cloud-Native Technologies
  • Monitoring
  • Observability
  • Debugging
  • Performance Tuning
  • Incident Management
  • Incident Response
  • Infrastructure as Code
  • Terraform
  • Configuration Management
  • Communication Skills

Nice to have

  • Google Cloud Platform (GCP)
  • Prometheus
  • Grafana
  • Datadog
  • High Throughput Systems
  • Low Latency Systems
  • Rapid-Growth Environments