Staff Software Engineer, Infrastructure

Decagon · Vertical AI · San Francisco, CA · Engineering

This is a Staff Software Engineer, Infrastructure role at Decagon, a conversational AI platform company. The role focuses on building and operating production infrastructure for high-scale, low-latency systems, including ML infrastructure for LLM inference. The engineer will own critical services, improve reliability and performance, and enhance developer tooling. It requires 8+ years of experience with production infrastructure, depth in at least one of Core, Data, AI/ML, Platform, or Voice, and a proven track record of meeting high-availability and low-latency targets.

What you'd actually do

  1. Design and implement critical infrastructure services with strong SLOs, clear runbooks, and actionable telemetry.
  2. Partner with research and product teams to architect solutions, set up prototypes, evaluate performance, and scale new features.
  3. Tune service latencies: optimize networking paths, apply smart caching/queuing, and tune CPU/memory/I/O for tight p95/p99s.
  4. Evolve CI/CD, golden paths, and self‑service tooling to improve developer velocity and safety.
  5. Support a range of customer deployment architectures, each with robust observability and upgrade paths.

Skills

Required

  • 8+ years building and operating production infrastructure at scale
  • Depth in at least one area across Core/Data/AI-ML/Platform/Voice
  • Proven track record meeting high availability and low latency targets (owning SLOs, p95/p99, and load testing)
  • Excellent observability chops (OpenTelemetry, Prometheus/Grafana, Datadog) and incident response (PagerDuty, SLO/error budgets)
  • Clear written communication and the ability to turn ambiguous requirements into simple, reliable designs

Nice to have

  • Experience as an early backend/platform/infrastructure engineer at another company
  • Strong Kubernetes experience (GKE/EKS/AKS) and experience across multiple cloud providers (GCP, AWS, and Azure)
  • Experience with customer‑managed deployments

What the JD emphasized

  • high availability
  • low latency