Senior Software Engineer, Infrastructure

Decagon · Vertical AI · New York, NY · Engineering

Senior Infrastructure Engineer at Decagon, a conversational AI platform company. The role focuses on designing, building, and operating production infrastructure for high-scale, low-latency systems, specifically in support of ML infrastructure for LLM inference. Responsibilities include owning critical services, improving reliability and performance, and improving developer tooling. The role requires experience running infrastructure at scale, a record of meeting high-availability and low-latency targets, and strong observability skills.

What you'd actually do

  1. Design and implement critical infrastructure services with strong SLOs, clear runbooks, and actionable telemetry.
  2. Partner with research and product teams to architect solutions, set up prototypes, evaluate performance, and scale new features.
  3. Tune service latencies: optimize networking paths, apply smart caching/queuing, and tune CPU/memory/I/O for tight p95/p99s.
  4. Evolve CI/CD, golden paths, and self-service tooling to improve developer velocity and safety.
  5. Support various deployment architectures for customers with robust observability and upgrade paths.

Skills

Required

  • 5+ years building and operating production infrastructure at scale
  • Depth in at least one area across Core/Data/AI-ML/Platform/Voice
  • Proven track record of meeting high-availability and low-latency targets (owning SLOs, p95/p99, and load testing)
  • Excellent observability chops (OpenTelemetry, Prometheus/Grafana, Datadog) and incident response (PagerDuty, SLO/error budgets)
  • Clear written communication and the ability to turn ambiguous requirements into simple, reliable designs

Nice to have

  • Experience being an early backend/platform/infrastructure engineer at another company
  • Strong Kubernetes experience (GKE/EKS/AKS) and experience across multiple cloud providers (GCP, AWS, and Azure)
  • Experience with customer‑managed deployments

What the JD emphasized

  • high availability
  • low latency