Senior Site Reliability Engineer

Synthesia · Multimodal · United States · Remote · Engineering

This is a Senior Site Reliability Engineer role on the Cloud Infrastructure team of an AI video platform company. The role focuses on operational excellence, incident management, automation, reliability engineering, vendor management (including LLM API providers), and FinOps. The engineer will own and operate systems, build automation, and keep the platform reliable and efficient.

What you'd actually do

  1. Incident management & operational excellence — take custody of the incident process: on-call quality, response, post-mortems, and driving down incident count, time-to-detect, and time-to-resolve.
  2. Automation & reliability engineering — automate low-frequency, high-consequence operations (the certificate-renewal class of problem — rare, easy to forget, outage-causing when missed), not just the high-frequency toil. You decide what to automate based on risk and blast radius, not just time saved.
  3. A platform domain — over time, deep ownership of a domain such as Temporal, observability, or Kubernetes operations, partnering with the engineers building in it.
  4. Vendor & third-party management — own key external relationships and integrations (e.g. LLM API providers, third-party services), today managed manually and informally. Bring structure, automation, and bus-factor resilience.
  5. FinOps — own cloud and platform cost visibility and efficiency, and the mechanics of how usage maps to billing.
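
The prioritization idea in item 2 (automate by risk and blast radius, not only by time saved) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the ManualOperation fields, the scoring formula (runs per year × chance of a miss × outage minutes), and the sample numbers are hypothetical and do not come from the job description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManualOperation:
    """A manually performed task (hypothetical model, for illustration only)."""
    name: str
    runs_per_year: float        # how often the task comes up
    minutes_per_run: float      # hands-on time for each run
    miss_probability: float     # chance a given run is forgotten or botched (0..1)
    outage_minutes: float       # blast radius: user-facing minutes lost on a miss

    def toil_minutes_per_year(self) -> float:
        # Classic "time saved" view: total hands-on time per year.
        return self.runs_per_year * self.minutes_per_run

    def expected_outage_minutes_per_year(self) -> float:
        # Risk view: how much outage the task is expected to cause per year.
        return self.runs_per_year * self.miss_probability * self.outage_minutes


def rank_by_toil(ops):
    """Order candidates by time saved, highest first."""
    return sorted(ops, key=lambda op: op.toil_minutes_per_year(), reverse=True)


def rank_by_risk(ops):
    """Order candidates by expected outage impact, highest first."""
    return sorted(ops, key=lambda op: op.expected_outage_minutes_per_year(), reverse=True)


# Made-up sample data: one rare, high-consequence task and one frequent, low-impact task.
ops = [
    ManualOperation("renew TLS certificate", runs_per_year=1, minutes_per_run=30,
                    miss_probability=0.2, outage_minutes=240),
    ManualOperation("restart flaky worker", runs_per_year=200, minutes_per_run=2,
                    miss_probability=0.01, outage_minutes=5),
]

if __name__ == "__main__":
    print("By toil:", [op.name for op in rank_by_toil(ops)])
    print("By risk:", [op.name for op in rank_by_risk(ops)])
```

With these sample numbers, the frequent restart task ranks first by toil, while the once-a-year certificate renewal ranks first by risk. In practice the inputs would be estimates, so the ranking is a starting point for discussion rather than a precise measurement.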

Skills

Required

  • AWS
  • Kubernetes
  • MongoDB
  • Python
  • production operations experience
  • operations-and-reliability mindset
  • engineering mindset
  • incident management
  • automation
  • reliability engineering
  • vendor management
  • FinOps
  • scripting

Nice to have

  • vendor/cost management exposure
  • Temporal
  • observability tooling

What the JD emphasized

  • take real ownership
  • genuine ownership
  • build the automation and tooling
  • own domains end to end
  • risk and blast radius
  • Critical operational knowledge is documented and shared
  • Measurable reliability gains
  • High-risk manual processes are automated
  • Strong production operations experience
  • operations-and-reliability mindset
  • instinct to engineer the problem away
  • Sound judgement on incidents and risk
  • Calm and clear under pressure
  • Influences through relationships and evidence
  • comfortable owning a domain
  • partnering across teams