Senior Engineering Manager, Compute

Temporal · Enterprise · United States · Engineering

Temporal is seeking a Senior Engineering Manager to lead the Compute team, which builds and operates the invisible compute substrate for AI workloads. The role covers strategic direction, technical leadership, roadmap development, and operational excellence for a large-scale, multi-tenant compute platform that powers AI-native companies.

What you'd actually do

  1. Own the strategy and standards of excellence for the compute layer that the world's agents run on, across design, delivery, and operations. Build a culture of ownership, quality, and customer-first decision-making.
  2. Lead, hire, and grow a high-ownership team. Roll up your sleeves and go deep into the trenches, staying close to design docs and code rather than managing from a distance. Coach engineers, level them up, and clear the friction that slows them down.
  3. Drive the arc from today's compute toward the next generation of compute platforms. Ground prioritization in customer and design-partner feedback, and turn ambiguous, fast-moving requirements into predictable, iterative delivery.
  4. When you run frontier AI in production, reliability _is_ the product. Own operations, run on-call and incident response, and drive blameless postmortems and the systemic fixes that prevent recurrence.
  5. Guide the hard architectural decisions for large-scale, multi-tenant compute, where technical concerns cut across workload isolation and security, scheduling, fleet efficiency / utilization / goodput, and performance, while ensuring the platform is reliable and efficient for the workloads that depend on it.

Skills

Required

  • Strong leadership, coaching, and performance management; ability to grow engineers and build a healthy, accountable, high-ownership team.
  • Excellence in execution: planning, prioritization, and delivering iterative milestones in an ambiguous, fast-moving environment while managing unplanned work.
  • Fleet thinking: utilization, goodput, capacity and supply planning, and cost discipline as first-class engineering concerns.
  • Live-site reliability craft: on-call, incident management & response, and postmortem-driven continuous improvement.
  • Strong command of the building blocks of a compute platform: multi-tenant isolation and security, scheduling, and resource management.
  • Ability to review and raise the bar on technical artifacts (design docs, code reviews) across a distributed-systems codebase.

Nice to have

  • MicroVMs and virtualization (Firecracker, gVisor, Edera) or managed-compute primitives (AWS Fargate, GCP Cloud Run, AWS Lambda), and/or Kubernetes internals.
  • Building serverless or hosted-compute products from 0 to 1, including the rapid-delivery-vs-durable-platform tradeoffs that

What the JD emphasized

  • world's agents run on
  • compute substrate that the world's most demanding AI workloads will run on
  • operated compute at planet scale
  • fleet efficiency / utilization / goodput
  • cost-per-unit-of-compute
  • reliability _is_ the product
  • multi-tenant compute that other people's production workloads depend on

Other signals

  • AI companies run on Temporal
  • compute substrate for AI workloads
  • AI-native companies executed 1.86 trillion actions on Temporal Cloud
  • AI revolution run on Temporal