Software Engineer, Inference Infrastructure

Tesla · Auto · Palo Alto, CA · Tesla AI

The role focuses on building and scaling the inference infrastructure for AI models on custom AI hardware. This includes owning the AI inference cluster, developing job scheduling and cluster management systems, designing inference pipelines for validation and deployment, and creating developer tooling for model validation and debugging. The position requires strong backend engineering fundamentals, experience with hardware accelerator infrastructure, and familiarity with ML inference workloads.

What you'd actually do

  1. Own & scale the AI inference cluster — the physical and software platform that runs AI workloads on custom AI hardware across thousands of boards
  2. Build & improve job scheduling, hardware onboarding, and cluster self-healing systems that keep the fleet running at 95%+ uptime
  3. Design & implement inference pipelines that unify evals, sims, rollouts, and visualizations across AI and Optimus teams
  4. Build developer tooling that makes compiler-produced artifacts easy to run, validate, and debug on real hardware at scale
  5. Contribute to flashing, inventory management, and fleet management infrastructure for different hardware generations

Skills

Required

  • Strong backend engineering fundamentals: distributed systems, job orchestration, reliability, and scale
  • Experience with hardware accelerator infrastructure (TPUs, custom AI chips, or similar): managing a large fleet of accelerators and keeping them healthy and utilized
  • Familiarity with cluster orchestration (Kubernetes, SLURM, or similar) in bare-metal and containerized environments
  • Proficiency in Python
  • Experience with low-level systems concepts: networking, file systems, and process management
  • Familiarity with ML inference workloads and what makes them fast or slow at scale
  • Strong ownership mindset: comfortable navigating ambiguous problems, diving into unfamiliar codebases, and driving things to completion without hand-holding

Nice to have

  • PyTorch
  • Go or C++
  • CI/CD pipelines for hardware-in-the-loop validation
  • Fleet management or device provisioning at scale
  • gRPC
  • Distributed task queues
  • High-throughput data pipelines
  • MLIR or compiler toolchains