Senior Production Engineer

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

This Senior Production Engineer role focuses on the reliability and scalability of Crusoe's AI-optimized cloud platform, with a strong emphasis on operating managed AI services, distributed AI pipelines, and inference services at scale. The work involves building automation, defining and improving SLIs/SLOs, and optimizing large-scale training and inference clusters.

What you'd actually do

  1. Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
  2. Build automation and reliability tooling to support distributed AI pipelines and inference services
  3. Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
  4. Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
  5. Automate observability by building telemetry, and develop performance-tuning strategies for latency-sensitive AI services

Skills

Required

  • Strong software engineering background
  • Demonstrated experience in distributed systems design and implementation
  • Hands-on work with large language models (LLMs) or AI/ML infrastructure
  • SRE mindset and experience
  • Experience defining and measuring SLIs/SLOs
  • Experience building monitoring and observability systems
  • Experience driving performance and reliability improvements
  • Experience designing fault-tolerant systems and automated testing strategies
  • Proficiency in at least one modern programming language (Python, Go, Java, C++)
  • Familiarity with Kubernetes or container orchestration platforms
  • Strong collaboration and communication skills
  • Ability to thrive in a fast-paced, mission-driven environment

Nice to have

  • Experience scaling inference or training workloads for LLMs

What the JD emphasized

  • strong background in distributed systems
  • hands-on experience with large language models
  • SRE mindset and experience
  • experience scaling inference or training workloads for LLMs

Other signals

  • operating managed AI services at scale
  • distributed AI pipelines and inference services
  • optimize large-scale training and inference clusters
  • performance tuning strategies for latency-sensitive AI services
  • reliability issues in distributed AI systems
  • next-generation distributed systems purpose-built for AI-first environments