Principal Engineer, AI Inference Reliability

Cerebras · Semiconductors · Remote · AI Cloud

This Principal Engineer role at Cerebras focuses on ensuring the reliability, performance, and security of its large-scale AI inference services built on wafer-scale architecture. The role involves defining reliability strategy, implementing mechanisms for fault tolerance, leading incident management, and collaborating across engineering teams to meet world-class reliability standards.

What you'd actually do

  1. Define and drive reliability strategy: establish SLOs and ensure alignment across engineering.
  2. Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers.
  3. Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents.
  4. Architect for reliability and observability: influence system design for redundancy, durability, and debuggability.
  5. Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection.
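The SLO work in item 1 ultimately reduces to error-budget arithmetic. A minimal sketch, assuming an illustrative 99.9% availability target over a 30-day window (these numbers are assumptions for illustration, not figures from the posting):

```python
# Hypothetical sketch: converting an availability SLO into an error budget.
# The 99.9% target and 30-day window are illustrative assumptions.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

def budget_remaining(slo: float, observed_downtime_min: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative = blown)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - observed_downtime_min) / budget

# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(budget_remaining(0.999, 10.0), 3))  # 0.769
```

Tracking burn rate against a budget like this is what turns an SLO from a slogan into an operational signal for incident response and release decisions.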

Skills

Required

  • 7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems.
  • Strong programming skills in at least one backend language such as Python, C++, Go, or Rust.
  • Deep, hard-earned experience with reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture.
  • Excellent communication and cross-functional leadership skills.

Nice to have

  • Prior experience building large-scale AI infrastructure systems.

What the JD emphasized

  • world's most performant, secure, and reliable enterprise-grade AI service
  • unprecedented speed and efficiency
  • scale inference and accelerate AI
  • hands-on Reliability Tech Lead (IC) to own the mission of making Cerebras Inference the most reliable AI service in the world
  • define SLOs and incident-response frameworks
  • design and implement reliability mechanisms at scale
  • partner across hundreds of engineers to ensure our service meets world-class reliability standards
  • building and operating massive-scale, low-latency, high-reliability distributed systems
  • Deep and hard-earned experience of reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture.

Other signals

  • AI inference reliability
  • large-scale distributed systems
  • low-latency
  • high-reliability