Principal Engineer, AI Inference Reliability

Cerebras · Semiconductors · US and Canada Offices · Software

This Principal Engineer role focuses on the reliability of Cerebras' AI inference services, which run on the company's large-scale AI chips for high-speed inference. The role involves defining reliability strategy, implementing fault-tolerance mechanisms, leading incident management, and developing reliability tooling for distributed systems.

What you'd actually do

  1. Define and drive reliability strategy: establish SLOs and ensure alignment across engineering.
  2. Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers.
  3. Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents.
  4. Architect for reliability and observability: influence system design for redundancy, durability, and debuggability.
  5. Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection.

Skills

Required

  • backend, infrastructure, or reliability engineering for large-scale distributed systems
  • Python, C++, Go, or Rust
  • reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture
  • communication and cross-functional leadership skills

Nice to have

  • building large-scale AI infrastructure systems

What the JD emphasized

  • world-class reliability standards
  • low-latency
  • high-reliability distributed systems
  • reliability principles
  • SLO/SLI/SLA design
  • incident response
  • postmortem culture

Other signals

  • AI inference reliability
  • low-latency
  • high-reliability distributed systems
  • SLOs
  • incident management