Engineering Manager, Kernel Reliability

Cerebras · Semiconductors · Headquarters +2 · Performance

Cerebras Systems is seeking an Engineering Manager for its Kernel Reliability team. The role focuses on improving the reliability of the company's AI compute clusters and the inference, training, and internal production services that run on them. The manager will provide technical leadership, own the roadmap, and build tooling for failure analysis and diagnostics. The position requires expertise in software/hardware reliability, parallel and distributed programming, and debugging tools, along with experience leading engineering teams.

What you'd actually do

  1. Provide hands-on technical leadership, owning the technical vision and roadmap for the kernel-centric reliability of our internal and customer-facing systems
  2. Assist System and Cluster Operations teams in reducing system and service downtime after failures by providing tooling and manual intervention for failure analysis and diagnostics
  3. Work with the Debug Team to enhance debug tools with the goal of speeding up failure analysis
  4. Collaborate with SW teams to improve the software stack, including kernels, for better in-field debugging and failure analysis
  5. Work with the ASIC and HW architecture teams to co-design next-generation architectures with reliability and ease of debug in mind

Skills

Required

  • software engineering
  • leading teams in SW/HW reliability
  • debug
  • diagnostic
  • failure analysis
  • parallel and distributed programming
  • debug and diagnostic tool development or expert usage
  • debugging distributed and parallel applications
  • computer architectures
  • monitoring and reliability engineering
  • incident response
  • post-mortem analysis
  • recruit and retain high-performing teams
  • mentor engineers
  • partner cross-functionally

Nice to have

  • GPU
  • embedded

What the JD emphasized

  • deeply technical, hands-on engineering leader
  • improving the reliability of our advanced compute clusters and the underlying inference, training, and internal production services
  • set the technical vision while staying close to the code
  • proven expertise in software or hardware reliability, diagnostic tool building, or failure analysis and debugging
  • 6+ years in software engineering, with 3+ years leading teams in SW/HW reliability, debug, diagnostic, failure analysis or related fields
  • Expertise in parallel and distributed programming
  • debug and diagnostic tool development or expert usage
  • debugging distributed and parallel applications
  • deep understanding of computer architectures
  • Strong background in monitoring and reliability engineering