Software Engineer, Kernel Reliability

Cerebras · Semiconductors · Headquarters +2 · Software

Cerebras is hiring a software engineer for the Kernel Reliability team. The role focuses on improving the reliability of Cerebras' AI compute clusters and of its inference, training, and internal production services. The work is hands-on: staying close to the code, designing solutions that scale, and debugging complex issues.

What you'd actually do

  1. Contribute to the technical roadmap and execution for kernel-centric reliability of our internal and customer-facing systems.
  2. Partner with System and Cluster Operations teams to reduce system and service downtime after failure through tooling, analysis, and hands-on debugging support.
  3. Work with the Debug Team to enhance debug tools and speed up failure analysis.
  4. Collaborate with software teams to make the software stack, including kernels, easier to debug and analyze for failures in the field.
  5. Work with ASIC and hardware architecture teams to co-design next-generation architectures with reliability and ease of debug in mind.

Skills

Required

  • C/C++
  • Python
  • operating systems
  • computer architecture
  • systems programming fundamentals
  • debugging complex issues using logs, traces, and standard debugging workflows
  • root-cause analysis

Nice to have

  • parallel and distributed programming
  • message passing
  • multicore
  • GPU
  • embedded
  • debug/diagnostic tools
  • debuggers
  • core dump handling
  • tracing
  • sanitizers
  • profilers
  • debugging distributed and parallel applications
  • deadlocks
  • livelocks
  • race conditions
  • instruction pipelining
  • multithreading
  • networking
  • memory systems
  • monitoring
  • incident response
  • post-mortem culture

What the JD emphasized

  • deeply technical
  • hands-on software engineer
  • critical challenge
  • improving the reliability
  • advanced compute clusters
  • inference, training, and internal production services
  • work close to the code
  • design solutions that will scale
  • rapidly growing system production and software service offerings
  • strong fundamentals in systems, debugging, and failure analysis
  • building tools and solving hard reliability problems