Performance & Reliability Engineer

Cerebras · Semiconductors · Headquarters +1 · Performance

The Performance & Reliability Engineer will characterize and optimize the performance and reliability of advanced ML hardware/software systems, focusing on reducing power and thermal fluctuations. This role involves analyzing ML workloads, software kernels, and hardware architecture, developing software solutions for reliability and performance, and influencing next-generation AI architecture design.

What you'd actually do

  1. Characterize and enhance the performance and reliability of advanced ML hardware/software systems, with emphasis on reducing power and thermal fluctuations.
  2. Analyze ML workloads, software kernels, and hardware architecture for power and performance impacts, and synthesize high-level insights across these layers.
  3. Develop creative software solutions to improve reliability and performance, collaborating cross-functionally to deploy these solutions in production.
  4. Influence the design of Cerebras' next-generation AI architecture and software stack through rigorous workload analysis and computational efficiency optimization.
  5. Partner with ML engineers, researchers, and reliability specialists to understand model behavior and drive system-level improvements from a software perspective.

Skills

Required

  • Python
  • C/C++
  • assembly programming
  • system-level performance and reliability optimization
  • BS, MS, or PhD in Computer Science, Electrical Engineering, or a related field
  • 3+ years of relevant experience in performance engineering, reliability, computer architecture, and/or software design

Nice to have

  • Hands-on experience with ML models, ML frameworks, and collective communication
  • Understanding of thermal management principles and power delivery for advanced semiconductors

What the JD emphasized

  • performance and reliability
  • ML workloads
  • AI architecture
  • computational efficiency

Other signals

  • performance optimization
  • reliability engineering
  • AI hardware/software systems