Performance & Reliability Engineer

Cerebras · Semiconductors · Toronto, ON · Software

The role focuses on characterizing and optimizing the performance and reliability of advanced ML models running on Cerebras' AI hardware, with an emphasis on reducing power and thermal fluctuations. The engineer will analyze ML workloads, software kernels, and hardware architecture to improve system-level performance and reliability, influencing next-generation AI architecture design.

What you'd actually do

  1. Characterize and enhance the performance and reliability of advanced ML hardware/software systems, with emphasis on reducing power and thermal fluctuations.
  2. Analyze ML workloads, software kernels, and hardware architecture for power and performance impacts, and synthesize high-level insights across these layers.
  3. Develop creative software solutions to improve reliability and performance, collaborating cross-functionally to deploy these solutions in production.
  4. Influence the design of Cerebras' next-generation AI architecture and software stack through rigorous workload analysis and computational efficiency optimization.
  5. Partner with ML engineers, researchers, and reliability specialists to understand model behavior and drive system-level improvements from a software perspective.

Skills

Required

  • Python
  • C/C++
  • assembly programming
  • system-level performance and reliability optimization

Nice to have

  • ML models
  • ML frameworks
  • collective communication
  • thermal management principles
  • power delivery for advanced semiconductors

What the JD emphasized

  • performance engineering
  • reliability
  • computer architecture
  • software design
  • system-level performance and reliability optimization

Other signals

  • performance optimization
  • reliability engineering
  • ML hardware/software systems
  • AI chip
  • inference speeds