Senior ML Software Engineer - Integration & Quality

Cerebras · Semiconductors · Headquarters +2 · Software

Senior ML Software Engineer focused on integrating and validating the software stack for the Cerebras AI platform, ensuring reliable and efficient execution of large-scale ML workloads. This role involves debugging complex distributed systems, improving automation, and enhancing the reliability of AI infrastructure, working closely with runtime, compiler, kernel, and hardware teams.

What you'd actually do

  1. Integrate and validate software components across the Cerebras AI platform.
  2. Collaborate with engineers across ML runtime, compiler, kernel, and hardware teams to ensure reliable feature integration.
  3. Investigate and debug complex issues across distributed systems and large-scale ML workloads.
  4. Build automation tools and infrastructure to support integration testing, system validation, and debugging workflows.
  5. Develop and maintain testbeds used to validate system performance and reliability.

Skills

Required

  • Python
  • C++
  • Go
  • systems-level development
  • infrastructure tooling
  • platform integration
  • automation tools
  • testing frameworks
  • internal developer tooling
  • problem-solving
  • collaboration

Nice to have

  • machine learning infrastructure
  • ML model deployment
  • LLM
  • multimodal model workloads
  • distributed systems
  • cloud infrastructure
  • large-scale compute clusters
  • performance debugging
  • profiling
  • system observability tools
  • microservices
  • containerized environments
  • cluster orchestration
  • hardware accelerators
  • compilers
  • ML frameworks

What the JD emphasized

  • debug complex issues
  • improve automation
  • strengthen the reliability

Other signals

  • integrating and validating the software stack that powers the Cerebras AI platform
  • ensuring large-scale ML workloads run reliably and efficiently
  • strengthen the reliability of our AI infrastructure