ML Software Tool Development Engineer

Cerebras · Semiconductors · US and Canada Offices · Software

The ML Software Tool Development Engineer at Cerebras builds debugging, validation, and observability platforms for AI systems, spanning compilers, runtimes, and hardware interfaces. The role involves developing automated systems for anomaly detection and root-cause analysis, along with visualization tools, to support large-scale ML applications and inference.

What you'd actually do

  1. Lead the design and implementation of system-level debugging, validation, and observability platforms.
  2. Develop automated systems for collecting and analyzing numerical and execution anomalies.
  3. Create visualization and analysis tools to enable efficient root-cause investigation.
  4. Build frameworks for failure classification, regression detection, and anomaly monitoring.
  5. Extend compilers, runtimes, and programming interfaces to support advanced profiling and instrumentation.

Skills

Required

  • C++
  • Python
  • system-level debugging
  • validation
  • observability platforms
  • automated systems development
  • data analysis
  • visualization tools
  • failure classification
  • regression detection
  • anomaly monitoring
  • compiler internals
  • runtime development
  • programming interfaces
  • profiling
  • instrumentation
  • system bring-up
  • low-level debug
  • validation workflows
  • cross-functional collaboration
  • best practices for debuggability
  • reliability
  • operational excellence
  • incident response
  • corrective actions

Nice to have

  • machine learning training and inference pipelines
  • distributed training
  • large-model scaling
  • high-performance clusters
  • HPC systems
  • custom hardware/software co-design

What the JD emphasized

  • building reliable, high-performance systems and tooling
  • debugging complex hardware/software systems
  • analyzing system-level data structures, execution graphs, or dependency networks
  • design and build intuitive visualization and analysis tools
  • compiler internals
  • custom hardware interfaces
  • low-level protocol design
