Prognostics & Health Monitoring Engineer

Cerebras · Semiconductors · Headquarters +1 · Quality and Reliability

This role builds a prognostics and health monitoring (PHM) capability for Cerebras' AI hardware and systems. The engineer will develop frameworks to monitor, assess, and predict hardware health, turning telemetry data into actionable insights. The goal is early detection of degradation and proactive failure prediction, so that systems stay available and performant. The work combines reliability engineering, data science, and system software integration.

What you'd actually do

  1. Define the vision, architecture, and roadmap for PHM across deployed systems
  2. Design and scale frameworks for health assessment, anomaly detection, and predictive failure modeling
  3. Develop and productionize probabilistic models for failure risk, degradation, and remaining useful life
  4. Analyze large-scale telemetry, logs, and service data to identify systemic drivers of failures and disruptions
  5. Establish health metrics, scoring systems, and fleet-level observability to communicate system risk

Skills

Required

  • Bachelor’s or Master’s in Engineering, Computer Science, Data Science, or a related field
  • 8+ years in reliability engineering, data science, fleet analytics, or similar
  • Strong Python and SQL for large-scale data analysis and modeling
  • Experience building and deploying predictive models in production
  • Expertise in applied statistics and probabilistic modeling (e.g., survival analysis, hazard models, Bayesian methods)
  • Experience with large-scale telemetry or distributed system datasets
  • Proven ability to define ambiguous problems and deliver scalable solutions

Nice to have

  • Experience with HPC systems, AI infrastructure, or datacenter environments
  • Background in PHM, predictive maintenance, or reliability analytics at scale
  • Familiarity with RUL estimation and degradation modeling
  • Understanding of observability systems, telemetry pipelines, and real-time monitoring
  • Background in hardware reliability and failure modes in complex systems

What the JD emphasized

  • 8+ years in reliability engineering, data science, fleet analytics, or similar
  • Proven ability to define ambiguous problems and deliver scalable solutions
  • Experience building and deploying predictive models in production
  • Expertise in applied statistics and probabilistic modeling

Other signals

  • AI chip
  • AI compute power
  • AI applications
  • Generative AI inference
  • agentic computation