Software Engineer, ML Fleet Intelligence

Google · Big Tech · Sunnyvale, CA +1

Software Engineer role focused on applying AI/ML to optimize the fault tolerance and reliability of Google's global data center fleet, including ML TPUs. The role involves designing and implementing ML models for fault prediction and mitigation, analyzing large-scale telemetry data, and building scalable automated systems. It requires experience in ML infrastructure, model deployment, and data processing, with a focus on improving the reliability of AI/ML systems and traditional compute infrastructure.

What you'd actually do

  1. Lead the design and implementation of solutions in specialized ML areas, optimize ML infrastructure, and guide the development of model optimization and data processing strategies.
  2. Design and implement AI/ML models to predict, detect, and mitigate hardware and software faults across a global fleet.
  3. Analyze petabytes of telemetry and performance data to uncover insights that improve the reliability of ML TPUs and traditional compute infrastructure.
  4. Build scalable automated systems that allow Google’s data center footprint to grow while maintaining industry-leading uptime.
  5. Partner with hardware designers and site reliability engineers (SREs) to integrate intelligent diagnostics into the core data center lifecycle.

Skills

Required

  • Software development
  • Software product testing and launch
  • Software design and architecture
  • Speech/audio technology
  • Reinforcement learning
  • Machine learning (ML) infrastructure
  • ML design
  • Model deployment
  • Model evaluation
  • Data processing
  • Debugging
  • Fine-tuning
  • Data structures
  • Algorithms

Nice to have

  • Predictive maintenance
  • Anomaly detection
  • Systems reliability engineering
  • Technical leadership
  • Cross-functional project experience

What the JD emphasized

  • 5 years of experience with one or more of the following: speech/audio (e.g., technology duplicating and responding to the human voice), reinforcement learning (e.g., sequential decision making), machine learning (ML) infrastructure, or specialization in another ML field.
  • 5 years of experience with ML design and ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine-tuning).

Other signals

  • AI/ML for infrastructure optimization
  • Predictive maintenance
  • Anomaly detection
  • Large-scale data analysis
  • Scalable automated systems