Software Engineer, ML Fleet Intelligence

Google · Big Tech · Sunnyvale, CA +1

Software Engineer, ML Fleet Intelligence at Google, focused on applying AI/ML to predict, detect, and mitigate hardware and software faults across a global fleet of data centers and ML TPUs. This role involves leveraging large-scale data, optimizing ML infrastructure, and building automated systems to ensure reliability and uptime.

What you'd actually do

  1. Lead the design and implementation of solutions in specialized ML areas, optimize ML infrastructure, and guide the development of model optimization and data processing strategies.
  2. Design and implement AI/ML models to predict, detect, and mitigate hardware and software faults across a global fleet.
  3. Analyze petabytes of telemetry and performance data to uncover insights that improve the reliability of ML TPUs and traditional compute infrastructure.
  4. Build scalable automated systems that allow Google’s data center footprint to grow while maintaining industry-leading uptime.
  5. Partner with hardware designers and site reliability engineers (SREs) to integrate intelligent diagnostics into the core data center lifecycle.

Skills

Required

  • Software development
  • Software testing
  • Software product launch
  • Software design
  • Software architecture
  • Speech/audio processing
  • Reinforcement learning
  • Machine learning infrastructure
  • ML model deployment
  • ML model evaluation
  • Data processing
  • Debugging
  • Fine tuning

Nice to have

  • Data structures
  • Algorithms
  • Technical leadership
  • Cross-functional project management
  • Predictive maintenance
  • Anomaly detection
  • Systems reliability engineering
  • Translating technical findings into business strategies

What the JD emphasized

  • 8 years of experience in software development
  • 5 years of experience testing and launching software products, and 3 years of experience with software design and architecture
  • 5 years of experience with one or more of the following: speech/audio (e.g., technology duplicating and responding to the human voice), reinforcement learning (e.g., sequential decision making), machine learning (ML) infrastructure, or specialization in another ML field
  • 5 years of experience with ML design and ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine tuning)

Other signals

  • AI/ML for infrastructure fault prediction and mitigation
  • Leveraging petabytes of operational and telemetry data
  • Optimizing ML infrastructure and models
  • Building scalable automated systems for data center growth and uptime