Senior Software Engineer, Machine Health

Google · Big Tech · Sunnyvale, CA +1

This role focuses on automating the detection, mitigation, and repair of hardware in data centers, ensuring the reliable and cost-effective operation of machines (ML, Compute, Storage) for Google's internal and external customers. The responsibilities include managing the entire machine lifecycle, from deployment to decommissioning, and supporting repair workflows for various hardware components.

What you'd actually do

  1. Manage the machine lifecycle from the moment a machine enters the data center floor.
  2. Own and manage the workflows that get machines serving customer needs: turning them up, mitigating issues, repairing and upgrading them when needed, and decommissioning them at end-of-life.
  3. Manage machines through large distributed systems, reliably and safely, and ensure ML, compute, and storage capacity for all Google products and cloud customers.
  4. Own and support workflows that provide a repair platform for machines, networking, and power/cooling devices on the data center floor.

Skills

Required

  • software development
  • programming
  • C++
  • large-scale distributed systems
  • data analysis

Nice to have

  • Python
  • databases
  • SQL
  • data visualization
  • hardware