Staff Software Engineer, Network Health

Google · Big Tech · Sunnyvale, CA +1

This role focuses on engineering the network infrastructure that supports large-scale AI/ML training and serving at Google. The engineer will define the roadmap for repair automation, design agentic diagnostic systems using Generative AI, integrate new hardware, lead safety initiatives like anomaly detection, and mentor junior engineers. The role requires experience with ML infrastructure, generative AI tools, and software development.

What you'd actually do

  1. Define the long-term goal for repair automation of AI/ML infrastructure and pursue it through multiple parallel programs.
  2. Lead and participate in the design of agentic diagnostic systems that utilize Generative AI to automate diagnoses for next-gen networks.
  3. Work with platform teams to integrate new hardware platforms into the automation ecosystem, driving the qualification and repair workflows required for global fleet turn-up.
  4. Lead critical safety initiatives, such as automated anomaly detection, to protect fleet health and capacity.
  5. Mentor a team of junior and senior engineers and influence engineering practices across the broader infrastructure organization to drive consistency in automation and safety standards.

Skills

Required

  • software development
  • software products
  • software design and architecture
  • speech/audio
  • reinforcement learning
  • ML infrastructure
  • ML design
  • model deployment
  • model evaluation
  • data processing
  • debugging
  • fine tuning
  • generative AI tools
  • LLM interfaces

Nice to have

  • technical leadership
  • SQL pipelines
  • Plx scripts
  • generative AI agents
  • complex infrastructure projects
  • influence technical direction
  • repair infrastructure
  • network
  • machines

What the JD emphasized

  • 8 years of experience in software development.
  • 5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
  • 5 years of experience with one or more of the following: Speech/audio (e.g., technology duplicating and responding to the human voice), reinforcement learning (e.g., sequential decision making), ML infrastructure, or specialization in another ML field.
  • 5 years of experience with ML design and ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine tuning).
  • Experience integrating generative AI tools or LLM interfaces into workflows.

Other signals

  • AI/ML infrastructure
  • large-scale training
  • network infrastructure
  • agentic diagnostic systems
  • Generative AI
  • LLM interfaces
  • automated anomaly detection