Staff Software Engineer, Emerging On-prem AI Infrastructure

Google · Big Tech · Kirkland, WA +3

Staff Software Engineer focused on building and optimizing AI infrastructure, including large-scale training and inference workloads, AI clusters, and AI acceleration hardware/networking. The role involves driving technical roadmaps, ensuring end-to-end supportability, and implementing success metrics for the team.

What you'd actually do

  1. Drive project success by setting the technical goal and roadmap.
  2. Set priorities and projects for a team that delivers features in a changing environment for both internal customers (other engineering teams) and external customers.
  3. Take central responsibility for diagnosing and troubleshooting end-to-end supportability issues, uncovering and addressing technical problems, and building repair automation systems.
  4. Implement and govern the success metrics for the team, spanning Operational Plane metrics (e.g., Support case metrics, GSO case handling) and RMA/Spares metrics (e.g., swap and repair rate).

Skills

Required

  • C++
  • software design and architecture
  • large-scale infrastructure
  • distributed systems
  • networks
  • compute technologies
  • storage
  • hardware architecture

Nice to have

  • cloud infrastructure
  • systems level infrastructure
  • hardware and software stack
  • end-to-end diagnostics
  • troubleshooting
  • supportability
  • SWAT team efforts
  • complex issues
  • long term sustainable solutions
  • Service Level Objectives (SLOs)/metrics measurement
  • logs/telemetry/metrics integration
  • operator experience
  • low-level system software
  • OS
  • firmware
  • low level networking
  • hardware
  • building system skills
  • changing environment
  • navigate ambiguity
  • delivering solutions
  • subtle or complex technical problems

What the JD emphasized

  • 8 years of experience programming in C++
  • 5 years of experience testing and launching software products
  • 5 years of experience building and developing large-scale infrastructure, distributed systems or networks, or experience with compute technologies, storage, or hardware architecture
  • 3 years of experience with software design and architecture

Other signals

  • AI infrastructure
  • large-scale training and inference workloads
  • optimizing performance
  • building large AI clusters
  • AI acceleration
  • cluster interconnects and networking