Software Engineer, TPU Software Systems, Cloud

Google · Big Tech · Sunnyvale, CA +1

Software Engineer role focused on designing, developing, and maintaining the software infrastructure for Google's TPU supercomputers. This infrastructure supports massive-scale machine learning applications and enables efficient execution of data parallelism algorithms.

What you'd actually do

  1. Design and maintain TPU supercomputer software across multiple stack layers, ranging from daemons on host machines to network routing rules embedded directly into the TPUs.
  2. Develop and manage control software on specialized machines and distributed infrastructure to support the operation of massive collections of networked hardware.
  3. Implement robust systems to monitor, deploy, qualify, and service supercomputing systems, ensuring they remain reliable and performant at scale.
  4. Engineer software solutions for the reliable scale-out and scale-up of accelerators, specifically tailored to meet the needs of massive-scale machine learning applications.
  5. Architect and build software to optimally interconnect TPUs, enabling efficient execution of data parallelism algorithms like ring all-reduce.

Skills

Required

  • software development
  • large-scale infrastructure
  • distributed systems
  • networks
  • compute technologies
  • storage
  • hardware architecture

Nice to have

  • data structures
  • algorithms
  • Generative AI models
  • frameworks
  • APIs
  • machine learning models
  • infrastructure
  • production environments
  • network infrastructure

What the JD emphasized

  • massive-scale machine learning applications

Other signals

  • TPU supercomputer software
  • data parallelism algorithms