Senior Software Engineer, TPU, AI Infrastructure

Google · Big Tech · Taipei, Taiwan

Senior Software Engineer role focused on developing firmware and low-level software for Google's custom AI accelerators (TPUs). The role involves designing, building, and debugging hardware-software interfaces, firmware, simulators, and telemetry systems for ASICs, working closely with chip design and ML infrastructure teams to enable large-scale AI model deployment and operation.

What you'd actually do

  1. Design and build firmware that runs on embedded microcontrollers with limited memory footprints on the accelerator Application-Specific Integrated Circuits (ASICs).
  2. Co-design the hardware/software interface with the hardware design and development teams.
  3. Design and develop tools to update and debug ASIC firmware, and enable chip bring-up and hardware debugging.
  4. Build functional or cycle-level simulators that bit-accurately model the custom accelerator ASICs, build tools and infrastructure to help ASIC design verification, tapeout, and bring-up, and develop embedded CPU simulators as part of the full system simulator.
  5. Architect and design debuggability mechanisms and telemetry collection systems to monitor Tensor Processing Units (TPUs), enhancing customer satisfaction and enabling rapid response, diagnosis, and mitigation of production failures.

Skills

Required

  • software development in C++
  • embedded operating systems
  • software design and architecture
  • testing, maintaining, or launching software products

Nice to have

  • embedded software development in C/C++
  • machine learning (ML)
  • hardware/software co-design at the chip-level
  • architecting scalable software
  • multi-threaded designs

What the JD emphasized

  • TPUs for machine learning
  • custom accelerators (ASICs)
  • debug ASIC firmware
  • hardware debugging
  • ASIC design verification
  • tapeout
  • bring-up
  • monitor Tensor Processing Units (TPUs)
  • production failures

Other signals

  • Develop C++ code that controls and monitors Google’s custom accelerators (ASICs), including TPUs for machine learning
  • Define the API that the rest of the software stack uses to build deployments of the systems that use these ASICs
  • Debug and bring up new ASICs
  • Work closely with external vendors and many internal teams including chip design, system software, ML supercomputer, compiler, and system test