Software Engineer III, Infrastructure, Cloud AI

Google · Big Tech · Sunnyvale, CA +1

Software Engineer III, Infrastructure, Cloud AI role at Google, working to improve the XLA compiler stack for ML models on TPU, GPU, and CPU hardware. The role focuses on enhancing compiler stability and usability and on productionizing the compiler's integration with ML frameworks, in support of internal ML teams and Google Cloud offerings.

What you'd actually do

  1. Write and test product or system development code.
  2. Understand how accelerator compilers and runtimes interact at a high level.
  3. Develop and apply metrics to understand the problem you are solving and gauge status/success as needed.
  4. Close infrastructure (infra) gaps to help with ML stack maturation (e.g., reduce the number of ways something is done, improve reproducibility, improve tooling, improve usability).
  5. Participate in design reviews with peers and stakeholders to decide amongst available technologies.

Skills

Required

  • software development in C++
  • developing large-scale infrastructure, distributed systems or networks
  • compute technologies, storage or hardware architecture
  • testing, maintaining, or launching software products
  • software design and architecture

Nice to have

  • machine learning model training and serving
  • C++ development
  • working across or understanding different parts of the software stack (e.g., ML frameworks, compilers, ML runtimes, or systems)
  • compiler technology
  • ML runtime systems
  • low-level software optimization

What the JD emphasized

  • production serving
  • ML stack maturation
  • productionizing the integration of the XLA compiler and ML frameworks
  • critical for running machine learning models efficiently
  • standardize compiler interfaces and integration
  • improve stability
  • ensure model consistency between development and production
  • support most ML teams within Google
  • power Google Cloud's ML offerings
  • AI and Infrastructure team is redefining what’s possible
  • empower Google customers with breakthrough capabilities and insights
  • delivering AI and Infrastructure at unparalleled scale, efficiency, reliability and velocity
  • shaping the future of world-leading hyperscale computing
  • development of our TPUs
  • Vertex AI for Google Cloud
  • Google Global Networking
  • Data Center operations
  • systems research
  • low-level software optimization

Other signals

  • productionizing ML frameworks
  • ML stack maturation
  • Google Cloud's ML offerings