Software Engineer III, ML Networking

Google · Big Tech · Sunnyvale, CA +1

Software Engineer III, ML Networking at Google, focusing on building and optimizing the infrastructure for scaling Large Language Models (LLMs) and other Machine Learning (ML) workloads on GPUs. The role involves analyzing and resolving networking issues, optimizing performance for ML workloads, and contributing to the full-stack optimization of ML networking on Google's infrastructure.

What you'd actually do

  1. Analyze the networking issues associated with the next generations of GPU hardware, and design, build, and deploy whatever is needed to make them work optimally in our data centers.
  2. Achieve optimal workload performance across the NVIDIA Collective Communications Library (NCCL) and Graphics Processing Unit (GPU) stack.
  3. Compile a comprehensive analysis of performance across different GPU and network generations.
  4. Determine how customers' ML models will evolve once we have 72-node NVLink domains.
  5. Execute full-stack optimization for ML networking performance on Google's infrastructure. This spans a wide range, from kernel optimization to user-space communication libraries.

Skills

Required

  • software development
  • large-scale infrastructure
  • distributed systems
  • networks
  • compute technologies
  • storage
  • hardware architecture
  • networking

Nice to have

  • data structures
  • algorithms

What the JD emphasized

  • ML networking performance

Other signals

  • LLM infrastructure
  • GPU scaling
  • networking optimization