Tech Lead Manager, Kubernetes AI Infrastructure

Google · Big Tech · Kirkland, WA +2

Tech Lead Manager for Kubernetes AI Infrastructure, responsible for building and managing reliable and scalable TPU orchestration with Kubernetes to enable customers to run GenAI workloads at scale on-premises. The role involves technical leadership, people management, and collaboration with AI labs to influence the infrastructure roadmap.

What you'd actually do

  1. Design, guide, and vet system designs within the scope of the broader area, and write system development code to solve ambiguous problems.
  2. Design, develop, and maintain Kubernetes-based systems to manage large-scale TPU infrastructure for on-premises and hybrid environments.
  3. Oversee a team of software engineers specializing in distributed systems and AI infra, while fostering a high-performance and collaborative team environment.
  4. Collaborate with major frontier AI Labs to influence the AI infrastructure roadmap and promote the use of TPUs for advanced ML workloads.
  5. Work with cross-functional partners to deploy new infrastructure management tools that enhance Kubernetes' ability to handle large-scale GenAI tasks.

Skills

Required

  • software development in one or more programming languages (e.g., Python, C, C++, Java, JavaScript)
  • technical leadership role
  • people management or team leadership role
  • designing and implementing large-scale distributed systems

Nice to have

  • deep learning frameworks (e.g., PyTorch, JAX)
  • LLM applications and tools (e.g., SLURM, Ray)
  • machine learning infra (e.g., GPU, Cloud TPU, etc.)

What the JD emphasized

  • Kubernetes-based systems to manage large-scale TPU infrastructure
  • large-scale GenAI tasks
  • machine learning infra (e.g., GPU, Cloud TPU, etc.)

Other signals

  • building reliable and scalable TPU orchestration with Kubernetes
  • enable customers to effectively and reliably run their GenAI workloads by leveraging Google TPU infra at scale
  • on-premises AI supercomputer experience with K8s
  • machine learning infra (e.g., GPU, Cloud TPU, etc.)