Technical Lead, AI/ML Infrastructure

Google · Big Tech · Sunnyvale, CA +1

This role focuses on building and delivering specialized AI compute platforms and infrastructure at Google, ensuring high availability for massive-scale AI workloads. It involves designing and integrating custom compute topologies with schedulers like Kubernetes, establishing observability, and collaborating with hardware teams. The role requires extensive experience in large-scale infrastructure, distributed systems, and software development.

What you'd actually do

  1. Lead the design and end-to-end software delivery of specialized AI compute platforms, ensuring high availability for massive-scale workloads.
  2. Establish observability and hardening strategies for the lower-half software stack, including firmware and hardware qualification.
  3. Architect robust integration interfaces between custom compute topologies and industry-standard workload schedulers like Kubernetes and GKE.
  4. Architect secure boot and cryptographic remote attestation flows for distributed hardware platforms.
  5. Partner closely with hardware engineering and chip design teams to influence future architectures for large-scale deployment.

Skills

Required

  • C, C++, Go, or Python
  • testing and launching software products
  • building and developing large-scale infrastructure, distributed systems or networks, or experience with compute technologies, storage, or hardware architecture
  • designing, building, and operating large-scale distributed systems, high-performance networking stacks, or operating system internals

Nice to have

  • Master’s degree or PhD in Engineering, Computer Science, or a related technical field
  • data structures/algorithms
  • technical leadership role leading project teams and setting technical direction
  • working in a complex, matrixed organization involving cross-functional or cross-business projects
  • lower-half server architectures, hardware-adjacent orchestration, and low-level security implementations
  • Kubernetes and Google-internal cluster systems
  • building telemetry pipelines and monitoring systems for distributed hardware

What the JD emphasized

  • large-scale infrastructure
  • distributed systems
  • high-performance networking stacks
  • operating system internals
  • lower-half server architectures
  • hardware-adjacent orchestration
  • low-level security implementations