Senior Software Engineer, Kernels and Performance, Core ML Frameworks

Google · Sunnyvale, CA +1

Senior Software Engineer focused on optimizing ML kernels and infrastructure for TPUs and GPUs, impacting both training and inference performance for Google's AI models and cloud customers. This role involves deep technical work in low-level kernel languages and collaboration with ML researchers and framework developers.

What you'd actually do

  1. Design and optimize high-performance kernels (using languages like Pallas, Mosaic, and Triton) targeting Tensor Processing Unit (TPU) and Graphics Processing Unit (GPU) architectures for critical Machine Learning (ML) operations, from massive training runs to high-speed inference.
  2. Architect infrastructure such as benchmarking suites, autotuning frameworks, performance analysis tools, regression testing, and documentation, shaping how the developer community adopts increasingly critical custom kernels in key Open-Source Software (OSS) libraries.
  3. Track the latest advancements in hardware architectures, compiler technologies, and AI models to identify new opportunities for performance optimization through custom kernels.
  4. Engage with ML researchers, framework developers (JAX, PyTorch), and compiler engineers (Accelerated Linear Algebra, or XLA) to drive adoption, gather new requirements, and resolve bottlenecks with appropriate solutions.
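To make the first responsibility concrete, here is a minimal sketch of a custom kernel written with JAX's Pallas API. The kernel itself (`add_kernel`, a trivial elementwise add) and the use of interpreter mode are illustrative choices for portability, not something the posting specifies:

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # Refs point at blocks of the inputs/output in fast memory;
    # this kernel reads both blocks and writes their elementwise sum.
    o_ref[...] = x_ref[...] + y_ref[...]

def add(x, y):
    # interpret=True runs the kernel on CPU, which is handy for
    # testing without a TPU/GPU; drop it to compile for hardware.
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        interpret=True,
    )(x, y)

x = jnp.arange(8.0)
y = jnp.full(8, 2.0)
print(add(x, y))  # elementwise sum of the two arrays
```

Real kernels for attention or MoE layers add tiling, a grid, and explicit memory-space annotations on top of this same `pallas_call` skeleton.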

Skills

Required

  • software development in C++ or Python
  • testing, maintaining, or launching software products
  • software design and architecture
  • performance optimization

Nice to have

  • optimizing TPU/GPU code
  • low-level kernel languages like Pallas, Compute Unified Device Architecture (CUDA), or Triton
  • ML Frameworks (JAX/PyTorch)
  • common operations like attention and Mixture of Experts (MoE)
  • model optimization and low-precision formats
  • modern accelerators (e.g., data movement, pipelining, heterogeneous compute, and scale-out)
  • compiler principles (optimization, code generation)
  • toolchains such as MLIR, OpenXLA
  • building developer infrastructure
  • Open-Source Software (OSS) libraries
  • flexible high-performance APIs
  • easy-to-consume documentation
  • investigative and problem-solving capabilities
  • communication skills across cross-functional teams
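For the "common operations" bullet above, scaled dot-product attention is the canonical example. A minimal NumPy reference (shapes and names here are illustrative, not from the posting) looks like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k: (seq, d_k); v: (seq, d_v)
    # Scores are scaled by sqrt(d_k) before the softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 16))
out = attention(q, k, v)
print(out.shape)  # (4, 16)
```

Kernel work on this operation is largely about fusing the matmul, softmax, and second matmul so the (seq, seq) score matrix never round-trips through slow memory.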

What the JD emphasized

  • performance optimization
  • TPU/GPU code
  • ML Frameworks
  • compiler principles

Other signals

  • optimize kernels for TPUs and GPUs
  • ML operations
  • training and inference performance
  • developer infrastructure