Senior Deep Learning Compiler Engineer - XLA

NVIDIA · Semiconductors · Santa Clara, CA +5 · Remote

Senior Deep Learning Compiler Engineer focused on optimizing inference and training performance for JAX and OpenXLA on NVIDIA GPUs. The role develops compiler optimization algorithms, graph partitioning and tensor sharding strategies, and GPU code generation using MLIR, LLVM, and Triton.

What you'd actually do

  1. Crafting and implementing compiler optimization techniques for deep learning network graphs.
  2. Designing novel graph partitioning and tensor sharding techniques for distributed training and inference.
  3. Tuning and analyzing performance.
  4. Generating code for NVIDIA GPU backends using open-source compilers such as MLIR, LLVM, and OpenAI Triton.
  5. Designing user-facing features in JAX and related libraries, along with general software engineering work.
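To make the graph-partitioning responsibility above concrete, here is a minimal, hypothetical sketch (not NVIDIA's actual algorithm): a greedy two-way partition of a small operator graph that balances estimated per-device compute and then counts the tensors cut across the device boundary. All op names and costs below are illustrative assumptions.

```python
# Hypothetical sketch of greedy two-way graph partitioning for distributed
# inference. Illustrative only; production partitioners (e.g. XLA's SPMD
# partitioner) use much richer cost models and search strategies.

def partition(ops, edges, cost):
    """Greedily assign each op to the currently less-loaded of two devices.

    ops:   op names in topological order
    edges: (producer, consumer) pairs of the dataflow graph
    cost:  estimated compute cost per op
    Returns (assignment dict, number of cross-device edges).
    """
    load = [0, 0]          # accumulated compute cost per device
    assign = {}
    for op in ops:
        dev = 0 if load[0] <= load[1] else 1   # pick the less-loaded device
        assign[op] = dev
        load[dev] += cost[op]
    # Each edge whose endpoints land on different devices implies a transfer.
    cut = sum(1 for a, b in edges if assign[a] != assign[b])
    return assign, cut

# Toy five-op model with made-up costs (assumptions, not real measurements).
ops = ["embed", "matmul1", "relu", "matmul2", "softmax"]
edges = [("embed", "matmul1"), ("matmul1", "relu"),
         ("relu", "matmul2"), ("matmul2", "softmax")]
cost = {"embed": 4, "matmul1": 8, "relu": 1, "matmul2": 8, "softmax": 2}

assign, cut = partition(ops, edges, cost)
```

A real partitioner would weigh the cut edges by tensor size and trade communication volume against load balance; this sketch only shows the shape of the problem.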

Skills

Required

  • 4+ years of relevant work or research experience in performance analysis and compiler optimizations.
  • Ability to work independently, define project goals and scope, and lead your own development effort adopting clean software engineering and testing practices.
  • Excellent C/C++ programming and software design skills, including debugging, performance analysis, and test design.
  • Strong foundation in the architecture of CPUs, GPUs, or other high-performance hardware accelerators. Knowledge of high-performance computing and distributed programming.
  • Strong interpersonal skills and the ability to work in a dynamic, product-oriented team.

Nice to have

  • CUDA or OpenCL programming experience; extensive experience with CUDA or with GPUs in general is a strong plus.
  • Experience with open-source compilers such as XLA, TVM, MLIR, LLVM, or OpenAI Triton is a huge plus.
  • Experience with deep learning models and algorithms, and with deep learning framework design.
  • Experience working with deep learning frameworks such as JAX, PyTorch, or TensorFlow.

What the JD emphasized

  • compiler optimization
  • deep learning workloads
  • distributed training and inference
  • performance analysis
  • C/C++ programming
  • high performance hardware accelerators
  • deep learning models and algorithms
  • deep learning framework design

Other signals

  • optimize inference and training performance
  • deep learning workloads
  • NVIDIA GPUs at scale
  • compiler optimization techniques
  • distributed training and inference
  • code-generation for NVIDIA GPU backends