Performance Engineer, GPU

Anthropic · AI Frontier · San Francisco, CA · AI Research & Engineering

This role focuses on GPU performance optimization and systems engineering for large language models, specifically improving utilization and efficiency for inference and training at scale. It involves deep work in GPU programming, custom kernel development, and distributed systems.

What you'd actually do

  1. Architect and implement the foundational systems that power Claude and push the frontiers of what's possible with large language models.
  2. Maximize GPU utilization and performance at unprecedented scale, developing cutting-edge optimizations that directly enable new model capabilities and dramatically improve inference efficiency.
  3. Implement state-of-the-art techniques from custom kernel development to distributed system architectures.
  4. Work across the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.
  5. Deliver transformative GPU performance improvements in production ML systems.

Skills

Required

  • GPU programming
  • optimization at scale
  • CUDA
  • Triton
  • CUTLASS
  • Flash Attention
  • tensor core optimization
  • PyTorch/JAX internals
  • torch.compile
  • XLA
  • custom operators
  • kernel fusion
  • memory bandwidth optimization
  • profiling
  • NCCL
  • NVLink
  • collective communication
  • model parallelism
  • INT8/FP8 quantization
  • mixed-precision techniques
  • large-scale training infrastructure
  • fault tolerance
  • cluster orchestration
  • inference pipelines
  • serving infrastructure

Nice to have

  • hardware interfaces
  • high-level ML frameworks
  • pair programming
  • comfort working in ambiguous environments

What the JD emphasized

  • deep experience with GPU programming and optimization at scale
  • delivering transformative GPU performance improvements in production ML systems

Other signals

  • GPU performance optimization
  • large language models
  • inference efficiency
  • distributed systems
  • custom kernel development