Staff Software Engineer, AI Runtime

Databricks · Data AI · Mountain View, CA · Engineering

Staff Software Engineer for Databricks' AI Runtime (AIR) platform, focusing on building and scaling systems for large-scale GPU training and fine-tuning. The role involves driving architecture, performance, resilience, and developer experience for multi-node training jobs across thousands of GPUs, supporting both fine-tuning and pre-training of foundation models.

What you'd actually do

  1. Drive the architecture and evolution of AIR's managed GPU training platform, delivering scalable, high-throughput, and resilient training across fleets that span thousands of accelerators.
  2. Solve the hardest problems in large-scale training, including multi-node orchestration, distributed parallelism strategies, GPU scheduling and dynamic routing, high-throughput data loading, and checkpoint and restore for very long-running jobs.
  3. Push GPU efficiency and training performance, raising utilization (such as model FLOPs utilization and end-to-end throughput) and lowering cost per training run across diverse model architectures and hardware generations.
  4. Build the resilience and observability foundations that keep multi-node jobs healthy, detecting and recovering from hardware and software failures with minimal disruption to customers.
  5. Partner with product, research, and platform teams to shape the APIs, CLI, and developer experience that make it easy to launch, monitor, and debug production training jobs.

Skills

Required

  • building and operating large-scale distributed systems
  • GPU training infrastructure
  • high-performance computing
  • ML systems
  • distributed training frameworks (PyTorch, FSDP, DeepSpeed, or Megatron)
  • parallelism strategies (data, tensor, pipeline, and sequence parallelism)
  • training resilience patterns (checkpointing, failure detection, automatic recovery)
  • GPU performance fundamentals (accelerator architecture, high-speed interconnects, collective communication)
  • building and operating managed, multi-tenant platform products in the cloud
  • algorithms
  • data structures
  • system design
  • performance-sensitive, large-scale distributed systems
  • delivering technically complex, high-impact initiatives
  • collaboration across product, research, and infrastructure teams
  • strategic, product-oriented mindset
  • mentoring engineers
  • fostering technical excellence
  • BS in Computer Science or related field

Nice to have

  • MS or PhD

What the JD emphasized

  • 10+ years of experience building and operating large-scale distributed systems
  • Significant depth in GPU training infrastructure, high-performance computing, or ML systems
  • Hands-on experience with distributed training frameworks
  • Strong understanding of training resilience patterns
  • Solid grasp of GPU performance fundamentals

Other signals

  • large-scale GPU training
  • managed platform
  • serverless experience
  • orchestrating multi-node jobs
  • fine-tuning open models
  • pre-training foundation models