Senior Software Engineer, AI Runtime

Databricks · Data AI · Mountain View, CA · Engineering

Databricks is seeking a Senior Software Engineer to build and scale the AI Runtime (AIR) platform, a managed system for large-scale GPU training and fine-tuning. The role involves driving the architecture and evolution of the training stack, focusing on scheduling, distributed training performance, fault tolerance, and developer experience. The engineer will solve complex problems in multi-node orchestration, GPU scheduling, data loading, and resilience for long-running jobs, aiming to improve GPU efficiency and reduce training costs.

What you'd actually do

  1. Drive the architecture and evolution of AIR's managed GPU training platform, delivering scalable, high-throughput, and resilient training across fleets that span thousands of accelerators.
  2. Solve the hardest problems in large-scale training, including multi-node orchestration, distributed parallelism strategies, GPU scheduling and dynamic routing, high-throughput data loading, and checkpoint and restore for very long-running jobs.
  3. Push GPU efficiency and training performance, raising metrics such as model FLOPs utilization (MFU) and end-to-end throughput, and lowering cost per training run across diverse model architectures and hardware generations.
  4. Build the resilience and observability foundations that keep multi-node jobs healthy, detecting and recovering from hardware and software failures with minimal disruption to customers.
  5. Partner with product, research, and platform teams to shape the APIs, CLI, and developer experience that make it easy to launch, monitor, and debug production training jobs.

Skills

Required

  • 5+ years building and operating large-scale distributed systems, with experience in GPU training infrastructure, high-performance computing, or ML systems.
  • Experience with distributed training frameworks (such as PyTorch, FSDP, DeepSpeed, or Megatron) and the parallelism strategies (data, tensor, pipeline, and sequence parallelism) used to train large models.
  • Strong understanding of training resilience patterns, including checkpointing, failure detection, and automatic recovery for long-running, multi-node jobs.
  • Solid grasp of GPU performance fundamentals, including accelerator architecture, high-speed interconnects (such as NVLink and InfiniBand or RoCE), collective communication, and the bottlenecks that govern training throughput and utilization.
  • Experience building and operating managed, multi-tenant platform products in the cloud, with clear SLAs and SLOs for availability, performance, and reliability.
  • Strong foundation in algorithms, data structures, and system design as applied to performance-sensitive, large-scale distributed systems.
  • Proven ability to deliver technically complex, high-impact initiatives that create clear customer or business value.
  • Strong communication skills and the ability to collaborate across product, research, and infrastructure teams in a fast-moving environment.
  • Customer-focused mindset with the ability to align implementation details with product goals, and a passion for mentoring engineers and fostering technical excellence.

Nice to have

  • MS or PhD preferred

What the JD emphasized

  • large-scale GPU training
  • multi-node orchestration
  • distributed training performance
  • fault tolerance
  • resilience
  • observability

Other signals

  • managed platform
  • developer experience