Staff Software Engineer - AI Research Infrastructure

Databricks · Data AI · San Francisco, CA · Engineering - Pipeline

Staff Software Engineer focused on building and operating the AI research infrastructure at Databricks. This role involves designing and implementing services for large-scale training and inference workloads, improving developer tooling, and ensuring reliability, efficiency, and security for AI research. The engineer will partner with researchers and ML engineers to create robust pipelines and influence the long-term roadmap for research computation.

What you'd actually do

  1. Design and implement infrastructure that supports large‑scale experiments, data processing, and model training (e.g., HPC clusters, GPU fleets, or cloud‑based systems).
  2. Enable researchers to go from idea to large‑scale experiment in minutes, not days, by building powerful abstractions for job submission, scheduling, and monitoring.
  3. Create tooling that improves research developer productivity, such as experiment management systems, CI/testing infrastructure for research code, and workflows that reduce iteration time.
  4. Influence the long‑term roadmap for research computation, shaping how Databricks AI Research trains, evaluates, and ships models to customers.
  5. Serve as a technical mentor and force multiplier for other engineers working on compute, infra, and AI systems.

Skills

Required

  • BS/MS or PhD in Computer Science or a related field
  • 5+ years of software engineering experience
  • Substantial time working on large‑scale distributed systems or infrastructure
  • Deep experience building and operating distributed systems, data pipelines, or large‑scale backend services, ideally involving GPUs, clusters, or major cloud providers
  • Proficiency in one or more systems programming languages (e.g., C++, Rust, Go, Java, Scala)
  • Ability to design, implement, and debug complex services
  • Experience building or significantly contributing to cluster schedulers, resource managers, or large‑scale job orchestration systems (e.g., Kubernetes, Slurm, Ray, custom internal systems)
  • Understanding of modern ML training and inference workflows (e.g., distributed training, model parallelism, fine‑tuning, evaluation)
  • Ability to move fast and be pragmatic in getting things done
  • Care for operational excellence
  • Track record of driving complex systems from prototype to stable, well‑owned services
  • Clear communication with both researchers and engineers
  • Ability to translate between research needs and infra realities

What the JD emphasized

  • large‑scale distributed systems
  • large‑scale backend services
  • large‑scale job orchestration systems
  • modern ML training and inference workflows

Other signals

  • large-scale training and inference experiment workloads
  • thousands of GPUs
  • research computation
  • ML training and inference workflows