Staff Software Engineer, Compute Infrastructure

Databricks · Data AI · Mountain View, CA · Engineering

Databricks is seeking a Staff Software Engineer, Compute Infrastructure to design and build the systems that power its compute infrastructure. The role involves developing compute abstractions, workload orchestration and scheduling systems, and fleet management systems that let engineers launch and scale products efficiently and reliably across major cloud providers. The ideal candidate has extensive experience with large-scale distributed systems and cloud platforms.

What you'd actually do

  1. Develop the compute abstractions that provide powerful capabilities for all Databricks workloads, enabling engineers to build world-class products with high velocity and best-in-class performance
  2. Design the workload orchestration and scheduling systems that orchestrate all types of workloads (serving, batch, stateful, GPU) with high performance and efficiency
  3. Scale the fleet management systems that launch and configure millions of VMs every day across cloud providers
  4. Raise the technical and operational bar through strong design practices, testing, and a culture of engineering excellence and a platform mindset
  5. Lead cross-team initiatives that span product and infrastructure surface areas

Skills

Required

  • BS (or higher) in Computer Science or related field
  • 10+ years of experience designing and building large-scale distributed systems
  • Strong proficiency in one or more languages such as Java, Scala, Go, or C++
  • Experience with service-oriented architectures and large-scale distributed systems
  • Familiarity with cloud platforms (AWS, Azure, GCP) and container/orchestration technologies (Kubernetes, Docker)
  • Track record of shipping infrastructure that supports mission-critical workloads at scale

What the JD emphasized

  • 10+ years of experience designing and building large-scale distributed systems
  • Track record of shipping infrastructure that supports mission-critical workloads at scale