Principal Engineer, Compute Fleet Management

Databricks · Data AI · United States · Executive Engineering - Pipeline

Databricks is seeking a Principal Engineer to lead Compute Fleet Management, focusing on optimizing and scaling cloud compute infrastructure across AWS, Azure, and GCP. This role is critical for improving gross margin and customer experience by pioneering fleet optimization, delivering hyper-scale resilience, and owning the critical path for the compute platform. The ideal candidate will have a strong background in distributed systems and experience operating large-scale infrastructure, with a preference for GPU fleet management for AI/ML workloads.

What you'd actually do

  1. Provision and pool cloud resources on the order of billions to achieve peak workload performance, industry-leading efficiency, and robust resource isolation.
  2. Build the architecture that guarantees horizontal scaling and resilience against zonal or even cloud account-level failures, ensuring Databricks is always on.
  3. Lead the development of the lowest-dependency systems required to bootstrap and manage our massive compute platform.

Skills

Required

  • Operating large-scale distributed systems
  • Experience with at least one major public cloud
  • Leading transformative projects
  • Execution discipline
  • Planning and tracking project progress
  • Managing complex cross-organizational dependencies

Nice to have

  • Managing and scaling a massive fleet of GPUs for AI/ML workloads
  • Developing and operating large-scale distributed systems across all major clouds (AWS, Azure, and GCP)

What the JD emphasized

  • mission-critical
  • massive compute platform
  • large-scale, mission-critical infrastructure systems
  • complex, cross-team, cross-layer, and multi-quarter strategic engineering initiatives