Staff Software Engineer, Foundational Model Serving

Databricks · Data + AI · Mountain View, CA · Engineering

Staff Software Engineer focused on building and operating high-scale, low-latency inference systems for foundational AI models (LLMs) at Databricks. The role involves designing and implementing core systems and APIs for model serving, optimizing performance on GPU workloads, and influencing architectural direction for the Foundation Model Serving product.

What you'd actually do

  1. Design and implement core systems and APIs that power Databricks Foundation Model Serving, ensuring scalability, reliability, and operational excellence.
  2. Partner with product and engineering leadership to define the technical roadmap and long-term architecture for serving workloads.
  3. Drive architectural decisions and trade-offs to optimize performance, throughput, autoscaling, and operational efficiency for GPU serving workloads.
  4. Contribute directly to key components across the serving infrastructure — from working in systems like vLLM and SGLang to building token-based rate limiters and optimizers — ensuring smooth and efficient operations at scale.
  5. Collaborate cross-functionally with product, platform, and research teams to translate customer needs into reliable and performant systems.
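One concrete component the responsibilities above call out is a token-based rate limiter. A common way to build one is a token bucket metered in LLM tokens rather than requests; the sketch below is illustrative (class and parameter names are assumptions, not from the posting):

```python
import time


class TokenBucketLimiter:
    """Rate limiter metered in LLM tokens rather than request counts.

    Admits a request only if the bucket currently holds enough tokens;
    the bucket refills continuously at `tokens_per_second`, capped at
    `burst` to bound how much unused quota can accumulate.
    """

    def __init__(self, tokens_per_second: float, burst: float):
        self.rate = tokens_per_second
        self.capacity = burst
        self.available = burst
        self.last = time.monotonic()

    def try_acquire(self, token_count: int) -> bool:
        """Try to admit a request that will consume `token_count` tokens."""
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at burst capacity.
        self.available = min(
            self.capacity, self.available + (now - self.last) * self.rate
        )
        self.last = now
        if token_count <= self.available:
            self.available -= token_count
            return True
        return False
```

Metering in tokens instead of requests matters for LLM serving because per-request cost varies by orders of magnitude with prompt and completion length; a request-count limiter would let a few long-context calls saturate the GPUs.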

Skills

Required

  • 10+ years of experience building and operating large-scale distributed systems
  • Experience leading high-scale operationally sensitive backend systems
  • Strong foundation in algorithms, data structures, and system design as applied to large-scale, low-latency serving systems
  • Proven ability to deliver technically complex, high-impact initiatives that create measurable customer or business value
  • Strong communication skills and ability to collaborate across teams in fast-moving environments
  • Strategic and product-oriented mindset with the ability to align technical execution with long-term vision

Nice to have

  • Interest in going deep on building LLM APIs and runtimes at scale
  • Experience mentoring and growing engineers and fostering technical excellence

What the JD emphasized

  • high-scale, operationally sensitive systems
  • high-throughput, low-latency inference
  • GPU workloads
  • large-scale distributed systems
  • low-latency serving systems

Other signals

  • foundation model serving