Staff Software Engineer - GenAI Inference

Databricks · Data AI · San Francisco, CA · Engineering - Pipeline

Staff Software Engineer focused on the GenAI inference engine at Databricks, responsible for architecture, development, and optimization of high-throughput, low-latency LLM inference. This role involves kernel-level optimization, runtime development, orchestration, and integration with ML frameworks, bridging research advances with production demands.

What you'd actually do

  1. Own and drive the architecture, design, and implementation of the inference engine, and collaborate on a model-serving stack optimized for large-scale LLM inference
  2. Partner closely with researchers to bring new model architectures or features (sparsity, activation compression, mixture-of-experts) into the engine
  3. Lead end-to-end optimization for latency, throughput, memory efficiency, and hardware utilization across GPUs and accelerators
  4. Define and guide standards for building and maintaining instrumentation, profiling, and tracing tooling that uncovers bottlenecks and informs optimizations
  5. Architect scalable routing, batching, scheduling, memory management, and dynamic loading mechanisms for inference workloads

Skills

Required

  • Software engineering (6+ years)
  • ML inference internals
  • CUDA
  • GPU programming
  • distributed systems design
  • performance optimization
  • instrumentation, tracing, and profiling tools

Nice to have

  • published research
  • open-source contributions in ML systems, inference optimization, or model serving

What the JD emphasized

  • performance-critical systems
  • owning complex system components
  • architectural decisions end-to-end
  • ML inference internals
  • CUDA, GPU programming
  • distributed systems design
  • performance bottlenecks

Other signals

  • LLM inference engine
  • low latency
  • high throughput
  • GPU optimization