Distinguished Engineer

Capital One · Banking · San Jose, CA +3

The Distinguished Engineer will anchor the Foundation Model (FM) Hosting team, focusing on efficient, reliable, and rapid serving of large language models at scale. The role involves pushing the limits of LLM inference, owning the technical strategy for the FM serving stack, and bridging the gap between AI Science and production infrastructure. Responsibilities include designing and driving the roadmap for high throughput, ultra-low latency, and optimal GPU utilization; leading performance engineering; and co-designing model architectures for deployability with the AI Research & Science teams.

What you'd actually do

  1. Design and drive the long-term technical roadmap for our Foundation Model Hosting platform, ensuring high throughput, ultra-low latency, and optimal GPU utilization across massive, multi-tenant workloads.
  2. Lead performance engineering across both the platform and model layers. You will pioneer the implementation of advanced techniques such as speculative decoding, continuous batching, KV-cache optimization (PagedAttention), and custom quantization strategies (FP8, INT4, AWQ).
  3. Act as the primary engineering counterpart to our AI Research & Science teams. You will co-design model architectures for deployability, ensuring that the latest foundation models transition seamlessly from the lab to highly optimized production environments.
  4. Mentor senior engineers, establish rigorous engineering standards for AI deployment, and foster a culture of uncompromising technical excellence.

Skills

Required

  • Software engineering
  • Public or private cloud technologies
  • Networking
  • Distributed inference communication primitives
  • Tensor Parallelism (TP) and Pipeline Parallelism (PP) architectures

Nice to have

  • Java
  • Python
  • Go
  • JavaScript
  • TypeScript
  • Swift
  • Full lifecycle of system development
  • NCCL optimization
  • NVLink/NVSwitch utilization
  • InfiniBand/RDMA tuning
  • Published research or papers at top-tier Machine Learning and Systems conferences
  • Patents related to distributed systems, model compression, or AI inference scaling
  • Routing and scheduling mechanisms for split-architecture serving or multi-LoRA serving architectures

What the JD emphasized

  • pushing the absolute limits of LLM inference physics
  • shaving off milliseconds of latency
  • writing custom CUDA kernels to bypass hardware bottlenecks
  • architecting distributed systems that scale effortlessly on Kubernetes
  • high throughput, ultra-low latency, and optimal GPU utilization
  • co-design model architectures for deployability
  • highly optimized production environments
  • Contributions, active maintainer status, or core authorship in open-source AI infrastructure or serving projects
  • Published research or papers at top-tier Machine Learning and Systems conferences

Other signals

  • foundation model hosting
  • LLM inference
  • AI infrastructure
  • low latency
  • GPU utilization