Senior Distinguished Engineer, AI Compute (Remote Eligible)

Capital One · Banking · San Francisco, CA +5 · Remote

Senior Distinguished Engineer focused on architecting and building the AI compute infrastructure for Capital One's enterprise machine learning platform. This role involves developing scalable, high-performance systems for diverse AI workloads including LLM pre-training, fine-tuning, inference, and agentic applications, leveraging distributed compute frameworks like Ray and Spark on cloud substrates.

What you'd actually do

  1. Architect and build the control and data plane implementations required to realize a highly available, multi-tenant, large-scale, and secure machine learning platform
  2. Develop Ray and Spark distributed compute engine solutions to accelerate diverse workloads, from LLM pre-training and reinforcement learning to large-scale data processing, while maximizing compute unit economics (see the Ray sketch after this list)
  3. Engineer systemic improvements for operational excellence including automating KTLO (Keep The Lights On) workflows
  4. Direct the technical execution of a diverse project portfolio, collaborating with developers specializing in everything ranging from distributed microservices to running large foundation models
  5. Work cross-functionally with product and program management disciplines, and with stakeholders and partners across Capital One, to optimize business outcomes while driving toward strong technology solutions
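
As a concrete illustration of item 2, here is a minimal, hypothetical Ray sketch of the fan-out/fan-in pattern behind distributed batch inference. Everything in it (the model path, worker count, and the `InferenceWorker` class) is invented for the example; it is not Capital One's platform code.

```python
# Minimal sketch: distributing batch inference across Ray actors.
import ray

ray.init()  # in practice this would attach to an existing cluster

@ray.remote(num_cpus=1)  # swap to num_gpus=1 on GPU-backed clusters
class InferenceWorker:
    def __init__(self, model_path: str):
        # Hypothetical: load a fine-tuned Transformer checkpoint here.
        self.model_path = model_path

    def predict(self, batch: list[str]) -> list[str]:
        # Placeholder for a real forward pass.
        return [f"prediction for {item}" for item in batch]

# Fan the workload out across a small pool of actors, then gather.
workers = [InferenceWorker.remote("s3://models/example-ckpt") for _ in range(4)]
batches = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
results = ray.get([w.predict.remote(b) for w, b in zip(workers, batches)])
print(results)
```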

Skills

Required

  • Bachelor's degree in Computer Science, AI, Electrical Engineering, Computer Engineering, or a related field plus at least 10 years of experience developing AI and ML algorithms or technologies, or a Master's degree in one of those fields plus at least 8 years of that experience
  • At least 10 years of experience programming with Python, Go, Scala, or Java
  • Distributed, large-scale, highly available, high-performance systems
  • Common compute infrastructure on top of CPU and GPU substrates
  • ML/DL model training, model inference, and feature generation pipelines (see the Spark sketch after this list)
  • Pre-training and fine-tuning Transformer-based models; generative AI inference and agentic applications
  • Golang and Python; distributed compute frameworks including Spark, Dask, Ray, and Flink
  • Container (Kubernetes) and serverless (AWS Lambda) runtime environments; ML+AI workload patterns
  • Control plane and data plane implementations for a multi-tenant, secure machine learning platform
  • Reinforcement learning and large-scale data processing, with attention to compute unit economics
  • Operational excellence, including automating KTLO workflows
  • Technical execution of a diverse project portfolio, from distributed microservices to running large foundation models
  • Cross-functional work with product and program management, and with stakeholders and partners, to optimize business outcomes through strong technology solutions
  • System design and code review sessions; engagement in the Distinguished Engineering community; mentoring internal talent and recruiting external talent
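
To make the "feature generation pipelines" bullet concrete, here is a toy Spark aggregation, assuming PySpark; the events, columns, and feature names are invented for illustration.

```python
# Toy sketch: turning raw transaction events into per-customer features.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-gen-sketch").getOrCreate()

# Hypothetical raw events; a real pipeline would read from a data lake.
txns = spark.createDataFrame(
    [("c1", 120.0), ("c1", 80.0), ("c2", 300.0)],
    ["customer_id", "amount"],
)

# Aggregate events into model-ready features.
features = txns.groupBy("customer_id").agg(
    F.count("*").alias("txn_count"),
    F.avg("amount").alias("avg_amount"),
)
features.show()
```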

Nice to have

  • Master's degree in Computer Science or Software Engineering
  • Hands-on experience with the internals of Ray (actors, GCS, scheduling) or Spark (query optimizer, memory management); see the sketch after this list
  • Experience building platforms that support LLM training, fine-tuning, or high-throughput inference
  • Hands-on experience with AWS-specific compute primitives (EKS, EC2 UltraClusters, Graviton) and cost-optimization strategies
  • History of upstream contributions to major distributed systems projects
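
On the Ray internals bullet: scheduling behavior is commonly steered through placement groups, as in this small sketch. The resource shapes and task here are arbitrary examples, not a prescribed configuration.

```python
# Sketch: pinning tasks to co-located resource bundles via a placement group.
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve two CPU bundles packed onto as few nodes as possible.
pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="PACK")
ray.get(pg.ready())  # block until the bundles are reserved

@ray.remote(num_cpus=1)
def shard_work(shard_id: int) -> int:
    return shard_id * 2

# Schedule each task inside the reserved placement group.
refs = [
    shard_work.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote(i)
    for i in range(2)
]
print(ray.get(refs))
```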

What the JD emphasized

  • hands-on technical leader passionate about distributed systems
  • engineer and scale foundational compute capabilities
  • building large scale, highly available and high performance systems
  • ML / DL model training, model inference and feature generation pipelines
  • pre-training and fine tuning Transformer-based models
  • generative AI inference and agentic applications
  • depth of expertise in technologies including the Golang and Python programming languages; popular distributed compute frameworks including Spark, Dask, Ray, and Flink; container (e.g., Kubernetes) and serverless (e.g., AWS Lambda) runtime environments; and ML+AI workload patterns (a minimal serverless sketch follows)
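
As a rough illustration of the serverless runtime pattern the JD names, here is a minimal AWS Lambda handler shape in Python. `run_model` is a hypothetical stand-in for a real model client; no particular hosted-model API is implied.

```python
# Sketch: an API Gateway-style Lambda handler fronting model inference.
import json

def run_model(prompt: str) -> str:
    # Hypothetical placeholder for a call to a hosted model.
    return f"echo: {prompt}"

def handler(event, context):
    # API Gateway proxy integrations deliver the request body as a string.
    body = json.loads(event.get("body") or "{}")
    prompt = body.get("prompt", "")
    return {
        "statusCode": 200,
        "body": json.dumps({"completion": run_model(prompt)}),
    }
```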

Other signals

  • common compute infrastructure on top of CPU and GPU substrates
  • distributed systems
  • machine learning platform organization
  • high-scale developer and runtime environments