Distinguished Software Engineer - Ifx

Capital One · Banking · San Jose, CA +4

This role is for a Distinguished Software Engineer focused on building and scaling the foundational compute infrastructure for an enterprise AI+ML platform. The engineer will work on distributed systems and cloud technologies, and will support a range of AI/ML workloads, including LLM pre-training, fine-tuning, inference, and agentic applications.

What you'd actually do

  1. Architect and build the control plane and data plane implementations required to realize a highly available, multi-tenant, large-scale, and secure machine learning platform
  2. Develop Ray and Spark distributed compute engine solutions to accelerate diverse workloads from LLM pre-training and reinforcement learning to large-scale data processing, while maximizing compute unit economics
  3. Engineer systemic improvements for operational excellence, including automating KTLO (Keep The Lights On) workflows
  4. Direct the technical execution of a diverse project portfolio, collaborating with developers whose specialties range from distributed microservices to running large foundation models
  5. Work cross-functionally with product and program management disciplines, and with stakeholders and partners across Capital One, to help optimize business outcomes while driving towards strong technology solutions

Skills

Required

  • Bachelor's Degree
  • 7 years of experience in software engineering
  • 7 years of experience designing distributed systems, backend architecture, and API platforms
  • 5 years of experience with public cloud technologies

Nice to have

  • Master’s Degree in Computer Science or a Master’s Degree in Software Engineering
  • 3+ years of experience building platforms that support LLM training, fine-tuning, or high-throughput inference
  • 3+ years of experience with AWS-specific compute primitives (EKS, EC2 UltraClusters, Graviton) and cost-optimization strategies
  • 3+ years of experience with building platforms at scale

What the JD emphasized

  • building large-scale, highly available, and high-performance systems
  • common compute infrastructure on top of CPU and GPU substrates
  • powering everything from developer notebooks, ML / DL model training, model inference, and feature generation pipelines to pre-training and fine-tuning Transformer-based models, as well as generative AI inference and agentic applications
  • Golang and Python programming languages
  • popular distributed compute frameworks including Spark / Dask / Ray / Flink
  • container (e.g., Kubernetes) and serverless (e.g., AWS Lambda) runtime environments
  • ML+AI workload patterns
  • building platforms that support LLM training, fine-tuning, or high-throughput inference
  • AWS-specific compute primitives (EKS, EC2 UltraClusters, Graviton) and cost-optimization strategies
  • building platforms at scale

Other signals

  • machine learning platform foundation
  • enterprise AI+ML system
  • foundational compute capabilities
  • LLM pre-training
  • model inference
  • generative AI inference
  • agentic applications