Sr. Director, Machine Learning Engineering (remote-eligible)

Capital One · Banking · McLean, VA +1 · Remote

The Sr. Director, Machine Learning Engineering leads and scales a high-performing engineering organization responsible for the Personalization Platform. This role defines the technical strategy, delivery roadmap, and operating model for recommendation systems, ranking, decisioning, GenAI infrastructure, MLOps, and low-latency serving systems. Responsibilities include building and developing engineers, partnering cross-functionally, driving ML infrastructure and pipelines, architecting low-latency systems, evolving MLOps practices, guiding AI/LLM optimization, and providing technical and people leadership.

What you'd actually do

  1. Lead and scale a high-performing engineering organization responsible for the Personalization Platform that powers real-time, personalized product experiences and multi-channel targeted user messaging across Capital One products and services
  2. Define the technical strategy, delivery roadmap, and operating model for a portfolio spanning recommendation systems, ranking, decisioning, GenAI infrastructure, MLOps, and low-latency application-serving systems
  3. Build, develop, and manage engineers and engineering leaders; set a high bar for hiring, performance, talent density, coaching, and succession planning across the organization
  4. Partner cross-functionally with Product, Data Science, Cloud Infrastructure, and Machine Learning Platform teams to align strategy, prioritize investments, and co-develop advanced recommendation systems and algorithms serving Capital One users
  5. Drive the design, buildout, and operation of robust ML infrastructure and pipelines supporting feature extraction, model training, testing, guardrails, evaluation, deployment, and both real-time and batch inference with strong reliability, scalability, and operational rigor

Skills

Required

  • Bachelor's degree in Computer Science, Engineering, or AI plus at least 10 years of experience developing or leading AI and ML algorithms or technologies, or Master's degree plus at least 8 years of experience developing or leading AI and ML algorithms or technologies
  • At least 5 years of people leadership experience

Nice to have

  • 7 years of experience managing and leading an engineering team
  • 8+ years of experience deploying scalable, responsible AI solutions on major cloud platforms (AWS, GCP, Azure)
  • Master’s or PhD in Computer Science or a relevant technical field
  • Proven expertise designing, implementing, and scaling personalization platforms and recommendation systems across feed personalization, ads ranking, or targeted marketing messaging
  • Proficiency in Python, Java, C++, or Golang; hands-on experience with ML frameworks (PyTorch, TensorFlow) and orchestration tools (Databricks, Airflow, Kubeflow)
  • Experience optimizing large-scale training and inference systems for hardware utilization, latency, throughput, and cost
  • Deep expertise in cloud-native engineering, containerization (Docker, Kubernetes), and automated CI/CD deployment
  • Deep experience with MLOps, model observability, and production ML lifecycle management
  • Strong track record building organizations, developing managers and senior engineers, and leading through scale and ambiguity
  • Excellent communication and presentation skills, with the ability to influence senior stakeholders

What the JD emphasized

  • responsible and reliable AI systems
  • real-time, personalized customer experiences
  • production-grade ML and GenAI systems
  • low-latency application platforms
  • real-time personalization
  • scalable observability
  • large-scale production AI systems
  • responsible AI solutions

Other signals

  • building world-class applied science and engineering teams
  • Hyper Personalization org is building the intelligence and infrastructure
  • real-time, personalized product experiences
  • recommendation systems, ranking, decisioning, GenAI infrastructure, MLOps, and low-latency application-serving systems
  • advanced recommendation systems and algorithms
  • ML infrastructure and pipelines supporting feature extraction, model training, testing, guardrails, evaluation, deployment, and both real-time and batch inference
  • low-latency, event-driven systems for real-time personalization and decisioning
  • MLOps practices through automated, metrics-backed deployment workflows, validation and testing systems, model lifecycle governance, and scalable observability
  • AI and LLM optimization techniques to improve scalability, cost, latency, throughput, and reliability of large-scale production AI systems
  • build-vs-buy decisions across a broad stack of open-source and SaaS AI technologies such as AWS UltraClusters, Hugging Face, VectorDBs, NeMo Guardrails, PyTorch, and more