Lead Machine Learning Engineer (GenAI, Python, Go, AWS)

Capital One · Banking · San Francisco, CA +3

Lead Machine Learning Engineer focused on designing, building, and productionizing Generative AI applications and Agentic Workflow systems at scale. The role involves building robust ML serving architecture, developing high-performance code, and ensuring low latency and high availability of AI solutions, with a strong emphasis on cloud-native platforms and MLOps.

What you'd actually do

  1. Design, build, and deliver GenAI models and components that solve complex business problems, while working in collaboration with the Product and Data Science teams.
  2. Design and implement cloud-native ML Serving Platforms leveraging technologies like Docker, Kubernetes, Knative, and KServe to ensure optimized and scalable deployment of models.
  3. Solve complex scaling and high-availability problems by writing and testing performant application code in Python and Go, developing and validating ML models, and automating tests and deployment.
  4. Implement advanced MLOps and GitOps practices for continuous integration and continuous deployment (CI/CD) using tools like ArgoCD to manage the entire lifecycle of models and applications.
  5. Leverage service mesh architectures like Istio to manage traffic, enhance security, and ensure resilience for high-volume serving endpoints.

Skills

Required

  • Bachelor's Degree
  • 6 years of experience designing and building data-intensive solutions using distributed computing
  • 4 years of experience programming with Python, Scala, Go or Java
  • 2 years of experience building, scaling, and optimizing ML systems

Nice to have

  • Master's or Doctoral Degree in computer science, electrical engineering, mathematics, or a similar field
  • 3+ years of experience building production-ready data pipelines that feed ML models
  • 3+ years of on-the-job experience with an industry-recognized ML framework such as scikit-learn, PyTorch, Dask, Spark, or TensorFlow
  • 2+ years of experience developing performant, resilient, and maintainable code
  • 2+ years of experience with data gathering and preparation for ML models
  • 2+ years of people leader experience
  • 1+ years of experience leading teams developing ML solutions using industry best practices, patterns, and automation
  • Experience developing and deploying ML solutions in a public cloud such as AWS, Azure, or Google Cloud Platform
  • Experience designing, implementing, and scaling complex data pipelines for ML models and evaluating their performance
  • ML industry impact through conference presentations, papers, blog posts, open source contributions, or patents

What the JD emphasized

  • GenAI Workflows Serving team
  • Generative AI applications
  • Agentic Workflow systems
  • ML serving architecture
  • Responsible and Explainable AI

Other signals

  • productionizing Generative AI applications
  • Agentic Workflow systems at massive scale
  • ML serving architecture
  • high-performance application code
  • high availability, security, and low latency of Generative AI solutions