Principal Associate, Data Scientist - LLM Customization Team

Capital One · Banking · McLean, VA +2

Capital One's LLM Customization team is seeking a Principal Associate Data Scientist to work on GenAI models. The role involves creating high-quality data for training and testing, building capabilities for evaluating and monitoring generative models, and developing horizontal capabilities such as search, summarization, RAG, and agentic workflows for production applications. The candidate will partner with cross-functional teams, use technologies such as PyTorch, AWS, Hugging Face, LangChain, and VectorDBs, and adapt and fine-tune LLMs for customer-facing applications. Responsibilities include building ML and NLP models through all development phases, from design to evaluation and validation, and operationalizing them in production systems serving over 80 million customers. Experience in training language models, explainability, RLHF, and delivering models at scale is required.

What you'd actually do

  1. Partner with a cross-functional team of data scientists, software engineers, machine learning engineers, and product managers to deliver AI-powered products that change how customers interact with their money.
  2. Leverage a broad stack of technologies (PyTorch, AWS UltraClusters, Hugging Face, LangChain, Lightning, VectorDBs, and more) to reveal the insights hidden within huge volumes of numeric and textual data.
  3. Be the expert in Natural Language Processing (NLP) who harnesses the power of Large Language Models (LLMs), adapting and fine-tuning them for customer-facing applications and features.
  4. Build machine learning and NLP models through all phases of development, from design through training, evaluation, and validation, and partner with engineering teams to operationalize them in scalable and resilient production systems that serve 80+ million customers.
  5. Flex your interpersonal skills to translate the complexity of your work into tangible business goals.

Skills

Required

  • Natural Language Processing (NLP)
  • Large Language Models (LLMs)
  • PyTorch
  • AWS
  • Hugging Face
  • LangChain
  • VectorDBs
  • Training language models
  • Explainability
  • RLHF
  • Python
  • SQL

Nice to have

  • Machine learning
  • Scala
  • R
  • Computer vision models
  • Training optimization
  • Self-supervised learning

What the JD emphasized

  • delivering models at scale, in both training data and inference volumes
  • experience in delivering libraries, platforms, or solution-level code to existing products
  • training language models or large computer vision models
  • expertise in one or more key subdomains such as: training optimization, self-supervised learning, explainability, RLHF

Other signals

  • LLM Customization team is on the cutting edge of GenAI
  • AI Training Team touches every aspect of the model development life cycle
  • deployed models in production drive business impact
  • build capabilities for evaluating and monitoring generative models
  • build search, summarization, RAG, and agentic workflows for integration in production applications