Principal Machine Learning Engineer

Workday · Enterprise · New York, NY

The Principal Machine Learning Engineer will design and build core ML systems for Workday's next-generation AI agents. The role owns the full lifecycle of LLM-powered agents, including RAG pipelines, orchestration, evaluation, and feedback loops, and is responsible for their scalability, observability, and enterprise readiness. It sits at the intersection of ML and platform engineering and requires strong engineering judgment to build reliable, explainable systems at global scale.

What you'd actually do

  1. design and build the core ML systems behind Workday’s next generation of AI agents
  2. own how models, agent logic, and orchestration layers come together in production—across the full lifecycle from problem framing and data strategy to deployment, monitoring, and continuous improvement
  3. implement and evolve frameworks for LLM-powered agents, including RAG pipelines, workflow orchestration, evaluation, and feedback loops, ensuring solutions are scalable, observable, and enterprise-ready
  4. stay hands-on with emerging techniques in agentic architectures while applying strong engineering judgment to turn them into systems that are reliable, explainable, and built to operate at global scale

Skills

Required

  • PyTorch
  • TensorFlow
  • building services to host machine learning models in production at scale
  • large language models (LLMs)
  • text generation models
  • graph neural network models
  • cloud computing platforms (e.g., AWS, GCP)
  • leading, mentoring, and/or managing ML Engineering teams
  • development lifecycle and sprint planning
  • statistical analysis
  • unsupervised and supervised machine learning algorithms
  • natural language processing
  • information retrieval
  • recommendation system use cases
  • solving ambiguous, open-ended problems
  • technically leading teams
  • communication skills
  • building strong relationships across teams and stakeholders

Nice to have

  • Master’s or PhD preferred
  • advancements in AI, LLMs, RAG, autonomous agents and orchestration frameworks

What the JD emphasized

  • production-grade AI
  • global scale
  • enterprise-ready
  • deeply embedded into Workday’s platform
  • own problems end to end
  • autonomy to shape how agents reason, act, and scale responsibly
  • full lifecycle from problem framing and data strategy to deployment, monitoring, and continuous improvement
  • scalable, observable, and enterprise-ready
  • reliable, explainable, and built to operate at global scale
  • 10+ years experience
  • 4+ years of professional experience in machine learning and deep learning frameworks & toolkits
  • 6+ years of professional experience in building services to host machine learning models in production at scale
  • 3+ years of demonstrated experience working with large language models (LLMs)
  • 6+ years of proven experience with cloud computing platforms
  • Proven track record of successfully leading, mentoring, and/or managing ML Engineering teams

Other signals

  • production-grade AI
  • intelligent agents
  • global scale
  • enterprise-ready