Sr Machine Learning / AI Engineer / MLOps Engineer

Workday · Enterprise · Dublin, Ireland

This role focuses on operating, hardening, and improving the production infrastructure for an AI agent, including deployments, monitoring, incident response, and building tooling for agent trajectory and evaluation data. It involves managing the lifecycle of AI features, ensuring reliability of agentic loops, memory stores, and tool-use environments, and supporting performance testing for LLM-driven applications. The role requires experience with MLOps, LLM/agentic systems, containerization, and a deep understanding of the ML development lifecycle, including monitoring and evaluation.

What you'd actually do

  1. operate, harden, and continuously improve the production infrastructure that powers the Peakon Agent, multi-agent architectures, AI Features and related ML workloads
  2. manage the entire deployment lifecycle for the Peakon Agent and other AI Features, ensuring the reliability of long-running agentic loops, memory stores, and tool-use environments
  3. build and maintain tooling to surface agent trajectory and evaluation data, supporting performance testing, latency benchmarking, and load simulations specific to LLM-driven applications
  4. collaborate on the automation of essential security upgrades for ML dependencies
  5. provide clear runbooks, robust observability, on-call coverage, and predictable incident response

Skills

Required

  • Python
  • LangChain
  • LlamaIndex
  • Docker
  • Kubernetes
  • GitOps
  • GitHub Actions
  • MLOps
  • LLM
  • Agentic systems
  • Model monitoring
  • Regression tracking
  • Automated evaluation
  • LangSmith
  • System Design
  • Architectural Governance
  • Threat modeling
  • Guardrails
  • Regulated enterprise environments
  • Data auditability
  • Compliance

Nice to have

  • advanced fine-tuning
  • alignment techniques
  • prompt engineering
  • simulations of agent behaviors
  • RAG
  • autonomous decision-making agents

What the JD emphasized

  • Proven track record as an MLOps or ML-savvy SRE/Platform Engineer supporting production-grade LLM and agentic systems
  • Deep understanding of the model development lifecycle, specifically regarding model monitoring, regression tracking, and automated evaluation using tools like LangSmith
  • Proven experience navigating highly regulated enterprise environments to ensure data auditability, clear ownership boundaries, and strict compliance

Other signals

  • MLOps
  • LLM
  • Agentic systems
  • Production infrastructure
  • Observability
  • Reliability