Principal Software Engineer

Workday · Enterprise · Toronto, ON +1

This Principal Software Engineer role focuses on building and scaling MLOps and CloudOps infrastructure for Workday's AI platform. The engineer will design and implement microservices, APIs, and cloud infrastructure using Python, Go, Terraform, and Kubernetes, with a focus on enabling the development, training, deployment, and lifecycle management of ML capabilities. The role also involves supporting Agentic AI systems and LLM-based applications, requiring experience with distributed systems, cloud computing, and infrastructure-as-code.

What you'd actually do

  1. Architect Distributed Systems: Lead the design and implementation of high-throughput microservices and APIs (Python/Go) that serve as the backbone for Workday’s ML ecosystem.
  2. Engineer the Platform: Build and optimize a unified ML development experience using Kubeflow, Kubernetes (EKS/GKE), and specialized compute orchestration (CPUs/GPUs).
  3. Scale Cloud Infrastructure: Own the end-to-end lifecycle of cloud-based services, utilizing Infrastructure as Code (Terraform) to build resilient, self-healing environments.
  4. Drive Engineering Excellence: Lead architecture reviews, code reviews, and technology evaluations to ensure our systems meet 99.99% reliability standards.
  5. Support Agentic AI: Design the architectural patterns and observability frameworks required to support emerging Agentic AI systems and LLM-based applications.

Skills

Required

  • 6 or more years of validated industry experience
  • Bachelor's and/or Master's degree in Computer Science or Computer Engineering
  • Strong software engineering experience with designing and building scalable, distributed systems
  • Deep understanding of cloud computing, cloud infrastructure, and distributed systems; experience with AWS and GCP
  • Experience developing microservices, APIs, robust cloud services, and large-scale web applications, and managing CI/CD workflows
  • Proficiency with Python, Go, and infrastructure-as-code tools like Terraform
  • Experience running and maintaining Kubernetes clusters in production
  • Experience ensuring the security and compliance of cloud platforms, including implementing best practices for encryption, data protection, and access control

Nice to have

  • Experience with large-scale ML data pipelines and data lakes
  • Ability to think across layers of the ML stack, from infrastructure to model deployment
  • Experience developing monitoring and alerting systems for ML infrastructure
  • Understanding of agentic AI concepts; experience with LangChain and LangSmith is preferred
  • Proven leadership or mentoring experience

What the JD emphasized

  • MLOps
  • CloudOps
  • data engineering
  • Kubernetes
  • Python
  • Go
  • Terraform
  • Agentic AI
  • LLM
  • 99.99% reliability standards
