Principal, Software Development Engineer

Workday · Enterprise · Toronto, ON +1

Workday is seeking a Principal Software Development Engineer to join its AI Model Serving team. The person in this role will be a technical leader responsible for shaping the vision and direction of the platform, making critical design decisions, and driving outcomes. The team owns the services that power all production AI workloads, including vendor-hosted LLMs and the models Workday hosts and scales itself. Responsibilities include designing, implementing, and maintaining large-scale systems for moving ML models to production, evaluating new technologies, and ensuring platform health and reliability. The role requires deep experience building and operating large-scale distributed systems, product thinking, and familiarity with both LLMs and traditional ML models.

What you'd actually do

  1. Help set the product vision for the AI Model Serving platform in partnership with the engineering manager, bringing a product-oriented mindset to infrastructure decisions.
  2. Lead the team technically by making critical design decisions that drive performance, reliability, and scalability across the platform.
  3. Design, implement, and maintain large-scale systems that enable moving ML models to production.
  4. Write design documents to build consensus for new system components and enhancements to existing components.
  5. Evaluate and adopt new technologies made available within Workday and across the broader industry.

Skills

Required

  • software development
  • building and operating large-scale distributed systems
  • Python
  • large language models
  • traditional machine learning models
  • production-grade distributed systems
  • software development lifecycle
  • coding standards
  • testing
  • code reviews
  • source control management
  • deployment
  • high-throughput platforms
  • product thinking
  • platform engineering
  • design decisions
  • usability
  • performance
  • long-term maintainability

Nice to have

  • Ray Serve
  • GCP
  • AWS Bedrock
  • Gemini

What the JD emphasized

  • 8+ years of related work experience in software development, with a focus on building and operating large-scale distributed systems.
  • Deep experience designing, building, and scaling production-grade distributed systems.
  • Deep proficiency in Python, with extensive experience writing production-level code and building systems in Python-based frameworks.
  • Familiarity with both large language models and traditional machine learning models, including how they are served, scaled, and monitored in production environments.

Other signals

  • The AI Model Serving team is the engine behind every production Workday agent and machine learning use case
  • own the services that power all production AI workloads
  • operating the model deployment platform where Workday hosts and scales its models
  • making our uniform vendor interface more robust, scaling our architecture to support more than 20 agents going into production
  • exploring hosting open-source LLMs in our stack
  • Your work will directly impact Workday's ability to serve AI at scale — from traditional ML models to the latest large language models powering Workday's agents.