Senior Machine Learning Engineer - Training Platform (AU Remote)

Canva · Enterprise · Sydney, Australia · Information Technology

This Senior Machine Learning Engineer role focuses on designing, scaling, and maturing the systems and infrastructure for AI model training workloads at Canva. The role involves working on a Kubernetes-based training platform, improving reliability, observability, and operational support, and contributing to platform capabilities for scheduling, resource allocation, and experiment management. Collaboration with research scientists, AI engineers, and infrastructure teams is key to shaping the platform's roadmap and enabling AI-powered experiences.

What you'd actually do

  1. You’ll contribute to the evolution of Canva’s unified training platform for AI training workloads
  2. You’ll improve reliability, observability, debugging, and operational support for training systems
  3. You’ll design and build the platform capabilities that enable better scheduling at scale, including resource allocation, priority management, and quota management for training workloads
  4. You’ll collaborate closely with research scientists, ML engineers, product teams, and cloud/infrastructure teams to improve training platform workflows and outcomes
  5. You’ll contribute to system design and architecture decisions across Canva’s AI Platform

Skills

Required

  • Experience in training pipelines
  • distributed systems
  • large-scale AI infrastructure
  • Kubernetes
  • containerized workloads

Nice to have

  • Experience with training infrastructure
  • distributed frameworks such as Ray, PyTorch distributed training, or similar technologies
  • modern cloud and infrastructure services
  • high-performance storage
  • HPC environments
  • fast interconnects and networking capabilities
  • services such as FSx, EFA, and related infrastructure

What the JD emphasized

  • training infrastructure
  • distributed systems
  • large-scale AI infrastructure
  • Kubernetes
  • distributed frameworks
  • high-performance storage
  • HPC environments
  • fast interconnects and networking capabilities
  • large-scale training environments
  • scalability
  • reliability
  • usability
  • developer experience
  • platform foundations

Other signals

  • training platform
  • distributed training systems
  • large-scale AI workloads
  • Kubernetes-based training platform