Senior Research Scientist - Reinforcement Learning, MoEs

Canva · Enterprise · London, United Kingdom · Information Technology

Senior Research Scientist focused on Reinforcement Learning, Mixture of Experts (MoEs), and Agentic Systems for multimodal applications. The role involves developing agent systems, scaling post-training and RL across distributed systems, contributing to research direction, building reward models, and developing evaluation methods. The goal is to convert research breakthroughs into reliable, safe, and high-quality product features.

What you'd actually do

  1. Develop agent systems (planning, multimodal tool use, retrieval, novel training approaches, modeling ablations) for real tasks in design, vision, and language.
  2. Scale post-training and RL across distributed systems (PyTorch) with efficient data loaders, tracing/telemetry, stable training of mixture-of-experts (MoE) architectures, and reproducible pipelines; profile, debug, and optimize.
  3. Contribute to the research agenda for RL/agentic systems aligned with Canva’s product goals; identify high‑leverage bets and retire dead ends quickly.
  4. Build reward models and learning loops: RLHF/RLAIF, preference modeling, DPO/IPO‑style objectives, offline/online RL, curriculum learning, and credit assignment for multi‑step reasoning.
  5. Develop simulation and sandbox tasks that surface failure modes (planning errors, tool‑use brittleness, hallucination, unsafe actions) and turn them into measurable targets.

Skills

Required

  • Implementing and post-training MoEs/LLMs/VLMs/diffusion models
  • Modifying and adapting open-source models
  • Experimental design
  • Python
  • PyTorch
  • Building agent loops
  • Policy optimization
  • Reward modeling
  • Preference learning
  • Large-scale training
  • Cloud multimodal tooling
  • RL for MoE architectures

Nice to have

  • Video and audio modeling
  • Multi-agent settings
  • Alignment and safety evaluations
  • Red-teaming
  • Risk mitigation for tool-using agents
  • Contributions to open-source, benchmarks, or shared evaluation suites for agents

What the JD emphasized

  • Track record of shipped research or publications in MoEs, RL, or agents
  • Experience modifying and adapting open-source models
  • Strong experience with experimental design
  • Fluency in Python and PyTorch
  • Practical experience building agent loops
  • Hands-on experience with policy optimization, reward modeling, and preference learning
  • Experience with large‑scale training
  • Experience with RL for MoE architectures

Other signals

  • Reinforcement Learning
  • Mixture of Experts (MoEs)
  • Agentic Systems
  • Multimodal Models
  • Post-training