ML Platform Engineer

Synthesia · Multimodal · EUROPE · Research and Development

Synthesia, a leading AI video platform, is looking for an ML Platform Engineer to build and operate the systems that train, serve, and deploy generative models reliably and efficiently. The work spans research infrastructure, production serving systems, and internal tooling, with a growing emphasis on agent-oriented workflows. The role calls for a strong generalist with a systems mindset who is comfortable working across infrastructure, backend systems, and tooling, and who focuses on reliability, scalability, performance, and resource efficiency in complex production environments.

What you'd actually do

  1. Design and improve the platform systems that support model training, evaluation, and production serving.
  2. Build infrastructure and tooling that make ML workloads more reliable, scalable, and cost-efficient.
  3. Develop internal tools and workflows that are easy to operate both by humans and by agents.
  4. Work on the architecture behind how models are deployed, served, and operated across research and product environments.
  5. Improve how we schedule, monitor, and debug workloads running on GPUs and cloud infrastructure.

Skills

Required

  • Strong experience building or operating production systems with a focus on reliability, scalability, and maintainability.
  • A systems mindset: you naturally think in terms of bottlenecks, failure modes, interfaces, resource usage, and long-term operability.
  • Solid hands-on experience with cloud infrastructure, Linux, and infrastructure automation.
  • Experience with Kubernetes and operating distributed workloads in production.
  • Strong coding skills, ideally in Python or similar languages used for backend systems and tooling.
  • Strong judgment around where automation adds leverage, and where human control and reliability matter most.
  • Experience building internal platforms, developer tooling, or infrastructure abstractions used by other engineers.
  • Comfort working in ambiguous environments and taking ownership of open-ended technical problems.
  • A pragmatic approach: you care about solving the right problem well, not over-engineering.

Nice to have

  • Experience operating ML infrastructure or model serving systems in production.
  • Experience supporting research or data-intensive workloads.
  • Experience working with GPU-based systems or other performance-sensitive infrastructure.
  • Experience with observability and debugging in distributed systems.
  • Familiarity with Terraform, Datadog, GitHub Actions, or similar tools.
  • Experience building agentic or LLM-powered internal tools.
  • Experience with workflow orchestration systems such as Temporal.
  • Experience working at the boundary between research and production engineering.
  • Familiarity with performance optimization, scheduling, or resource allocation problems.
  • Experience building lightweight product or developer-facing tools.

What the JD emphasized

  • train, serve, and deploy
  • agent-oriented
  • both by humans and by agents

Other signals

  • ML Platform
  • train, serve, and deploy generative models
  • agent-oriented
  • production serving systems
  • internal tooling