What you'd actually do

Adapt models for new conditioning inputs (emotion, speed, prosody, speaker control, etc.).

Fine-tune and optimize speech models using advanced techniques such as DPO (Direct Preference Optimization), LoRA, and other parameter-efficient methods to improve voice quality and expressiveness.

Implement post-training optimization techniques (quantization, pruning, distillation) to improve efficiency and latency in real-time speech generation.

Integrate and test novel architectures, such as neural codecs, diffusion, or flow-matching models, to enhance realism and responsiveness.

Design and implement new evaluation metrics for TTS systems, including automated Mean Opinion Score (MOS) prediction models for continuous quality assessment.

Skills

Required

generative modelling
large language models (LLMs)
transformer-based architectures
PyTorch
distributed training
model optimization
time-series modelling
tokenization
audio
speech
prototyping
hypothesis testing
deep learning models end-to-end
data preparation
evaluation
software engineering

Nice to have

diffusion models
neural codecs
flow-matching models
autoregressive decoders
speech-to-speech
text-to-speech (TTS) systems
publications

Synthesia is the world’s leading AI video platform for business, used by over 90% of the Fortune 100. Founded in 2017, the company is headquartered in London, with offices and teams across Europe and the US.

As AI continues to shape the way we live and work, Synthesia develops products to enhance visual communication and enterprise skill development, helping people work better and stay at the center of successful organizations.

Following our recent Series E funding round, where we raised $200 million, our valuation stands at $4 billion. Our total funding exceeds $530 million from premier investors including Accel, NVentures (Nvidia's VC arm), Kleiner Perkins, GV, and Evantic Capital, alongside the founders and operators of Stripe, Datadog, Miro, and Webflow.

About the role

As a Research Engineer you will join a team of 40+ Researchers and Engineers within the R&D Department working on cutting-edge challenges in the Generative AI space, with a focus on creating high-quality, expressive and real-time synthetic voices. Within the team you’ll have the opportunity to work on the applied side of our research efforts and directly impact our solutions that are used worldwide by over 60,000 businesses.

If you are an expert in ML, LLMs, speech generation, or conversational models, this is your chance to make a global impact. You will join our Audio Post-Training Team, which works on generative speech and voice synthesis, ensuring our in-house voice models reach production-level quality, speed, and robustness. Typical projects include:

Adapt models for new conditioning inputs (emotion, speed, prosody, speaker control, etc.).
Fine-tune and optimize speech models using advanced techniques such as DPO (Direct Preference Optimization), LoRA, and other parameter-efficient methods to improve voice quality and expressiveness.
Implement post-training optimization techniques (quantization, pruning, distillation) to improve efficiency and latency in real-time speech generation.
Integrate and test novel architectures, such as neural codecs, diffusion, or flow-matching models, to enhance realism and responsiveness.
Design and implement new evaluation metrics for TTS systems, including automated Mean Opinion Score (MOS) prediction models for continuous quality assessment.
Stay updated with the latest research in audio diffusion, autoregressive models, neural codecs, and multimodal LLMs.

What we're looking for:

Strong understanding of generative modelling, ideally applied to sequential or multimodal data.
Hands-on experience with large language models (LLMs) or similar transformer-based architectures.
High proficiency in PyTorch, including experience with distributed training and model optimization.
Solid grasp of time-series modelling and tokenization, preferably in the context of audio or speech.
Demonstrated ability to prototype quickly, test hypotheses, and iterate efficiently.
Proven experience in training deep learning models end-to-end, from data preparation to evaluation.
Strong general software engineering skills, enabling contributions to a large, shared research infrastructure.

Nice-to-have experience

Familiarity with state-of-the-art architectures in audio and speech generation (e.g., diffusion models, neural codecs, flow-matching models, autoregressive decoders).
Experience with speech-to-speech or text-to-speech (TTS) systems.
Evidence of original research contributions, such as publications in top-tier venues (e.g., ICASSP, Interspeech, NeurIPS, ICML) or open-source work.

Why join us?

We’re living in the golden age of AI. The next decade will yield the next iconic companies, and we dare to say we have what it takes to become one. Here’s why:

Our culture

At Synthesia we’re passionate about building, not talking, planning or politicising. We strive to hire the smartest, kindest and most unrelenting people and let them do their best work without distractions. Our work principles serve as our charter for how we make decisions, give feedback and structure our work to empower everyone to go as fast as possible. You can find out more about these principles here.

Serving 50,000+ customers (and 50% of the Fortune 500)

We’re trusted by leading brands such as Heineken, Zoom, Xerox, McDonald’s and more. Read stories from happy customers and what 1,200+ people say on G2.

Proprietary AI technology

Since 2017, we’ve been pioneering advancements in Generative AI. Our AI technology is built in-house by a team of world-class AI researchers and engineers. Learn more about our AI Research Lab and the team behind it.

AI Safety, Ethics and Security

AI safety, ethics, and security are fundamental to our mission. While the full scope of Artificial Intelligence's impact on our society is still unfolding, our position is clear: People first. Always. Learn more about our commitments to AI Ethics, Safety & Security.

The good stuff...

Competitive compensation (salary + stock options + bonus)
Fully remote from Europe, or hybrid from one of our offices in London, Amsterdam, Zurich, or Munich
25 days of annual leave + public holidays
Great company culture with the option to join regular planning and socials at our hubs
Other benefits depending on your location

You can see more about Who we are and How we work here: https://www.synthesia.io/careers



Senior Research Engineer - Audio Post-training
