Senior Machine Learning Engineer - Voice Experience

Cresta · Vertical AI · United States · Remote · Engineering

This Senior Machine Learning Engineer role focuses on building and improving AI-powered voice systems for contact centers. The work spans ASR, speech understanding, agentic workflows, and real-time production systems, with a strong emphasis on model quality, evaluation, and production deployment.

What you'd actually do

  1. Design, train, evaluate, and deploy machine learning systems that power real-time voice experiences, including ASR, speech understanding, turn detection, text-to-speech, speech-to-speech, classification, entity extraction, summarization, and structured insight generation.
  2. Improve the quality of voice AI systems through error analysis, data curation, metric design, benchmarking, and iterative model improvement, with a strong focus on real-world performance.
  3. Build evaluation frameworks for complex voice and agentic systems, measuring metrics such as accuracy, robustness, latency, faithfulness, naturalness, professionalism, task completion, and cost.
  4. Diagnose and mitigate failure modes across the voice stack, including transcription errors, hallucinations, retrieval failures, tool misuse, prompt brittleness, context drift, and multi-step reasoning breakdowns.
  5. Design and optimize low-latency ML workflows for live conversations, balancing model quality with system responsiveness, scalability, and reliability.

Skills

Required

  • PyTorch
  • TensorFlow
  • Hugging Face
  • transformer-based models
  • embeddings
  • retrieval systems
  • large-scale training
  • inference workflows
  • real-time ML systems
  • latency
  • scalability
  • reliability
  • data pipelines
  • experimentation
  • measurement
  • quality analysis
  • speech recognition
  • speech processing
  • NLP
  • generative AI
  • conversational AI
  • model evaluation
  • benchmarking
  • error analysis
  • quality improvement
  • production ML systems

Nice to have

  • ASR quality metrics
  • WER (word error rate)
  • task-level evaluation methodologies
  • RAG systems
  • agentic workflows
  • multi-step reasoning systems
  • LLM-as-a-judge evaluation methods
  • streaming inference
  • real-time voice pipelines
  • media systems
  • infrastructure teams
  • platform teams
  • production ML deployment
  • observability
  • reliability
  • contact center AI
  • conversational intelligence
  • enterprise voice products

What the JD emphasized

  • real-time production systems
  • model quality
  • production reality
  • rigorous evaluation frameworks
  • failure modes
  • latency and robustness
  • systems that run reliably at scale in real-time voice environments
  • error analysis
  • metric design
  • benchmarking
  • iterative model improvement
  • real-world performance
  • complex voice and agentic systems
  • accuracy, robustness, latency, faithfulness, naturalness, professionalism, task completion, and cost
  • transcription errors, hallucinations, retrieval failures, tool misuse, prompt brittleness, context drift, and multi-step reasoning breakdowns
  • low-latency ML workflows
  • system responsiveness, scalability, and reliability
  • productionize real-time inference, streaming pipelines, quality monitoring, and continuous model iteration
  • offline evaluation, online experimentation, model validation, observability, and ongoing quality monitoring in production
  • 5+ years of experience building, evaluating, and deploying machine learning systems in production
  • Strong background in one or more of the following: speech recognition, speech processing, NLP, generative AI, or conversational AI
  • Deep experience with model evaluation, benchmarking, error analysis, and quality improvement for production ML systems
  • Solid understanding of transformer-based models, embeddings, retrieval systems, and large-scale training or inference workflows
  • Experience designing and deploying real-time ML systems with strong requirements around latency, scalability, and reliability
  • Experience building data pipelines and tooling for experimentation, measurement, and large-scale quality analysis
  • Ability to work across research and engineering boundaries and translate promising ideas into production-grade systems

Other signals

  • Develop and improve machine learning systems that power voice experiences end to end
  • Improve the quality of voice AI systems through error analysis, data curation, metric design, benchmarking, and iterative model improvement
  • Build evaluation frameworks for complex voice and agentic systems
  • Diagnose and mitigate failure modes across the voice stack
  • Design and optimize low-latency ML workflows for live conversations