Machine Learning Engineer 5 - Globalization

Netflix · Big Tech · United States · Remote · Data & Insights

Machine Learning Engineer at Netflix focused on optimizing training and inference efficiency for LLMs and Multimodal LLMs within the Globalization team. The role involves designing and building scalable systems and optimizing data pipelines, distributed training, mixed precision, KV caching, batching, and quantization to improve the performance, latency, and reliability of ML models for Netflix's global catalog.

What you'd actually do

  1. Design and build scalable training and inference systems for LLMs, Multimodal LLMs, and other media ML models.
  2. Optimize end-to-end training: data pipelines (streaming, sharding, bucketing), distributed training (parallelism strategies), and mixed precision; a minimal PyTorch sketch of this setup follows the list.
  3. Optimize inference and serving: KV caching, batching, quantization, and long-context handling.
  4. Scale model training and inference into robust, performant systems integrated into Netflix workflows.
  5. Act as a technical thought leader for training and inference efficiency, driving initiatives that significantly improve scalability, latency, and reliability.
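
As a rough illustration of the training-side responsibilities in item 2, the sketch below wires distributed data-parallel training, sharded data loading, and mixed precision into one PyTorch loop. This is a minimal sketch, not Netflix's actual stack: the model, dataset, and hyperparameters are placeholders, and it assumes a torchrun launch.

```python
# Minimal sketch: mixed-precision data-parallel training in PyTorch.
# Assumes launch via `torchrun --nproc_per_node=<num_gpus> train.py`;
# the model, data, and hyperparameters are illustrative placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group("nccl")
device = torch.device(f"cuda:{dist.get_rank() % torch.cuda.device_count()}")
torch.cuda.set_device(device)

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 1)
).to(device)
model = DDP(model, device_ids=[device.index])       # gradient sync across ranks

dataset = TensorDataset(torch.randn(4096, 512), torch.randn(4096, 1))
sampler = DistributedSampler(dataset)               # shards the data per rank
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                # loss scaling for fp16

for x, y in loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                 # mixed-precision forward pass
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                   # scaled backward pass
    scaler.step(optimizer)
    scaler.update()

dist.destroy_process_group()
```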

Skills

Required

  • ML engineering for large, production-grade systems
  • LLMs
  • Multimodal LLMs
  • media ML models
  • training optimization
  • high-throughput data loading
  • distributed training
  • GPU/accelerator optimization
  • inference optimization
  • KV cache design and optimization
  • batching and scheduling for high-throughput, low-latency serving
  • quantization
  • model compression
  • PyTorch
  • software engineering fundamentals
  • testing
  • observability
  • performance profiling
  • leading ML initiatives
  • stakeholder partnership
  • communication and collaboration skills
  • ambiguity tolerance
  • high ownership

Nice to have

  • technical thought leadership
  • mentoring engineers and scientists

What the JD emphasized

  • Extensive experience in ML engineering for large, production-grade systems using LLMs, Multimodal LLMs, and other media ML models.
  • Deep hands-on expertise in training optimization: high-throughput data loading (streaming, sharding, bucketing); distributed training (parallelism strategies); GPU/accelerator optimization.
  • Strong experience in inference optimization: KV cache design and optimization; batching and scheduling for high-throughput, low-latency serving; quantization and/or model compression (a toy KV-cache sketch follows this list).
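
As a toy illustration of KV-cache reuse (not drawn from the JD's stack), the sketch below decodes one token at a time with a single attention head: keys and values for past tokens are cached, so each step only projects the new token instead of re-encoding the whole prefix. All names, weights, and shapes are illustrative placeholders.

```python
# Minimal sketch: KV caching for incremental decoding (one toy attention head).
# Weights and shapes are illustrative placeholders, not a production model.
import torch

torch.manual_seed(0)
d_model = 64
w_q = torch.randn(d_model, d_model)   # stand-in query/key/value projections
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

def decode_step(x_new, k_cache, v_cache):
    """Attend one new token against all previously cached keys/values."""
    q = x_new @ w_q                                      # (1, d_model)
    k_cache = torch.cat([k_cache, x_new @ w_k], dim=0)   # grow cache by one row
    v_cache = torch.cat([v_cache, x_new @ w_v], dim=0)
    attn = torch.softmax(q @ k_cache.T / d_model ** 0.5, dim=-1)
    return attn @ v_cache, k_cache, v_cache              # output is (1, d_model)

# Decode a few tokens, reusing the cache instead of re-encoding the prefix.
k_cache = torch.empty(0, d_model)
v_cache = torch.empty(0, d_model)
for _ in range(4):
    x_new = torch.randn(1, d_model)    # stand-in for the next token's embedding
    out, k_cache, v_cache = decode_step(x_new, k_cache, v_cache)
print(out.shape, k_cache.shape)        # torch.Size([1, 64]) torch.Size([4, 64])
```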

Other signals

  • LLM training and inference efficiency
  • production-ready ML solutions
  • scalable training and inference systems
  • optimize end-to-end training
  • optimize inference and serving