Software Engineer 5 – Model Runtime, AI Platform

Netflix · Big Tech · United States · Remote · Data & Insights

Software Engineer 5 on the Model Runtime team at Netflix, focusing on building and optimizing infrastructure for training, alignment (RLHF, DPO, PPO), and serving of ML models, including multimodal and diffusion models. The role involves deep systems programming, distributed training at scale, and performance tuning across the full stack, from PyTorch to GPU kernels.

What you'd actually do

  1. Build alignment and post-training infrastructure — Design infrastructure for reinforcement learning (GRPO, DPO, PPO), reward modeling, and preference optimization so Netflix can train recommendation models directly against what members actually value.
  2. Enable next-generation GenAI workloads — Create infrastructure for multimodal and diffusion models, including distributed training, disaggregated serving, real-time, near-real-time, and batch inference, and asynchronous GPU pipelines.
  3. Scale distributed training — Engineer fault-tolerant training systems using FSDP, tensor/pipeline/context parallelism, and mixed-precision strategies across clusters of hundreds of GPUs.
  4. Optimize across the full stack — Profile and tune from PyTorch operators down to GPU kernels, driving utilization improvements and building cost models that inform infrastructure strategy.
  5. Evaluate emerging hardware and frameworks — Be the team's eyes on specialized accelerators, next-gen NVIDIA silicon, and the open-source ecosystem to keep Netflix at the efficiency frontier.
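
The alignment work in item 1 centers on preference-optimization objectives such as DPO. As a rough sketch of the kind of training primitive this infrastructure has to support (function name, signature, and the `beta` default are illustrative, not Netflix code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) preference pairs.

    Each argument holds per-sequence summed log-probabilities; `beta`
    controls how far the policy may drift from the frozen reference.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), averaged over the batch
    return -F.logsigmoid(logits).mean()
```

With identical policy and reference log-probs the loss sits at log 2 ≈ 0.693; the infrastructure-side problems are elsewhere — serving the frozen reference model cheaply and sharding both models' states across the cluster.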

Skills

Required

  • Experience in ML systems engineering — building infrastructure for training, fine-tuning, or inference of pre-LLM and post-LLM era models at scale.
  • Strong systems programming skills, with the ability to work across multiple layers of the stack, from high-level ML frameworks down to GPU kernels and memory management
  • Hands-on experience with PyTorch internals, large-scale distributed training, and system-model co-design
  • Comfortable with ambiguity and working across multiple business and technical domains to execute on both 0-to-1 and 1-to-100 projects
  • Experience adopting and promoting best practices in operations, including observability, logging, reporting, and on-call processes, to ensure engineering excellence
  • Experience with cloud computing providers, preferably AWS
  • Excellent written and verbal communication skills, effective across distributed time zones and remote environments
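
The "GPU kernels and memory management" requirement usually starts with back-of-the-envelope cost models of the kind item 4 above alludes to. A minimal sketch, assuming mixed-precision Adam training and the standard 16-bytes-per-parameter accounting from the ZeRO paper (function names are illustrative, and activations are deliberately ignored):

```python
def training_memory_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Rough per-GPU memory for mixed-precision Adam training, excluding
    activations: fp16 params (2) + fp16 grads (2) + fp32 master copy (4)
    + Adam first/second moments (4 + 4) = 16 bytes per parameter.
    """
    return n_params * bytes_per_param / 1e9

def fsdp_sharded_memory_gb(n_params: float, world_size: int,
                           bytes_per_param: int = 16) -> float:
    """With full sharding (FSDP / ZeRO-3), those states are split evenly
    across ranks, so the per-GPU footprint divides by the world size."""
    return training_memory_gb(n_params, bytes_per_param) / world_size
```

For example, a 7B-parameter model needs roughly 112 GB of optimizer/parameter state unsharded — which is exactly why a single 80 GB GPU cannot train it naively, and why FSDP-style sharding appears throughout this posting.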

Nice to have

  • Deep experience with distributed training at scale (FSDP, parallelism strategies, checkpointing) or LLM post-training (SFT, RLHF, DPO/GRPO)
  • Inference optimization — vLLM, TensorRT, quantization, continuous batching, KV-cache management
  • GPU performance profiling and tuning (CUDA, NCCL, Nsight, PyTorch profiler)
  • Experience with multimodal or diffusion model architectures and generation pipelines
  • Track record building reusable ML libraries or contributing to open-source ML projects
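
For the profiling items above, the usual entry point is `torch.profiler`. A minimal CPU-only sketch (the wrapper function is illustrative; on a GPU host you would add `ProfilerActivity.CUDA` to see kernel-level timings alongside the operator view):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_matmul(n: int = 512) -> str:
    """Profile a single matmul and return an operator-level summary."""
    a = torch.randn(n, n)
    b = torch.randn(n, n)
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        torch.matmul(a, b)
    # key_averages() aggregates recorded events by operator name
    return prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
```

The resulting table breaks time down by `aten::` operator, which is the first step before dropping further into Nsight or NCCL traces.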

What the JD emphasized

  • building infrastructure for training, fine-tuning, or inference of pre-LLM and post-LLM era models at scale
  • Deep experience with distributed training at scale
  • LLM post-training (SFT, RLHF, DPO/GRPO)
  • Inference optimization
  • GPU performance profiling and tuning

Other signals

  • GPU kernels
  • PyTorch internals