Software Engineer L4/L5, Model Serving Systems, Machine Learning Platform

Netflix · Big Tech · United States · Remote · Data & Insights

Netflix is seeking a Software Engineer for its Machine Learning Platform team to develop and expand its model serving systems, with a focus on infrastructure for LLMs and other large foundation models. The role involves building scalable, robust systems for online ML model inference, optimizing for latency and cost, and ensuring high availability and performance. It is highly cross-functional, partnering with engineering and data science teams across the company.

What you'd actually do

  1. Develop and expand our compute infrastructure to support growing AI needs
  2. Enable the application of ML in new business areas
  3. Drive ML/AI innovation across Netflix
  4. Partner with other engineers, product managers, machine learning engineers, and data/research scientists

Skills

Required

  • building high-traffic distributed services and infrastructure for online ML model inference
  • supporting large-scale ML models with a focus on high availability and performance
  • building scalable model-serving solutions for generative models and LLMs
  • reducing latency and costs
  • resolving bottlenecks to streamline research-to-production workflows
  • object-oriented programming (preferably Java)
  • engineering excellence in production hosting, including performance tuning, deployment management, and capacity planning
  • deploying ML models using tools such as Triton Inference Server, TensorRT, and Docker
  • working with public clouds such as AWS, Azure, or GCP
  • communicating proactively and promoting best practices in observability and logging

Nice to have

  • experience building and expanding compute infrastructure for AI needs
  • experience enabling LLM innovation
  • experience driving ML/AI innovation

What the JD emphasized

  • high-traffic distributed services and infrastructure for online ML model inference
  • supporting large-scale ML models with a focus on high availability and performance
  • scalable model-serving solutions for generative models and LLMs
  • reducing latency and costs
  • resolving bottlenecks to streamline research-to-production workflows
  • engineering excellence in production hosting, including performance tuning, deployment management, and capacity planning
  • observability and logging

Other signals

  • building model serving infrastructure for LLMs
  • enabling LLM innovation
  • serving ML models at scale
  • real-time model inference and serving platform