Software Engineer L4/L5, Training Platform, Machine Learning Platform

Netflix · Big Tech · United States · Remote · Data & Insights

Software Engineer on the Machine Learning Platform (MLP) team, responsible for designing and building the platform that powers large-scale machine learning model training, fine-tuning, model transformation, and evaluation workflows for the entire company. The role focuses on optimizing systems and models for scale and cost-effectiveness, and on designing user-friendly APIs for ML practitioners.

What you'd actually do

  1. Design and build the platform that powers large-scale machine learning model training, fine-tuning, model transformation, and evaluation workflows for use cases across the entire company
  2. Co-design and optimize systems and models to scale up machine learning model training and make it more cost-effective
  3. Design easy-to-use APIs and interfaces so that both experienced ML practitioners and non-experts can easily access the training platform

Skills

Required

  • ML engineering on production systems
  • building and operating large-scale infrastructure for machine learning
  • cloud computing providers (AWS preferred)
  • comfort with ambiguity and with working across multiple layers of the tech stack
  • observability, logging, reporting, and on-call processes
  • modern, real-world machine learning model development workflows
  • partnering closely with ML modeling engineers

Nice to have

  • cloud-based AI/ML services (e.g., SageMaker, Bedrock, Databricks, OpenAI)
  • large-scale distributed training and different parallelism techniques (FSDP, tensor/pipeline parallelism)
  • Generative AI expertise
  • training foundation models
  • fine-tuning foundation models
  • distilling foundation models into smaller models

What the JD emphasized

  • large-scale infrastructure for machine learning use cases
  • training or inference of deep learning models
  • large-scale distributed training
  • training foundation models
  • fine-tuning foundation models
  • distilling foundation models into smaller models

Other signals

  • ML infrastructure
  • large-scale model training
  • fine-tuning
  • evaluation workflows