Staff ML Software Engineer (L6) - Platform Systems, AIMS Engineering

Netflix · Big Tech · Los Gatos, CA +1 · Data & Insights

Staff ML Software Engineer to own the technical health of Netflix's AIMS AI/ML stack, modernizing it and building observability and cost infrastructure. This role involves defining end-state architecture, driving migration of AI/ML systems, building migration tooling, owning scalability, designing observability systems, identifying cost optimizations, architecting reliability improvements, and prototyping GenAI-powered tooling for operational automation. Requires significant experience with large-scale production AI/ML systems and their migration, deep Python expertise, and a strong background in distributed systems and observability.

What you'd actually do

  1. Define the end-state architecture for the modernized AIMS AI/ML stack: how it is organized, what contracts each layer exposes, and what the migration path looks like across training pipelines, AI frameworks, and data infrastructure
  2. Drive end-to-end migration of AIMS AI/ML systems onto a modern, Python-native platform, coordinating across multiple AIMS teams and external platform partners, with dozens of production models in flight
  3. Build migration tooling and shared abstractions that reduce the cost of adoption for individual teams, so modernization does not require each team to solve the same problems independently
  4. Own scalability across training throughput and data pipelines, ensuring AIMS AI/ML systems stay performant as model complexity and member traffic grow
  5. Design and build observability systems that give AIMS ML practitioners deep visibility into model behavior, training pipeline health, serving latency, and data quality, making issues detectable and diagnosable before they become incidents

Skills

Required

  • Significant experience designing, building, and operating large-scale production AI/ML systems, including training pipelines, plus familiarity with model serving and online inference at high-traffic scale
  • Hands-on experience migrating production AI/ML systems across technology generations; you have done this before and understand where it goes wrong
  • Strong software engineering fundamentals with deep Python expertise and working proficiency in at least one JVM language (Scala or Java)
  • Proven track record of improving AI/ML system reliability, reducing infrastructure costs, and improving operational scalability
  • Experience building observability and monitoring systems for AI/ML workloads; you understand what good visibility looks like across training, serving, and data pipelines
  • Strong distributed systems background, including large-scale batch processing and real-time serving infrastructure
  • Ability to collaborate with partner teams to drive cross-functional technical programs: setting direction, managing dependencies, and building consensus without formal authority
  • High technical judgment: able to identify common patterns, build reusable frameworks, and make pragmatic calls on what to migrate, what to rewrite, and what to leave alone
  • Comfortable operating without full information; you can scope a problem, define an approach, and course-correct as you learn more

Nice to have

  • Experience with compute and cost optimization for AI/ML workloads at scale, including capacity management and efficiency tooling
  • Hands-on experience building GenAI-powered tooling for operational automation, root cause analysis, or anomaly detection in AI/ML systems
  • Experience building developer tooling or platform abstractions that improve AI/ML practitioner velocity
  • Applied experience in personalization domains such as recommendation systems, search, or discovery
  • Familiarity with modern AI/ML infrastructure patterns

What the JD emphasized

  • Hands-on experience migrating production AI/ML systems across technology generations; you have done this before and understand where it goes wrong

Other signals

  • Migrating AI/ML platform
  • Building observability and cost infrastructure
  • Modernizing AI/ML stack
  • Owning technical health of AI/ML stack
  • High-leverage, cross-cutting role