Staff Machine Learning Engineer, ML Infrastructure

Unity · Enterprise · Mountain View, CA · AI & Machine Learning

Staff ML Engineer focused on building and operating a large-scale offline ML platform for Unity, supporting data pipelines, distributed model training, and experimentation workflows.

What you'd actually do

  1. Design and operate large-scale data pipelines that generate datasets for machine learning training and experimentation
  2. Develop infrastructure that supports distributed training workflows using technologies such as PyTorch, Ray Data, and Ray Train
  3. Integrate ML pipelines with workflow orchestration systems (e.g., Flyte, Airflow, or similar) to enable reliable multi-stage training workflows
  4. Improve reproducibility and observability of ML pipelines through dataset validation, monitoring, and automated testing
  5. Optimize performance and resource utilization across distributed compute systems used for data processing and model training

Skills

Required

  • Python
  • distributed computing frameworks (Ray, Spark, Flink)
  • ML pipelines
  • data pipelines
  • workflow orchestration systems (Flyte, Airflow)
  • dataset validation
  • monitoring
  • automated testing
  • performance optimization
  • resource utilization
  • systems thinking
  • technical leadership

Nice to have

  • Ray Data
  • Ray Train

What the JD emphasized

  • strong technical ownership
  • design and evolve the large-scale offline platform
  • building reliable infrastructure
  • orchestrating ML workflows
  • enabling efficient, distributed model training at scale
  • shaping how datasets are prepared and how models are trained, validated, and delivered
  • ensuring the reliability, scalability, and performance
  • Strong experience building large-scale ML pipelines
  • Experience working with distributed computing frameworks
  • Experience building infrastructure for training data generation, dataset preparation, or ML feature pipelines
  • Deep experience designing and operating production-grade data pipelines
  • Strong programming skills in Python
  • experience working with large-scale distributed workloads
  • Strong systems thinking
  • Proven ability to lead technical direction and influence architectural decisions

Other signals

  • ML platform
  • large-scale model training
  • feature generation
  • experimentation workflows