Staff Machine Learning Engineer, ML Infrastructure

Unity · Enterprise · Mountain View, CA · AI & Machine Learning

Staff ML Engineer focused on building and operating a large-scale offline ML platform for data generation, feature engineering, and distributed model training at Unity.

What you'd actually do

  1. Design and operate large-scale data pipelines that generate datasets for machine learning training and experimentation
  2. Develop infrastructure that supports distributed training workflows using technologies such as PyTorch, Ray Data, and Ray Train
  3. Integrate ML pipelines with workflow orchestration systems (e.g., Flyte, Airflow, or similar) to enable reliable multi-stage training workflows
  4. Improve reproducibility and observability of ML pipelines through dataset validation, monitoring, and automated testing
  5. Optimize performance and resource utilization across distributed compute systems used for data processing and model training

Skills

Required

  • Python
  • distributed computing frameworks (Ray, Spark, Flink)
  • Ray Data
  • Ray Train
  • PyTorch
  • workflow orchestration systems (Flyte, Airflow)
  • data lakes
  • data warehouses
  • streaming platforms
  • systems thinking
  • technical leadership

What the JD emphasized

  • strong technical ownership
  • design and evolve the large-scale offline platform
  • building reliable infrastructure
  • orchestrating ML workflows
  • enabling efficient, distributed model training at scale
  • shaping how model datasets are prepared and how models are trained, validated, and delivered
  • reliability, scalability, and performance
  • Strong experience building large-scale ML pipelines
  • Experience working with distributed computing frameworks
  • Experience building infrastructure for training data generation, dataset preparation, or ML feature pipelines
  • Deep experience designing and operating production-grade data pipelines
  • Strong programming skills in Python
  • Experience working with large-scale distributed workloads
  • Strong systems thinking
  • Proven ability to lead technical direction and influence architectural decisions

Other signals

  • ML platform
  • large-scale model training
  • distributed training
  • data pipelines