Staff Machine Learning Engineer, ML Infrastructure

Unity · Enterprise · Mountain View, CA · AI & Machine Learning

Staff ML Engineer focused on building and evolving a large-scale offline ML platform for data pipelines, distributed model training, and feature generation at Unity.

What you'd actually do

  1. Design and operate large-scale data pipelines that generate datasets for machine learning training and experimentation
  2. Develop infrastructure that supports distributed training workflows using technologies such as PyTorch, Ray Data, and Ray Train
  3. Integrate ML pipelines with workflow orchestration systems (e.g., Flyte, Airflow, or similar) to enable reliable multi-stage training workflows
  4. Improve reproducibility and observability of ML pipelines through dataset validation, monitoring, and automated testing
  5. Optimize performance and resource utilization across distributed compute systems used for data processing and model training

Skills

Required

  • Python
  • distributed computing frameworks (Ray, Spark, Flink)
  • Ray Data
  • Ray Train
  • workflow orchestration systems (Flyte, Airflow)
  • data lakes
  • data warehouses
  • streaming platforms
  • systems thinking
  • performance optimization
  • scalability
  • reliability
  • cost optimization

Nice to have

  • PyTorch

What the JD emphasized

  • strong technical ownership
  • design and evolve the large-scale offline platform
  • building reliable infrastructure
  • shaping how model datasets are prepared
  • model training, validated, and delivered
  • ensure the reliability, scalability, and performance
  • Strong experience building large-scale ML pipelines
  • Experience working with distributed computing frameworks
  • Experience building infrastructure for training data generation, dataset preparation, or ML feature pipelines
  • Deep experience designing and operating production-grade data pipelines
  • Strong programming skills in Python
  • Strong systems thinking
  • Proven ability to lead technical direction and influence architectural decisions

Other signals

  • ML platform
  • large-scale model training
  • feature generation
  • experimentation workflows