Staff Machine Learning Engineer, ML Infrastructure - Offline

Unity · Enterprise · Shanghai, China · AI & Machine Learning

Staff ML Engineer focused on building and evolving the large-scale offline ML platform for data generation, workflow orchestration, and distributed model training at Unity.

What you'd actually do

  1. Design and operate large-scale data pipelines that generate datasets for machine learning training and experimentation
  2. Develop infrastructure that supports distributed training workflows using technologies such as PyTorch, Ray Data, and Ray Train
  3. Integrate ML pipelines with workflow orchestration systems (e.g., Flyte, Airflow, or similar) to enable reliable multi-stage training workflows
  4. Improve reproducibility and observability of ML pipelines through dataset validation, monitoring, and automated testing
  5. Optimize performance and resource utilization across distributed compute systems used for data processing and model training

Skills

Required

  • Python
  • distributed computing frameworks (Ray, Spark, Flink)
  • Ray ecosystem (Ray Data, Ray Train)
  • workflow orchestration systems (Flyte, Airflow)
  • data pipeline design and operation
  • ML infrastructure
  • training data generation
  • dataset preparation
  • ML feature pipelines
  • systems thinking
  • performance optimization
  • scalability
  • reliability
  • cost optimization

Nice to have

  • PyTorch

What the JD emphasized

  • strong technical ownership
  • design and evolve the large-scale offline platform
  • building reliable infrastructure
  • shaping how model datasets are prepared
  • model training, validated, and delivered
  • ensuring the reliability, scalability, and performance
  • large-scale ML pipelines
  • distributed computing frameworks
  • infrastructure for training data generation
  • production-grade data pipelines
  • large-scale distributed workloads
  • modern data infrastructure
  • systems thinking
  • lead technical direction
  • influence architectural decisions

Other signals

  • ML infrastructure
  • distributed training
  • data pipelines