Senior Machine Learning Engineer, ML Infrastructure - Offline

Unity · Enterprise · Shanghai, China · AI & Machine Learning

Senior ML Engineer focused on building and operating a large-scale offline ML platform for Unity. The role involves designing and evolving data pipelines for training datasets, orchestrating ML workflows, and enabling efficient, distributed model training. Key responsibilities include developing infrastructure for distributed training, integrating with orchestration systems, and optimizing performance.

What you'd actually do

  1. Design and operate large-scale data pipelines that generate datasets for machine learning training and experimentation
  2. Develop infrastructure that supports distributed training workflows using technologies such as PyTorch, Ray Data, and Ray Train
  3. Integrate ML pipelines with workflow orchestration systems (e.g., Flyte, Airflow, or similar) to enable reliable multi-stage training workflows
  4. Improve reproducibility and observability of ML pipelines through dataset validation, monitoring, and automated testing
  5. Optimize performance and resource utilization across distributed compute systems used for data processing and model training

Skills

Required

  • Experience working with distributed computing frameworks such as Ray, Spark, or Flink, and familiarity with the Ray ecosystem (Ray Data, Ray Train) for distributed data processing and model training
  • Experience building and optimizing large-scale distributed ML training pipelines using techniques such as Torch compilation, quantization, CUDA, and GPU kernel optimization
  • Experience building infrastructure for training data generation, dataset preparation, or ML feature pipelines
  • Deep experience designing and operating production-grade data pipelines
  • Strong programming skills in Python and experience working with large-scale distributed workloads
  • Experience with modern data infrastructure (data lakes, warehouses, orchestration systems, streaming platforms)
  • Strong systems thinking, with the ability to reason about performance, scalability, reliability, and cost tradeoffs in distributed systems
  • Proven ability to lead technical direction and influence architectural decisions across teams without formal authority

What the JD emphasized

  • large-scale offline platform
  • large-scale model training
  • distributed training
  • large-scale experimentation
  • large-scale distributed ML training pipelines
  • production-grade data pipelines
  • large-scale distributed workloads

Other signals

  • ML Infrastructure
  • Large-scale model training
  • Data pipelines for ML
  • Distributed computing