Senior/Staff Machine Learning Engineer, Data Infrastructure

Unity · Enterprise · Shanghai, China · AI & Machine Learning

This role focuses on building and evolving Unity's large-scale offline data platform, covering data infrastructure, training dataset generation, and data workflow orchestration. The engineer will work with ML engineers and platform teams to keep pipelines reliable, scalable, and efficient as data volumes grow and training workloads become more complex, and will play a key role in preparing training datasets for production ML systems.

What you'd actually do

  1. Develop infrastructure that supports both batch and stream processing of big data using technologies such as Flink, Spark, and Ray
  2. Design and operate large-scale data pipelines that generate datasets for machine learning training and experimentation
  3. Integrate data pipelines with workflow orchestration systems (e.g., Flyte, Airflow, or similar) to enable reliable multi-stage training workflows
  4. Improve the reproducibility and observability of data pipelines through dataset validation, monitoring, and automated testing
  5. Optimize performance and resource utilization across distributed compute systems used for data processing

Skills

Required

  • Experience with distributed data processing frameworks such as Flink, Spark, or Ray
  • Experience building infrastructure for training data generation, dataset preparation, or ML feature pipelines
  • Experience optimizing big data pipelines and infrastructure for cost efficiency
  • Strong programming skills in Python and experience working with large-scale distributed workloads
  • Experience with modern data infrastructure (data lakes, warehouses, orchestration systems, streaming platforms)
  • Strong systems thinking, with the ability to reason about performance, scalability, reliability, and cost tradeoffs in distributed systems
  • Proven ability to lead technical direction and influence architectural decisions across teams without formal authority

What the JD emphasized

  • strong technical ownership
  • large-scale offline platform
  • large-scale model training
  • training datasets
  • production ML systems
  • large-scale data pipelines
  • multi-stage training workflows
  • large-scale distributed workloads

Other signals

  • ML pipelines
  • training datasets
  • large-scale model training
  • feature generation
  • experimentation workflows
  • production ML systems