Staff Machine Learning Engineer, ML Infrastructure - Online

Unity · Enterprise · Shanghai, China · AI & Machine Learning

A Staff ML Engineer role focused on building and operating Unity's online ML inference platform. The work involves designing, optimizing, and scaling the infrastructure that serves production ML models with low latency and high reliability, supporting experimentation, and improving observability.

What you'd actually do

  1. Design and operate large-scale online inference infrastructure that serves production ML models with low latency and high reliability.
  2. Build and improve model serving systems using PyTorch, Triton Inference Server, Kubernetes, GKE, Ray, or similar serving and orchestration technologies.
  3. Optimize inference performance through batching, model compilation, GPU/CPU utilization improvements, request scheduling, and runtime-level tuning.
  4. Develop infrastructure for model deployment, canary testing, A/B experimentation, traffic splitting, rollback, and production validation.
  5. Improve observability of online ML systems through latency, throughput, error-rate, cost, saturation, and model-health monitoring.
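
To make the canary testing, traffic splitting, and rollback duties in item 4 concrete, here is a minimal sketch of how such logic is often structured. It is illustrative only and is not taken from Unity's platform: the function names `route` and `should_rollback`, the hash-based bucketing, and the thresholds (`max_ratio`, `floor`, `min_requests`) are all assumptions chosen for the example.

```python
import hashlib


def route(request_id: str, canary_percent: float) -> str:
    """Pick 'canary' or 'stable' for a request (hypothetical helper).

    The request ID is hashed so the same ID always lands on the same
    variant, which keeps a caller's experience consistent during rollout.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_percent / 100.0 else "stable"


def should_rollback(
    canary_errors: int,
    canary_total: int,
    stable_errors: int,
    stable_total: int,
    max_ratio: float = 2.0,     # assumed: tolerate up to 2x the stable error rate
    floor: float = 0.01,        # assumed: ignore canary error rates below 1%
    min_requests: int = 100,    # assumed: wait for enough canary traffic
) -> bool:
    """Decide whether the canary should be rolled back (hypothetical policy)."""
    if canary_total < min_requests:
        return False  # not enough data to judge yet
    canary_rate = canary_errors / canary_total
    stable_rate = stable_errors / stable_total if stable_total else 0.0
    return canary_rate > max(stable_rate * max_ratio, floor)
```

In practice a serving mesh or gateway usually handles the split itself, and a rollout controller evaluates health metrics over a time window; the sketch only shows the shape of the decision.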

Skills

Required

  • Building and operating production-grade online ML inference systems
  • Model serving frameworks (NVIDIA Triton Inference Server, TorchServe, Ray Serve, TensorFlow Serving, or similar)
  • Optimizing inference workloads (dynamic batching, model compilation, quantization, GPU acceleration, GPU kernel optimization, caching, runtime tuning)
  • Distributed systems
  • Kubernetes
  • Autoscaling
  • Service reliability
  • Production observability
  • Python programming
  • Production ML systems and high-scale services
  • PyTorch
  • Modern model deployment workflows (packaging, validation, serving lifecycle management)
  • Infrastructure for safe model rollout, canary testing, A/B experimentation, and automated rollback
  • Systems thinking (latency, throughput, reliability, scalability, cost tradeoffs)
  • Technical leadership and influencing architectural decisions

Nice to have

  • GKE
  • Ray

What the JD emphasized

  • strong technical ownership
  • reliable infrastructure for serving machine learning models in production
  • optimizing inference performance
  • safe, efficient experimentation across high-traffic online systems
  • strong systems thinking
  • deep experience with production ML infrastructure
  • drive architectural improvements across teams
  • production-grade online ML inference systems
  • model serving frameworks
  • optimizing inference workloads
  • distributed systems, Kubernetes, autoscaling, service reliability, and production observability
  • production ML systems and high-scale services
  • modern model deployment workflows
  • safe model rollout, canary testing, A/B experimentation, and automated rollback
  • reason about latency, throughput, reliability, scalability, and cost tradeoffs in online systems
  • lead technical direction and influence architectural decisions across teams without formal authority

Other signals

  • online ML systems serve production models at scale
  • low-latency inference
  • large-scale experimentation
  • model deployment and optimization
  • feature processing
  • business-critical decisioning
  • inference platform must remain reliable, scalable, observable, and cost-efficient