Senior Machine Learning Engineer, ML Infrastructure - Online

Unity · Enterprise · Shanghai, China · AI & Machine Learning

Senior/Staff ML Engineer to design and evolve Unity Vector’s online model inference platform. The role focuses on building reliable infrastructure for serving ML models in production, optimizing inference performance, and enabling safe, efficient experimentation across high-traffic online systems. It requires strong systems thinking, deep experience with production ML infrastructure, and the ability to drive architectural improvements.

What you'd actually do

  1. Design and operate large-scale online inference infrastructure that serves production ML models with low latency and high reliability.
  2. Build and improve model serving systems using technologies such as PyTorch, Triton Inference Server, Kubernetes, GKE, Ray, or similar distributed serving frameworks.
  3. Optimize inference performance through batching, model compilation, GPU/CPU utilization improvements, request scheduling, and runtime-level tuning.
  4. Develop infrastructure for model deployment, canary testing, A/B experimentation, traffic splitting, rollback, and production validation.
  5. Improve observability of online ML systems through latency, throughput, error-rate, cost, saturation, and model-health monitoring.

Skills

Required

  • Python
  • PyTorch
  • NVIDIA Triton Inference Server
  • Kubernetes
  • GKE
  • Ray
  • distributed systems
  • autoscaling
  • service reliability
  • production observability
  • model serving frameworks
  • inference optimization
  • model deployment
  • canary testing
  • A/B experimentation
  • rollback
  • production validation
  • systems thinking

Nice to have

  • TorchServe
  • TensorFlow Serving
  • model compilation
  • quantization
  • GPU acceleration
  • GPU kernel optimization
  • caching
  • runtime tuning

What the JD emphasized

  • strong technical ownership
  • design and evolve Unity Vector’s online model inference platform
  • building reliable infrastructure for serving machine learning models in production
  • optimizing inference performance
  • enabling safe, efficient experimentation across high-traffic online systems
  • ensure models can be deployed, scaled, monitored, and iterated on efficiently
  • shaping how models are packaged, served, validated, monitored, and optimized in production environments
  • strong systems thinking
  • deep experience with production ML infrastructure
  • ability to drive architectural improvements across teams
  • Strong experience building and operating production-grade online ML inference systems.
  • Experience with model serving frameworks such as NVIDIA Triton Inference Server, TorchServe, Ray Serve, TensorFlow Serving, or similar systems.
  • Experience optimizing inference workloads using techniques such as dynamic batching, model compilation, quantization, GPU acceleration, GPU kernel optimization, caching, or runtime tuning.
  • Strong experience with distributed systems, Kubernetes, autoscaling, service reliability, and production observability.
  • Strong programming skills in Python, with practical experience working on production ML systems and high-scale services.
  • Experience with PyTorch and modern model deployment workflows, including model packaging, validation, and serving lifecycle management.
  • Experience designing infrastructure for safe model rollout, canary testing, A/B experimentation, and automated rollback.
  • Strong systems thinking, with the ability to reason about latency, throughput, reliability, scalability, and cost tradeoffs in online systems.
  • Proven ability to lead technical direction and influence architectural decisions across teams without formal authority.

Other signals

  • online ML systems
  • production models at scale
  • low-latency inference
  • large-scale experimentation
  • model deployment and optimization
  • feature processing
  • business-critical decisioning
  • inference platform
  • reliable, scalable, observable, and cost-efficient
  • online model inference platform
  • serving machine learning models in production
  • optimizing inference performance
  • safe, efficient experimentation
  • high-traffic online systems
  • deploy, scale, monitor, and iterate efficiently
  • packaged, served, validated, monitored, and optimized
  • systems thinking
  • production ML infrastructure
  • architectural improvements