Staff Machine Learning Engineer - Computer Vision & Multi-modal AI

Unity · Enterprise · Mountain View, CA · AI & Machine Learning

This Staff ML Engineer role focuses on computer vision and multi-modal AI for game experiences. It involves translating research into production systems, defining modeling and deployment strategies, and leading technical decisions across the ML stack. Responsibilities include designing and implementing models for image and video understanding and generation, optimizing them for a range of deployment targets, and mentoring engineers. The goal is to enhance AI features for billions of players.

What you'd actually do

  1. Help set the technical vision and roadmap for computer vision and multi-modal AI models, spanning transformers, diffusion models, vision-language models, and JEPA-style generative architectures.
  2. Drive design and implementation of models for image and video understanding, generation, segmentation, detection, and dense prediction, as well as multi-modal reasoning over images, text, and 3D inputs.
  3. Make sound decisions on model architecture, training strategy, data pipelines, and evaluation — balancing quality, capability, latency, and cost across deployment targets.
  4. Own the path from research prototype to production: training, fine-tuning, distillation, export, and serving, with deployment spanning cloud GPUs through to efficient on-device inference where the product requires it.
  5. Collaborate directly with research scientists to translate novel CV and multi-modal model architectures into deployable, well-engineered implementations.

Skills

Required

  • 6+ years in ML engineering, with significant depth in computer vision and/or multi-modal modeling.
  • Proven production experience with transformer-based and diffusion-based vision models (e.g., ViT, CLIP/SigLIP-style encoders, Stable Diffusion, DETR/SAM-style architectures).
  • Strong command of the full model lifecycle: data curation, training and fine-tuning, evaluation, and serving at scale.
  • Familiarity with efficient attention, diffusion samplers, multi-modal fusion, and vision-language alignment techniques.
  • Strong Python and modern deep-learning tooling (PyTorch); solid software engineering fundamentals.
  • Track record of technical leadership: setting direction, influencing cross-functional partners, and growing engineers.

Nice to have

  • Experience with world-model, video-generation, or neural rendering pipelines (NeRF, 3DGS, or similar).
  • Experience deploying models to constrained or on-device targets, including quantization (INT8/INT4/FP16), pruning, distillation, and runtimes such as CoreML, TFLite, or ONNX.
  • Familiarity with mobile SoC accelerators (Apple Neural Engine, Qualcomm Hexagon/Adreno, ARM Mali) or compiler stacks such as MLIR, TVM, or XLA.
  • Contributions to open-source ML frameworks or peer-reviewed CV/ML research publications.
  • Background in real-time graphics or game engine pipelines (Metal, Vulkan, OpenGL ES).

What the JD emphasized

  • production-grade systems
  • state-of-the-art computer vision and multi-modal models
  • training, fine-tuning, distillation, export, and serving
  • efficient on-device inference
  • novel CV and multi-modal model architectures
  • multi-modal inference
  • efficient attention
  • compression, quantization, pruning, and knowledge distillation
  • efficient diffusion
  • vision-language pretraining and alignment

Other signals

  • production-grade systems
  • state-of-the-art models
  • billions of players