Principal Machine Learning Engineer, Mobile AI Inference Optimization

Unity · Enterprise · Mountain View, CA · AI & Machine Learning

Principal Machine Learning Engineer focused on optimizing multi-modal AI model inference for on-device mobile deployment at Unity. The role carries technical leadership in model compression, quantization, pruning, knowledge distillation, and inference runtime selection. The engineer will own the end-to-end optimization pipeline, translate research into deployable implementations, and mentor a team. Requires 8+ years of ML engineering experience, with a focus on on-device inference optimization and production deployment of transformer or generative models on mobile hardware.

What you'd actually do

  1. Set the technical vision and roadmap for deploying multi-modal AI models to iOS and Android, spanning transformers, diffusion models, and JEPA-style generative architectures.
  2. Make authoritative decisions on model compression, quantization, pruning, and knowledge distillation strategies to meet mobile latency and memory budgets.
  3. Evaluate and select inference runtimes (e.g., CoreML, ONNX Runtime Mobile, TFLite, ExecuTorch) and drive adoption across the team.
  4. Own the end-to-end optimization pipeline: from model export and graph transformation to hardware-specific kernel tuning on NPU, GPU, and CPU.
  5. Collaborate directly with research scientists to translate novel model architectures into deployable, mobile-optimized implementations.
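As a concrete illustration of the compression work in item 2, here is a minimal sketch of symmetric per-tensor INT8 post-training quantization in plain Python. The function names and toy weights are illustrative, not part of Unity's actual pipeline:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 weights from INT8 codes and the shared scale."""
    return [qi * scale for qi in q]

# Toy weight tensor; real tensors would be per-layer weight matrices
weights = [0.5, -1.0, 0.25, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Reconstruction error is bounded by half a quantization step (scale / 2)
```

Production mobile pipelines typically use per-channel scales, calibration data for activation ranges, and framework tooling (e.g., Core ML Tools or the TFLite converter) rather than hand-rolled code, but the scale-and-round core is the same.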

Skills

Required

  • 8+ years in ML engineering
  • 3+ years focused on on-device / edge inference optimization
  • Production deployment of transformer-based models (e.g., ViT, LLaMA, Stable Diffusion) and/or JEPA-style generative architectures on mobile or embedded hardware
  • Expertise in CoreML, TFLite, ONNX Runtime, and/or ExecuTorch
  • Understanding of operator fusion, memory layout, and runtime scheduling
  • Expertise in INT8/INT4/FP16 quantization, weight sharing, structured/unstructured pruning, and knowledge distillation
  • Understanding of mobile SoC architectures (Apple Neural Engine, Qualcomm Hexagon/Adreno, ARM Mali)
  • Proficiency in C++ / Objective-C / Swift
  • Proficiency in Python
  • Ability to read, implement, and extend ML research papers
  • Familiarity with efficient attention, diffusion samplers, and multi-modal fusion techniques
  • Technical leadership experience

Nice to have

  • Experience shipping world-model or neural rendering pipelines (NeRF, 3DGS, or similar) on mobile
  • Contributions to open-source ML inference frameworks
  • Mobile ML research publications
  • MLIR, TVM, or XLA familiarity
  • Real-time graphics or game engine pipelines (Metal, Vulkan, OpenGL ES) background

What the JD emphasized

  • Proven production deployment of transformer-based models (e.g., ViT, LLaMA, Stable Diffusion) and/or JEPA-style generative architectures on mobile or embedded hardware.
  • Expert-level command of INT8/INT4/FP16 quantization, weight sharing, structured/unstructured pruning, and knowledge distillation.
  • Strong understanding of mobile SoC architectures (Apple Neural Engine, Qualcomm Hexagon/Adreno, ARM Mali) and how to target each for peak throughput.
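For reference, the knowledge-distillation objective named above is compact enough to sketch directly. This follows the standard temperature-softened KL formulation from Hinton et al.; it is an illustration, not the posting's required implementation:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = [l / T for l in logits]
    m = max(z)  # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on T-softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
```

In a student-training loop this term is typically mixed with the ordinary cross-entropy on hard labels; the loss is zero when the student exactly matches the teacher's softened distribution.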

Other signals

  • Deploying state-of-the-art multi-modal models on mobile devices
  • Defining the inference strategy and driving architectural decisions across the full mobile ML stack
  • Mentoring a team of senior and mid-level engineers
  • Optimizing the latency, quality, and power profile of AI-driven features