Senior Machine Learning Engineer, On-device & Mobile AI Optimization

Unity · Enterprise · Mountain View, CA · AI & Machine Learning

Senior Machine Learning Engineer role focused on optimizing and deploying state-of-the-art multi-modal AI models (transformers, diffusion, VLMs) for on-device execution on mobile and constrained hardware. The work spans model export, quantization, graph transformation, operator fusion, kernel-level tuning (WebGPU, Metal, Vulkan, CUDA), and integration with game engines, with the goal of low latency, a small memory footprint, and reliable performance within strict budgets.

What you'd actually do

  1. Own the optimization pipeline for the models you ship: model export, graph transformation, operator fusion, memory-layout planning, and hardware-specific tuning across NPU, mobile GPU, and desktop/laptop GPU.
  2. Apply quantization (INT4/INT8/FP16), weight sharing, structured/unstructured pruning, and knowledge distillation to hit hard latency, memory, and power budgets, and validate the optimized models against quality bars.
  3. Do low-level performance work: write and tune WebGPU compute shaders (WGSL) and, where relevant, native kernels (Metal, Vulkan/SPIR-V compute, CUDA); profile with browser and platform tools (Chrome/Dawn GPU traces, PIX, Instruments/Metal System Trace, Snapdragon Profiler, Nsight, RenderDoc), and eliminate bottlenecks at the op and memory-bandwidth level.
  4. Work with WebGPU-targeted inference runtimes (ONNX Runtime Web, Transformers.js, WebLLM, TensorFlow.js) alongside native options (CoreML, ONNX Runtime, TFLite, ExecuTorch), and extend them or build glue code where off-the-shelf options fall short for our diffusion and VLM workloads.
  5. Partner with research scientists to turn novel CV and multi-modal architectures into implementations that are deployable, debuggable, and fast on device.

Skills

Required

  • 5+ years in software/ML engineering, with meaningful time focused on on-device / edge inference or real-time, performance-critical systems.
  • Production deployment of transformer- and/or diffusion-based models (e.g., ViT, Stable Diffusion, CLIP/SigLIP-style encoders) on mobile, desktop, or embedded hardware — shipped, not just prototyped.
  • Hands-on experience with at least one major inference runtime (ONNX Runtime / ORT Web, CoreML, TFLite, ExecuTorch) and a working understanding of operator fusion, memory layout, and runtime scheduling.
  • Low-level performance engineering: solid command of at least one GPU/compute API — WebGPU/WGSL, Metal, Vulkan, D3D12, or CUDA — and the profiling tools to go with it. You can read a frame capture and a kernel trace and reason about where the time and memory go.
  • Working knowledge of model-optimization techniques — quantization (INT4/INT8/FP16), weight sharing, pruning, and distillation — and the judgment to apply them to hit latency and memory budgets. You use them effectively as engineering tools.
  • Understanding of target hardware: mobile SoCs (Apple Neural Engine, Qualcomm Hexagon/Adreno, ARM Mali) and/or desktop/laptop GPUs (Apple Silicon, NVIDIA, AMD, Intel).
  • Strong Python for export pipelines and training-side tooling.
  • Working fluency with the models you deploy — enough to read an architecture, modify it for deployment, and reason about accuracy trade-offs.
  • A collaborative working style: clear communication, reliable delivery, and a willingness to support and learn from teammates.

Nice to have

  • Familiarity with the core languages of a browser-native runtime (TypeScript/JavaScript, WGSL).

What the JD emphasized

  • fast, small, and reliably
  • deeply hands-on role
  • shaving milliseconds and megabytes
  • production deployment
  • shipped, not just prototyped
  • low-level performance engineering
  • hit hard latency, memory, and power budgets

Other signals

  • on-device AI
  • inference optimization
  • mobile hardware acceleration
  • quantization
  • model export
  • kernel tuning