Senior Machine Learning Engineer - Model Inference

Apple · Big Tech · Cupertino, CA +1 · Software and Services

This role focuses on building and optimizing large-scale, high-performance ML inference services for Apple Maps, serving both deep learning models and large language models. Responsibilities include owning the technical architecture; system-level optimization across latency, throughput, accuracy, and cost; control-plane services for model lifecycle management; and inference optimization across heterogeneous compute environments. The role requires expertise in deploying and optimizing LLMs for production inference, along with proficiency in Python/Java/C++, deep learning frameworks, model serving tools, optimization techniques, and cloud technologies such as Kubernetes.

What you'd actually do

  1. Own the technical architecture of large-scale ML inference platforms, defining long-term design direction for serving deep learning and large language models across Apple Maps.
  2. Lead system-level optimization efforts across the inference stack, balancing latency, throughput, accuracy, and cost through advanced techniques such as quantization, kernel fusion, speculative decoding, and efficient runtime scheduling.
  3. Design and evolve control-plane services responsible for model lifecycle management, including deployment orchestration, versioning, traffic routing, rollout strategies, capacity planning, and failure handling in production environments.
  4. Drive adoption of platform abstractions and standards that enable partner teams to onboard, deploy, and operate models reliably and efficiently at scale.
  5. Partner closely with research, product, and infrastructure teams to translate model requirements into production-ready systems, providing technical guidance and feedback to influence upstream model design.

Skills

Required

  • ML inference
  • GPU acceleration
  • large-scale systems
  • deploying and optimizing LLMs
  • Python
  • Java
  • C++
  • PyTorch
  • TensorFlow
  • Hugging Face Transformers
  • NVIDIA Triton
  • TensorFlow Serving
  • vLLM
  • Attention Fusion
  • Quantization
  • Speculative Decoding
  • CUDA
  • TensorRT-LLM
  • cuDNN
  • Kubernetes
  • Ingress
  • HAProxy

Nice to have

  • ML Ops practices
  • continuous integration
  • deployment pipelines for machine learning models
  • model distillation
  • low-rank approximations
  • model compression techniques
  • distributed systems
  • multi-GPU/multi-node parallelism
  • system-level optimization for large-scale inference

What the JD emphasized

  • 5+ years in software engineering focused directly on ML inference, GPU acceleration, and large-scale systems.
  • Expertise in deploying and optimizing LLMs for high-performance, production-scale inference.

Other signals

  • high-volume, low-latency, highly available production serving
  • large-scale, high-performance inference services
  • optimize inference across heterogeneous accelerated compute hardware