GPU Software Architecture Engineer, Graphics, Games, & ML

Apple · Big Tech · Cupertino, CA · Software and Services

This role focuses on architecting and building distributed ML infrastructure for large-scale inference, specifically for powering Apple Intelligence. It involves designing parallelization strategies, optimizing the stack from low-level access to high-level algorithms, and collaborating with hardware architects. The goal is to achieve maximum hardware utilization and minimize latency for real-time user experiences, serving billions of requests daily.

What you'd actually do

  1. Design and implement tensor/data/expert parallelism strategies for large language model inference across distributed server clusters
  2. Drive hardware and software roadmap decisions for ML acceleration
  3. Design architectures that achieve peak compute utilization and optimal memory throughput
  4. Develop and optimize distributed inference systems with focus on latency, throughput, and resource efficiency across multiple nodes
  5. Architect scalable ML serving infrastructure supporting dynamic model sharding, load balancing, and fault tolerance
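As context for item 1, here is a minimal sketch (not from the posting) of what tensor parallelism means for inference: a single layer's weight matrix is sharded across devices, each device computes a partial output, and a gather step reassembles the full result. Two NumPy arrays stand in for two devices; all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of activations
W = rng.standard_normal((8, 16))   # full weight matrix of one linear layer

# Tensor parallelism: shard the columns of W across 2 simulated "devices".
shards = np.split(W, 2, axis=1)

# Each device computes only its local slice of the layer output.
partials = [x @ w for w in shards]

# An "all-gather" (here, a concatenate) rebuilds the full output.
y_parallel = np.concatenate(partials, axis=1)

# The sharded computation matches the unsharded layer exactly.
y_full = x @ W
assert np.allclose(y_parallel, y_full)
```

In a real system the concatenate would be a collective (e.g. an NCCL all-gather) across GPUs, and the sharding scheme (column vs. row, plus data and expert parallelism) is exactly the design space this role owns.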

Skills

Required

  • GPU programming (CUDA, ROCm)
  • High-performance computing
  • Inter-node communication technologies (InfiniBand, RDMA, NCCL)
  • System programming in C/C++
  • Distributed systems
  • Parallel computing architectures
  • Tensor frameworks (PyTorch, JAX, TensorFlow)

Nice to have

  • Python
  • ML infrastructure at scale

What the JD emphasized

  • 10+ years of experience in GPU programming (CUDA, ROCm) and high-performance computing, successfully optimizing large-scale parallel workloads
  • Strong experience with inter-node communication technologies (InfiniBand, RDMA, NCCL) in the context of ML training/inference
  • Must have excellent system programming skills in C/C++
  • Deep understanding of distributed systems and parallel computing architectures
  • Understand how tensor frameworks (PyTorch, JAX, TensorFlow) are used in distributed training/inference
  • Proven track record in ML infrastructure at scale

Other signals

  • distributed ML infrastructure
  • Apple Intelligence
  • massive network models
  • server clusters
  • real-time user experiences
  • inference workload characteristics
  • billions of requests daily
  • production systems