Perception Algorithm Engineer

Apple · Big Tech · Cupertino, CA +1 · Machine Learning and AI

This role focuses on designing and implementing real-time multi-object tracking systems using deep learning and multimodal estimates for computer vision problems within Apple products. It involves developing evaluation frameworks, curating datasets, and integrating perception systems into a larger software stack, with a strong emphasis on robotics and state-estimation pipelines.

What you'd actually do

  1. Design and implement a robust, real-time multi-object tracking system to solve real-world computer vision problems.
  2. Leverage multimodal estimates (vision, audio, etc.) to ensure robust, high-fidelity estimation in complex and challenging environments.
  3. Develop rigorous evaluation frameworks, curate datasets, and define metrics to benchmark model performance, analyze edge cases, and continuously improve perception pipelines.
  4. Integrate perception systems into a larger software stack under real-world performance constraints.

Skills

Required

  • C++ or Swift
  • Python
  • PyTorch or JAX
  • Machine Learning
  • Computer Vision
  • State Estimation

Nice to have

  • Deep Learning
  • Multi-object Tracking
  • Multimodal Fusion
  • Audio Processing
  • Robotics Software Stack
  • Kinematics
  • Planning
  • Controls
  • SLAM
  • Factor Graphs
  • Filtering
  • Sensor Fusion
  • Reinforcement Learning
  • Applied Math
  • Numerical Optimization
  • Geometry
  • Graphics
  • Swift
  • Apple developer tools
  • VLMs
  • VLAs
  • Foundation models
  • Self-supervision
  • Distillation
  • Data Augmentation
  • Reconstruction pipelines
  • Image Processing
  • Camera Systems
  • Computational Photography
  • DSP
  • Echo Cancellation
  • Audio-visual Diarization
  • Speech Recognition

What the JD emphasized

  • PhD in Computer Science or Robotics, or MS with 3+ years of industry experience
  • systems programming (C++/Swift)
  • Python and modern ML frameworks (e.g., PyTorch, JAX)
  • machine learning and traditional perception and state-estimation pipelines
  • building and/or deploying on-device computer vision models or multi-object tracking systems
  • machine learning approaches and architectures (e.g., VLMs, VLAs, foundation models, self-supervision, distillation, or data augmentation techniques)
  • multimodal data fusion across a variety of inputs and sensors, including audio processing
  • broader robotics software stack (e.g., kinematics, planning, controls) alongside state estimation methods (e.g., SLAM, factor graphs, filtering, sensor fusion) and reinforcement learning methods

Other signals

  • real-time multi-object tracking
  • multimodal estimates
  • deep learning
  • computer vision
  • robotics software stack