Staff AI Inference and Acceleration Engineer

Figure AI · Robotics · HQ · Platform Software

Figure AI, a robotics company building humanoid robots, is hiring a Staff AI Inference and Acceleration Engineer. The role focuses on optimizing on-board AI inference for real-time execution on heterogeneous hardware, managing compute budgets, and reducing power consumption and cost. Responsibilities include mapping models to accelerators, applying compression techniques, profiling pipelines, and collaborating with the AI/ML and Platform Software teams.

What you'd actually do

  1. Own the on-board inference architecture — mapping models to available accelerators (NPU, GPU, DSP, CPU) based on latency, power, and memory budgets.
  2. Partition inference workloads across heterogeneous compute resources, balancing real-time performance with power and thermal constraints.
  3. Define and maintain a system-level compute budget across all inference tasks running on the robot.
  4. Evaluate next-generation acceleration hardware and contribute to the definition of future compute platform requirements.
  5. Optimize inference toolchains end-to-end — from model export through runtime execution — for target hardware.
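To make the first three responsibilities concrete, here is a minimal sketch of how models might be placed on accelerators under latency, memory, and power budgets. It is an illustration only, not Figure AI's method: the names `Profile` and `assign_models`, the greedy lowest-power policy, and every number are assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Cost of one model on one accelerator (hypothetical, illustrative values)."""
    latency_ms: float
    power_w: float
    memory_mb: float


def assign_models(profiles, latency_budget_ms, memory_capacity_mb, power_budget_w):
    """Greedy placement: handle the tightest latency budget first and give each
    model the lowest-power accelerator that meets its latency budget and still
    has enough free memory. Raises ValueError if any budget cannot be met."""
    free_mb = dict(memory_capacity_mb)  # remaining memory per accelerator
    placement = {}
    total_power_w = 0.0
    for model in sorted(profiles, key=lambda m: latency_budget_ms[m]):
        candidates = [
            (p.power_w, acc)
            for acc, p in profiles[model].items()
            if p.latency_ms <= latency_budget_ms[model] and p.memory_mb <= free_mb[acc]
        ]
        if not candidates:
            raise ValueError(f"no accelerator satisfies the budgets for {model}")
        power_w, acc = min(candidates)  # lowest power wins
        placement[model] = acc
        free_mb[acc] -= profiles[model][acc].memory_mb
        total_power_w += power_w
    if total_power_w > power_budget_w:
        raise ValueError("placement exceeds the system power budget")
    return placement, total_power_w


# Hypothetical profiling data for two models on a robot.
PROFILES = {
    "policy": {
        "GPU": Profile(4.0, 12.0, 900),
        "NPU": Profile(6.0, 3.0, 600),
        "CPU": Profile(40.0, 5.0, 400),
    },
    "vision": {
        "GPU": Profile(8.0, 15.0, 1200),
        "NPU": Profile(12.0, 4.0, 800),
    },
}
PLACEMENT, TOTAL_POWER_W = assign_models(
    PROFILES,
    latency_budget_ms={"policy": 5.0, "vision": 15.0},
    memory_capacity_mb={"GPU": 2048, "NPU": 1024, "CPU": 4096},
    power_budget_w=20.0,
)
print(PLACEMENT, TOTAL_POWER_W)
```

In this made-up scenario the policy model needs the GPU to meet its 5 ms budget, which leaves too little GPU memory for the vision model, so vision falls back to the NPU. A real system would likely replace the greedy pass with a solver and account for thermal limits and contention between tasks.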

Skills

Required

  • hardware acceleration
  • ML systems
  • compute architecture
  • AI/ML inference
  • model formats (ONNX, TFLite, etc.)
  • inference runtimes
  • deployment pipelines
  • optimizing models for edge or embedded hardware
  • quantization
  • pruning
  • operator-level tuning
  • computer architecture
  • memory hierarchies
  • data movement
  • heterogeneous compute
  • profiling and benchmarking inference workloads
  • CPU
  • GPU
  • NPU
  • DSP
  • low-level toolchains
  • compilation frameworks (e.g. TVM, MLIR, TensorRT, Torch, SNPE/QNN, JAX, CUDA, ROCm)
  • C++
  • Python
  • cross-functional communication

Nice to have

  • real-time operating constraints
  • inference scheduling
  • co-designing model architectures with ML teams

What the JD emphasized

  • Own the on-board inference architecture
  • Optimize inference toolchains end-to-end
  • Apply quantization (INT8, INT4, mixed-precision), pruning, operator fusion, and other compression techniques
  • Profile inference pipelines to identify and eliminate bottlenecks in latency, memory bandwidth, and power consumption
  • Optimize kernel scheduling, memory layout, and data movement across the compute hierarchy
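For readers unfamiliar with the INT8 quantization mentioned above, the sketch below shows the basic idea in plain Python: symmetric per-tensor quantization with a single scale factor. The function names and sample weights are hypothetical, and production toolchains typically use per-channel scales and calibration data instead of this simplified scheme.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]
    using one scale derived from the largest absolute value."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
WEIGHTS = [0.5, -1.27, 0.0, 1.0]
Q, SCALE = quantize_int8(WEIGHTS)
RESTORED = dequantize(Q, SCALE)
print(Q, SCALE, RESTORED)
```

Each restored value differs from the original by at most half the scale, which is the rounding error that INT4 and mixed-precision schemes trade against memory footprint and bandwidth.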

Other signals

  • on-board inference architecture
  • optimize inference toolchains
  • quantization, pruning, operator fusion
  • heterogeneous compute resources
  • real-time autonomous system