Engineering Manager (AI Inference)

Perplexity · AI Frontier · San Francisco, CA · AI

Engineering Manager for the AI Inference team, which builds and scales the infrastructure behind Perplexity's AI capabilities. The team deploys and optimizes large sparse/mixture-of-experts (MoE) models, develops inference APIs, and improves system reliability and observability.

What you'd actually do

  1. Lead and grow a high-performing team of AI inference engineers
  2. Develop APIs for AI inference used by both internal and external customers
  3. Architect and scale our inference infrastructure for reliability and efficiency
  4. Benchmark and eliminate bottlenecks throughout our inference stack
  5. Drive large sparse/MoE model inference at rack scale, including sharding strategies for massive models

Skills

Required

  • 5+ years of engineering experience
  • 2+ years in a technical leadership or management role
  • Deep experience with ML systems and inference frameworks (PyTorch, TensorFlow, ONNX, TensorRT, vLLM)
  • Strong understanding of LLM architecture: Multi-Head Attention, Multi-Query and Grouped-Query Attention, and other common layers
  • Experience with inference optimizations: batching, quantization, kernel fusion, FlashAttention
  • Familiarity with GPU characteristics, roofline models, and performance analysis
  • Experience deploying reliable, distributed, real-time systems at scale
  • Track record of building and leading high-performing engineering teams
  • Experience with parallelism strategies: tensor parallelism, pipeline parallelism, expert parallelism
  • Strong technical communication and cross-functional collaboration skills

Nice to have

  • Experience with CUDA, Triton, or custom kernel development
  • Background in training infrastructure and RL workloads
  • Experience with Kubernetes and container orchestration at scale
  • Published work or contributions to inference optimization research

What the JD emphasized

  • build and scale the infrastructure
  • architect and scale
  • large-scale deployment
  • building and leading high-performing engineering teams
  • deploying reliable, distributed, real-time systems at scale

Other signals

  • scaling inference infrastructure
  • deploying large sparse/MoE models
  • optimizing LLM inference