Performance Modeling Lead

OpenAI · AI Frontier · San Francisco, CA · Scaling

Lead a team to build and own a performance modeling framework for AI infrastructure systems, analyzing tradeoffs across compute, memory, networking, and storage to guide architectural decisions and influence vendor roadmaps. Requires deep knowledge of AI/ML workloads (training/inference) and large-scale distributed systems.

What you'd actually do

  1. Build and own a performance modeling framework/toolchain to evaluate AI systems across multiple levels of abstraction.
  2. Analyze and quantify architectural tradeoffs across compute, memory, networking, storage, and system topology.
  3. Develop performance models to guide decisions on scale-up vs. scale-out architectures, interconnect and network design, memory hierarchy, and system balance.
  4. Translate modeling outputs into clear recommendations for internal teams and external hardware vendors.
  5. Influence reference designs and vendor roadmaps through data-driven insights.
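
As a flavor of the analytical modeling the role describes, here is a minimal roofline-style sketch. All numbers are hypothetical (not any vendor's specs), and `attainable_tflops` is an illustrative helper, not a named tool from the posting; it shows how a first-order model flags whether a workload is compute-bound or memory-bandwidth-bound on a candidate system.

```python
def attainable_tflops(peak_tflops: float, mem_bw_tbps: float,
                      arithmetic_intensity: float) -> float:
    """Roofline model: attainable throughput is the lesser of peak compute
    and memory bandwidth times arithmetic intensity (FLOPs per byte)."""
    return min(peak_tflops, mem_bw_tbps * arithmetic_intensity)

# Hypothetical accelerator: 1000 TFLOP/s peak, 3 TB/s HBM bandwidth.
PEAK, BW = 1000.0, 3.0

# A large matmul (high FLOPs/byte) saturates compute...
print(attainable_tflops(PEAK, BW, 500.0))   # compute-bound: 1000.0
# ...while an elementwise op (low FLOPs/byte) is bandwidth-limited.
print(attainable_tflops(PEAK, BW, 0.25))    # memory-bound: 0.75
```

Real framework work extends this kind of single-node bound with network, storage, and topology terms to compare scale-up against scale-out designs.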

Skills

Required

  • Performance modeling framework development
  • AI/ML workload understanding (training/inference)
  • System-level tradeoff analysis (compute, memory, networking)
  • Large-scale distributed systems
  • Modeling (analytical or simulation)
  • Translating analysis into recommendations
  • Influencing internal teams and external partners
  • Leadership and team management

Nice to have

  • Hardware vendor experience
  • Data center infrastructure or hyperscale systems
  • Accelerators (GPUs/ASICs) and interconnects
  • Influencing hardware roadmaps or reference architectures
  • Mentoring engineers

What the JD emphasized

  • Owning or building performance modeling frameworks used to drive real system design decisions
  • Deep knowledge of AI/ML workloads, including training and/or inference at scale
  • System-level tradeoffs across compute, memory, and networking in large-scale distributed systems
  • Modeling (analytical or simulation) to inform architectural decisions
  • Performance modeling framework/toolchain
  • AI systems
  • Architectural tradeoffs
  • Performance models
  • System balance
  • Hardware vendors
  • Workload characteristics

Other signals

  • performance modeling
  • AI infrastructure
  • system architecture
  • quantitative analysis
  • workload characteristics