CPU Storage Tech Lead

OpenAI · AI Frontier · San Francisco, CA · Scaling

This role focuses on the physical infrastructure for large-scale AI systems, specifically the CPU and storage architecture strategy for compute clusters. It involves evaluating hardware, defining storage systems, and optimizing server platforms for AI training and inference, working closely with hardware vendors and cross-functional teams.

What you'd actually do

  1. Own CPU and storage technical strategy for Stargate compute infrastructure across current and future generations.
  2. Evaluate CPU platforms across performance, efficiency, memory bandwidth, PCIe topology, cost, and roadmap alignment.
  3. Define storage architectures for AI environments, including boot media, local NVMe, shared storage, caching tiers, metadata services, and high-performance data pipelines.
  4. Drive server platform decisions involving CPU, memory, NIC, GPU, and storage subsystem integration.
  5. Partner with performance modeling teams to quantify tradeoffs across compute, memory, I/O, and storage bottlenecks.

Skills

Required

  • Bachelor’s degree in Computer Engineering, Electrical Engineering, Computer Science, or a related technical field
  • 10+ years of experience in server hardware, systems architecture, data center infrastructure, or hyperscale compute platforms
  • Deep expertise in modern CPU architectures (x86, ARM, accelerator host systems) and server platform design
  • Strong understanding of memory systems, PCIe/CXL fabrics, NUMA behavior, and platform-level performance constraints
  • Experience with storage systems including NVMe, SSD qualification, RAID, distributed storage, object/file systems, or high-performance data pipelines
  • Experience evaluating hardware tradeoffs across performance, cost, power, thermals, and supply availability
  • Experience working directly with OEMs, ODMs, silicon vendors, or storage vendors
  • Strong systems thinking with ability to connect component decisions to fleet-level outcomes
  • Excellent communication skills with the ability to influence engineering and executive stakeholders
  • Proven ability to operate in fast-moving, ambiguous environments with high ownership

Nice to have

  • Familiarity with GPU clusters and AI training/inference infrastructure (strongly preferred)
  • Familiarity with CPU vendor roadmaps across AMD, Intel, and ARM ecosystems
  • Experience with distributed storage architectures supporting GPU clusters
  • Knowledge of fleet operations, hardware lifecycle management, and production deployments at scale
  • Prior experience in hyperscale cloud, AI infrastructure, or advanced compute environments

What the JD emphasized

  • AI training/inference infrastructure strongly preferred