AI/HPC System Performance Engineer

Meta · Big Tech · Menlo Park, CA

Meta is seeking an AI/HPC System Performance Engineer to optimize the performance of large-scale AI training and inference clusters. The role involves profiling, bottleneck analysis, and building performance frameworks, with a focus on network infrastructure and distributed computing for frontier AI model development. The engineer will collaborate with network infrastructure, hardware, and AI research teams, build tooling, mentor other engineers, and use AI-assisted workflows.

What you'd actually do

  1. Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks
  2. Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidth, latency, and collective communication efficiency
  3. Investigate and resolve performance regressions in distributed AI training environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling
  4. Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations
  5. Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents

Skills

Required

  • profiling and optimizing distributed AI or HPC workloads
  • GPU interconnects
  • RDMA networking
  • collective communication frameworks (NCCL or MPI)
  • debugging complex performance issues
  • performance monitoring systems
  • telemetry pipelines
  • alerting for large-scale infrastructure
  • driving cross-functional technical projects
  • systems software development in C++
  • machine learning frameworks (PyTorch and TensorFlow)
  • RDMA congestion control mechanisms (InfiniBand and RoCE networks)
  • AI training workloads
  • integrating AI tools to optimize/redesign workflows
  • responsible, ethical AI practices
  • AI skill development (prompt/context engineering, agent orchestration)

Nice to have

  • HPC systems
  • network fabric design
  • distributed computing
  • frontier model development
  • AI supercomputing infrastructure
  • prompt/context engineering
  • agent orchestration

What the JD emphasized

  • drive end-to-end performance characterization, bottleneck analysis, and optimization
  • ensure Meta's HPC systems deliver maximum throughput and efficiency
  • investigate and resolve performance regressions
  • define performance requirements
  • Build tooling and automation
  • Establish service level objectives
  • Lead technical design reviews
  • Mentor other engineers
  • Leverage AI-assisted workflows
  • Experience profiling and optimizing distributed AI or HPC workloads
  • Experience debugging complex, non-reproducible performance issues
  • Experience designing and implementing performance monitoring systems
  • Experience driving cross-functional technical projects
  • Experience in developing systems software in languages like C++
  • Understanding of the latest artificial intelligence (AI) technologies
  • Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact
  • Experience adhering to and implementing responsible, ethical AI practices
  • Demonstrated ongoing AI skill development

Other signals

  • large-scale AI training and inference clusters
  • network infrastructure
  • performance characterization
  • bottleneck analysis
  • optimization