What you'd actually do

Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks

Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidth, latency, and collective communication efficiency

Investigate and resolve performance regressions in distributed AI training environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling

Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations

Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents

Skills

Required

profiling and optimizing distributed AI or HPC workloads
GPU interconnects
RDMA networking
collective communication frameworks (NCCL or MPI)
debugging complex performance issues
performance monitoring systems
telemetry pipelines
alerting for large-scale infrastructure
driving cross-functional technical projects
systems software development in C++
machine learning frameworks (PyTorch and TensorFlow)
RDMA congestion control mechanisms (InfiniBand and RoCE networks)
AI training workloads
integrating AI tools to optimize/redesign workflows
responsible, ethical AI practices
AI skill development (prompt/context engineering, agent orchestration)

Nice to have

HPC systems
network fabric design
distributed computing
frontier model development
AI supercomputing infrastructure
prompt/context engineering
agent orchestration

What the JD emphasized

drive end-to-end performance characterization, bottleneck analysis, and optimization

ensure Meta's HPC systems deliver maximum throughput and efficiency

investigate and resolve performance regressions

define performance requirements

Build tooling and automation

Establish service level objectives

Lead technical design reviews

Mentor other engineers

Leverage AI-assisted workflows

Experience profiling and optimizing distributed AI or HPC workloads

Experience debugging complex, non-reproducible performance issues

Experience designing and implementing performance monitoring systems

Experience driving cross-functional technical projects

Experience in developing systems software in languages like C++

Understanding of the latest artificial intelligence (AI) technologies

Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact

Experience adhering to and implementing responsible, ethical AI practices

Demonstrated ongoing AI skill development

Meta is building some of the world's largest AI and high-performance computing infrastructure to power next-generation AI research and products. As an AI/HPC System Performance Engineer on the Network Infrastructure Engineering team, you will drive end-to-end performance characterization, bottleneck analysis, and optimization of large-scale AI training and inference clusters. In this role, you will work at the intersection of network fabric design, distributed computing, and AI workload behavior to ensure Meta's HPC systems deliver maximum throughput and efficiency for frontier model development.

Responsibilities

Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks

Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidth, latency, and collective communication efficiency

Investigate and resolve performance regressions in distributed AI training environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling

Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations

Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure

Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents

Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets

Lead technical design reviews for network and system architecture changes affecting AI workload performance, communicating trade-offs clearly to engineering and product stakeholders

Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices

Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand coverage across the HPC stack

Qualifications

Experience profiling and optimizing distributed AI or HPC workloads, including familiarity with GPU interconnects, RDMA networking, and collective communication frameworks such as NCCL or MPI

Experience debugging complex, non-reproducible performance issues across multi-layer systems including network fabric, operating system, and application layers

Experience designing and implementing performance monitoring systems, including instrumentation, telemetry pipelines, and alerting for large-scale infrastructure

Experience driving cross-functional technical projects from requirements definition through production deployment, including communicating performance findings and trade-offs to diverse stakeholders

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience

6+ years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments

Experience in developing systems software in languages like C++

Experience with machine learning frameworks such as PyTorch and TensorFlow

Understanding of RDMA congestion control mechanisms on IB and RoCE Networks

Understanding of the latest artificial intelligence (AI) technologies

Understanding of AI training workloads and demands they exert on networks

Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)

Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)

Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies

AI/HPC System Performance Engineer
