Software Engineer, Systems ML - Scaling / Performance

Meta · Big Tech · Menlo Park, CA

Meta's Network.AI team is seeking Software Engineers to enhance the NCCL software stack, crucial for multi-GPU and multi-node distributed ML training. The role focuses on improving the reliability and performance of large-scale AI/GPU communication for Meta-wide ML products, particularly for GenAI/LLM training and inference.

What you'd actually do

  1. Enable reliable and highly scalable distributed ML training on Meta's large-scale GPU training infrastructure, with a focus on GenAI/LLM scaling

Skills

Required

  • Distributed ML Training
  • GPU architecture
  • ML systems
  • AI infrastructure
  • high performance computing
  • performance optimizations
  • Machine Learning frameworks (e.g. PyTorch)
  • CUDA programming
  • DL frameworks like PyTorch, Caffe2 or TensorFlow
  • AI framework and trainer development on accelerating large-scale distributed deep learning models
  • data parallel and model parallel training
  • Distributed Data Parallel
  • Fully Sharded Data Parallel (FSDP)
  • Tensor Parallel
  • Pipeline Parallel
  • HPC and parallel computing
  • ML, deep learning and LLM
  • NCCL and distributed GPU reliability/performance improvement on RoCE/InfiniBand

Nice to have

  • PhD in Computer Science, Computer Engineering, or relevant technical field

What the JD emphasized

  • critical path of multi-GPU distributed training
  • nearly every distributed GPU-based ML workload in Meta Production goes through the SW stack the team owns
  • GenAI/LLM scaling reliability and performance

Other signals

  • enabling large-scale GPU training and inference fleet
  • improving full-stack distributed ML reliability and performance