Network Engineer - AI/HPC

xAI · AI Frontier · Palo Alto, CA · Infrastructure

xAI is seeking a Network Engineer with deep experience in RoCEv2 to develop and optimize hyper-scale AI/HPC networks. The role involves designing, operating, and optimizing network performance for AI training and inference workloads, with a focus on NCCL, metric dashboards, and Python automation. The engineer will also contribute to designing future network iterations and participate in on-call rotations and travel.

What you'd actually do

  1. develop hyper-scale AI/HPC networks while optimizing performance and availability.
  2. understand our current network performance and availability, then optimize it for how we train models and execute customer inference queries.
  3. spend most of your days deep inside NCCL, building metric dashboards and tuning configurations to ensure no performance is left on the table.
  4. help design the next iteration of our back-end and front-end networks, which will allow us to seamlessly build out new GPU infrastructure with little to no engineering assistance.
  5. participate in a team on-call rotation and help with other scaling and maintenance efforts.

Skills

Required

  • 10 years designing and operating large-scale networks
  • 5 years in the Ethernet AI/HPC space
  • Deep understanding of congestion control on Ethernet
  • Deep understanding of AI training and inference workloads and how they operate on the network
  • Ability to use and debug NCCL, and potentially to commit to the library
  • Expertise in creating a portfolio of metrics for performance and operations to optimize the fleet for training and inference traffic
  • Experience using Python to automate repetitive tasks and to work with and analyze large data sets in your daily job

Nice to have

  • InfiniBand

What the JD emphasized

  • Deep understanding of AI training and inference workloads and how they operate on the network.
  • Ability to use and debug NCCL, and potentially to commit to the library.