Software Engineer: Network (C++)

xAI · AI Frontier · Palo Alto, CA +1 · Engineering

Develops core networking software for a massive GPU cluster that powers AI models, focusing on a high-performance, low-latency datacenter fabric that supports training efficiency and reliability.

What you'd actually do

  1. Develop routing and traffic-engineering algorithms for the Colossus high-performance datacenter network.
  2. Develop highly reliable, real-time software designed to run on the switches that form the backbone of our low-latency, high-bandwidth AI training fabric.
  3. Participate in and lead architecture, design, and code reviews.
  4. Develop prototypes and run experiments to validate key design decisions at both small and full-cluster scale.
  5. Build tools for software development, deployment, data analysis, visualization, and testing across virtualized environments, hardware-in-the-loop setups, and live production clusters.

Skills

Required

  • Bachelor’s degree in computer science, engineering, math, or a related technical discipline, or 2+ years of professional software development experience in lieu of a degree.
  • Strong development experience in C or C++.

Nice to have

  • Strong professional experience writing high-performance C/C++ in production environments.
  • Experience developing, debugging, and deploying software that runs at scale in real-world systems.
  • Deep knowledge of networking protocols (UDP, TCP/IP, RDMA, etc.), distributed systems, and large-scale datacenter fabrics.
  • Background in real-time systems, high-performance computing, low-latency networking, or resource-constrained environments.
  • Creative problem-solving ability with exceptional analytical skills and strong engineering fundamentals.
  • Excellent written and verbal communication skills.
  • Ability to thrive in a fast-paced, dynamic environment with evolving requirements.
  • Experience with security considerations in large-scale distributed systems.

What the JD emphasized

  • custom, high-performance datacenter network
  • ultra-low latency
  • massive bandwidth
  • hundreds of thousands of GPUs
  • impact training efficiency
  • model convergence
  • push the frontier of AI
  • solve hard problems in distributed systems
  • high-performance networking
  • real-time control
  • one of the largest AI supercomputers on Earth
  • Develop routing and traffic-engineering algorithms
  • Develop highly reliable, real-time software
  • low-latency, high-bandwidth AI training fabric
  • Develop prototypes and run experiments
  • full-cluster scale
  • live production clusters
  • high-performance C/C++ in production environments
  • software that runs at scale in real-world systems
  • Deep knowledge of networking protocols
  • distributed systems
  • large-scale datacenter fabrics
  • real-time systems
  • high-performance computing
  • low-latency networking

Other signals

  • powers Grok and our frontier AI models
  • custom, high-performance datacenter network that delivers ultra-low latency and massive bandwidth across hundreds of thousands of GPUs
  • impact training efficiency, model convergence, and the speed at which we can push the frontier of AI
  • solve hard problems in distributed systems, high-performance networking, and real-time control of one of the largest AI supercomputers on Earth