ABOUT xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

SOFTWARE ENGINEER, NETWORK C++ (COLOSSUS)

At xAI, we design, build, and operate Colossus from the ground up. This includes the massive GPU clusters, high-speed interconnect fabric, and the software that makes it all work at unprecedented scale. Colossus powers Grok and our frontier AI models with a custom, high-performance datacenter network that delivers ultra-low latency and massive bandwidth across hundreds of thousands of GPUs.

As a Software Engineer on the Colossus Networking team, you will develop the core networking software that maximizes the performance and reliability of our datacenter fabric. Your work will directly impact training efficiency, model convergence, and the speed at which we can push the frontier of AI.

Our engineers own the full lifecycle of their software, from design and implementation to deployment, monitoring, and iteration based on real-world performance at scale. You will solve hard problems in distributed systems, high-performance networking, and real-time control of one of the largest AI supercomputers on Earth.

RESPONSIBILITIES:

Develop routing and traffic-engineering algorithms for the Colossus high-performance datacenter network.
Develop highly reliable, real-time software designed to run on the switches that form the backbone of our low-latency, high-bandwidth AI training fabric.
Participate in and lead architecture, design, and code reviews.
Develop prototypes and run experiments to validate key design decisions at both small and full-cluster scale.
Build tools for software development, deployment, data analysis, visualization, and testing across virtualized environments, hardware-in-the-loop setups, and live production clusters.
Deploy reliable software updates through continuous integration and release systems with rigorous testing and monitoring.

BASIC QUALIFICATIONS:

Bachelor’s degree in computer science, engineering, math, or a related technical discipline; OR 2+ years of professional software development experience in lieu of a degree.
Strong development experience in C or C++.

PREFERRED SKILLS AND EXPERIENCE:

Strong professional experience writing high-performance C/C++ in production environments.
Experience developing, debugging, and deploying software that runs at scale in real-world systems.
Deep knowledge of networking protocols (UDP, TCP/IP, RDMA, etc.), distributed systems, and large-scale datacenter fabrics.
Background in real-time systems, high-performance computing, low-latency networking, or resource-constrained environments.
Creative problem-solving ability with exceptional analytical skills and strong engineering fundamentals.
Excellent written and verbal communication skills.
Ability to thrive in a fast-paced, dynamic environment with evolving requirements.
Experience with security considerations in large-scale distributed systems.

ADDITIONAL REQUIREMENTS:

Must be willing to work extended hours and weekends as needed.

xAI is an equal opportunity employer. For details on data processing, view our Recruitment Privacy Notice.