Network Engineer - ML Infrastructure (High-Speed Interconnects)

xAI · AI Frontier · Palo Alto, CA · Infrastructure

This role focuses on designing, building, and optimizing the network fabric for large-scale AI training and inference clusters, with an emphasis on high-speed interconnect technologies spanning both copper and optical solutions. The engineer will own physical-layer and system-level integration as well as vendor management, and will be accountable for the performance, power efficiency, and cost of these critical AI infrastructure components.

What you'd actually do

  1. Design, validate, and productize high-speed copper and optical connectivity solutions for AI clusters (100k+ GPU scale).
  2. Own vendor due diligence and onboarding for new 1.6T products, including AECs and pluggable optical transceivers (DR4/DR8, FR4), with rigorous bring-up and characterization.
  3. Investigate opportunities for LPO (linear pluggable optics) and LRO (linear receive optics) in our network.
  4. Evaluate early co-packaged and near-packaged engines for switches and GPUs.
  5. Pathfind new interconnect modalities, including VCSEL-, microLED-, and THz radio-based solutions, to improve network economics and reliability.

Skills

Required

  • Designing, deploying, and operating high-speed copper and optical interconnects
  • Experience in a module design role or in a hyperscale datacenter environment
  • Deep knowledge of PAM4 SerDes performance, equalization, jitter, crosstalk
  • Solid operational understanding of FEC, retimers, TIAs, and drivers
  • Deep knowledge of optical link budget analysis and performance metrics
  • Expertise in transceiver components
  • Knowledge of thermal, mechanical, power, signal integrity constraints
  • Knowledge of SiPh design process, yield improvement and reliability testing
  • Familiarity with CPO technologies and challenges/risk areas
  • Familiarity with subcomponent supply chains and global manufacturers, ODMs and CMs
  • Strong problem-solving skills
  • Ability to thrive in a fast-paced, ambiguous setting
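
The optical link budget analysis called out in the Required list boils down to power accounting: transmit power minus accumulated losses and penalties must exceed receiver sensitivity with positive margin. Below is a minimal, generic sketch of that arithmetic; all component values and the helper name `link_margin_db` are illustrative assumptions, not figures from this posting.

```python
# Illustrative optical link budget: margin (dB) is what remains of the
# transmit power after all losses, relative to the receiver sensitivity.
# Every dBm/dB value here is a hypothetical example.

def link_margin_db(tx_power_dbm: float,
                   rx_sensitivity_dbm: float,
                   losses_db: list[float]) -> float:
    """Margin = Tx power - total link loss - Rx sensitivity."""
    return tx_power_dbm - sum(losses_db) - rx_sensitivity_dbm

# Example: a short-reach single-lane link (values are made up).
losses = [
    1.5,  # Tx coupling / connector loss
    0.5,  # fiber attenuation (~0.5 dB over the span, illustrative)
    1.0,  # Rx coupling / connector loss
    1.0,  # lumped penalties (dispersion, crosstalk, aging)
]
margin = link_margin_db(tx_power_dbm=1.0,
                        rx_sensitivity_dbm=-6.0,
                        losses_db=losses)
print(f"Link margin: {margin:.1f} dB")  # positive margin => link closes
```

In practice the same accounting is done per lane and per corner (temperature, aging, worst-case connectors), and FEC gain shifts the effective sensitivity threshold.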

Nice to have

  • end-to-end ownership, from design and development through build and operations
  • physical layer and system-level integration
  • matching ML cluster requirements with cutting-edge interconnect hardware
  • large-scale AI systems and the physics/engineering of 200G+ SerDes, PAM4, photonics, signal integrity and diagnostics
  • vendor due diligence and onboarding
  • rigorous bring-up & characterization
  • pathfinding for new interconnect modalities
  • influence roadmaps
  • translate workload communication patterns into concrete interconnect topology and optical reconfigurability requirements
  • system-level simulation
  • failure analysis, root cause, and corrective actions
  • fleet-level metrics gathering and analysis
  • internal tooling and automation
  • interconnect health monitoring, telemetry, diagnostics, remediation and automated qualification pipelines
  • industry standards (OIF CMIS, IEEE)
  • emerging technologies (multi-core/hollow-core fiber, 448G SerDes, TFLN, ring resonators)

What the JD emphasized

  • high-speed interconnects
  • AI training and inference clusters
  • 100k+ GPU scale
  • 1.6T products
  • LPO and LRO
  • co-packaged and near-packaged engines
  • VCSEL, microLED, THz radio-based solutions
  • next-gen solutions
  • workload communication patterns
  • end-to-end fabric performance
  • interconnect-related issues
  • automated qualification pipelines
  • industry standards
  • emerging technologies
  • 8+ years of hands-on experience
  • high-speed copper and optical interconnects
  • module design role
  • hyperscale datacenter environment
  • PAM4 SerDes performance
  • FEC, Retimers, TIAs and Drivers
  • optical link budget analysis
  • transceiver components
  • SiPh design process
  • CPO technologies
  • subcomponent supply chains
  • fast-paced, ambiguous setting