Software Engineer, Core Network Engineering

OpenAI · AI Frontier · San Francisco, CA · Scaling

Software Engineer role focused on the end-to-end networking stack for OpenAI's AI infrastructure, ensuring network performance and reliability for large-scale training and inference workloads. Responsibilities include designing, building, and operating networking systems, improving performance, developing automation, and building observability tools. Requires experience with large-scale networking, distributed systems, Linux networking, and high-performance networking technologies such as InfiniBand and RoCE.

What you'd actually do

  1. Design, build, and operate networking systems that support large-scale AI training and inference infrastructure
  2. Improve performance, reliability, and scalability across host networking, datacenter fabrics, and WAN systems
  3. Develop automation for provisioning, configuration management, validation, upgrades, and lifecycle management of networking infrastructure
  4. Build tooling and observability systems for network health, performance analysis, debugging, and automated remediation
  5. Optimize network performance across technologies such as RDMA, RoCE, InfiniBand, Ethernet, and high-performance GPU interconnects

Skills

Required

  • building or operating large-scale networking or distributed systems infrastructure
  • Linux networking
  • kernel systems
  • NICs
  • RDMA
  • performance-sensitive infrastructure software
  • high-performance networking technologies such as InfiniBand, RoCE, DPDK, or large-scale Ethernet fabrics
  • datacenter networking
  • WAN systems
  • host networking stacks
  • debugging complex systems and performance bottlenecks
  • production software in languages such as C++, Python, or Go
  • strong systems fundamentals across networking, operating systems, distributed systems, or infrastructure engineering

Nice to have

  • operating close to the hardware/software boundary

What the JD emphasized

  • networking is never the bottleneck
  • predictable, high-throughput, low-latency connectivity
  • microseconds of latency
  • tail performance
  • network reliability
  • performance-critical infrastructure problems at massive scale