Software Engineer, Collective Communication

OpenAI · AI Frontier · San Francisco, CA · Scaling

A software engineering role focused on the collective communication stack for large-scale AI model training, using C++ and CUDA to optimize network performance on custom-built supercomputers. The role directly supports the training of OpenAI's flagship models and involves close collaboration with ML researchers.

What you'd actually do

  1. Collaborate closely with ML researchers to design and implement efficient collective operations in C++ and CUDA.
  2. Ensure that our largest training jobs take full advantage of the different network transports used in our supercomputers.
  3. Work on simulations to inform our future supercomputer network designs.

Skills

Required

  • C++
  • CUDA
  • distributed algorithms
  • RDMA
  • low-level, performance-sensitive CPU and/or GPU code

Nice to have

  • collective communication

What the JD emphasized

  • collective communication stack
  • training jobs
  • flagship models
  • custom-built supercomputers
  • ML researchers
  • training platform
  • networking transports
  • low-level, performance-critical software
  • distributed algorithms using RDMA
  • low-level, performance-sensitive CPU and/or GPU code
