Distributed Training Engineer, Sora

OpenAI OpenAI · AI Frontier · San Francisco, CA · Research

OpenAI is seeking a Distributed Training Engineer for their Sora team to improve training throughput for video models. This role involves optimizing the internal training framework, collaborating with researchers, and ensuring hardware efficiency for supercomputer training runs.

What you'd actually do

  1. Collaborate with researchers to enable them to develop systems-efficient video models and architectures
  2. Apply the latest techniques to our internal training framework to achieve impressive hardware efficiency for our training runs
  3. Profile and optimize our training framework

Skills

Required

  • Python
  • software engineering skills

Nice to have

  • multi-modal ML pipelines
  • systems implementations
  • training kernels
  • stable training dynamics

What the JD emphasized

  • improving the training throughput
  • enable researchers to experiment with new ideas
  • optimizing performance
  • understanding distributed systems
  • writing bug-free machine learning code
  • acquiring deep knowledge of the performance of supercomputers
  • love optimizing performance
  • understanding distributed systems
  • cannot stand having bugs in their code
  • achieve impressive hardware efficiency

Other signals

  • improving the training throughput
  • enable researchers to experiment with new ideas
  • optimizing performance
  • understanding distributed systems
  • achieve impressive hardware efficiency