Training: ML Framework Engineer

OpenAI · AI Frontier · San Francisco, CA · Scaling

This role focuses on improving the core distributed machine-learning training runtime for OpenAI, aiming to accelerate researchers and enable frontier-scale model runs. The engineer will work on high-performance data movement, fault-tolerant training frameworks, and distributed process management to increase both training throughput and researcher throughput.

What you'd actually do

  1. Apply the latest techniques in our internal training framework to achieve impressive hardware efficiency for our training runs
  2. Profile and optimize our training framework
  3. Work with researchers to enable them to develop the next generation of models

Skills

Required

  • Python
  • software engineering skills

Nice to have

  • running small-scale ML experiments
  • figuring out how systems work
  • making those systems faster while minimizing complexity and maintenance burden

What the JD emphasized

  • good engineering
  • writing bug-free machine learning code
  • deep knowledge of the performance of supercomputers
  • optimizing performance
  • understanding distributed systems
  • cannot stand having bugs in their code

Other signals

  • improving the training throughput for our internal training framework
  • enabling researchers to experiment with new ideas
  • achieve impressive hardware efficiency for our training runs