Technical Lead Manager - Training Runtime, Data(set) Movement

OpenAI · AI Frontier · San Francisco, CA · Scaling

This is a Technical Lead Manager role on Training Runtime, focused on the Data Movement area. The role owns the infrastructure that supplies training jobs with data and manages model state during large-scale training runs. The work includes designing and building a unified dataset read platform, defining its APIs and storage contracts, and ensuring reliability and reproducibility.

What you'd actually do

  1. Design and build a unified dataset read platform for multiple current and future training frameworks.
  2. Define dataset APIs, storage-format expectations, registration/versioning, and migration paths that make data access reproducible and maintainable.
  3. Build reliability into the read path, including stateful iteration, caching, fast restart, recovery, and clear operational contracts.
  4. Build terminal and web-based visualizers that let teams inspect text, multimodal, and reinforcement learning data late in the pipeline, where bugs are most visible.
  5. Write and review production code in core data loading, service, caching, and reliability paths.
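
To make "stateful iteration" and "fast restart" in item 3 concrete, here is a minimal sketch of a checkpointable per-worker iterator. It is an illustration only, not the platform this role describes: the names `StatefulShardIterator`, `IteratorState`, and `state_dict`, and the strided sharding scheme, are all hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Dict, List, Optional


@dataclass
class IteratorState:
    """Hypothetical checkpointable position: epoch number and offset within the shard."""
    epoch: int = 0
    offset: int = 0


class StatefulShardIterator:
    """Iterates one worker's share of the records and can resume from a saved state."""

    def __init__(self, records: List[int], rank: int, world_size: int,
                 state: Optional[Dict[str, int]] = None):
        # Assumed sharding: worker `rank` reads records rank, rank + world_size, ...
        self._shard = records[rank::world_size]
        self._state = IteratorState(**state) if state is not None else IteratorState()

    def __iter__(self):
        return self

    def __next__(self) -> int:
        if not self._shard:
            raise StopIteration
        # Wrap to the next epoch once the shard is exhausted.
        if self._state.offset >= len(self._shard):
            self._state.epoch += 1
            self._state.offset = 0
        item = self._shard[self._state.offset]
        self._state.offset += 1
        return item

    def state_dict(self) -> Dict[str, int]:
        # A plain dict so it can be stored alongside a model checkpoint.
        return asdict(self._state)


def take(it: StatefulShardIterator, n: int) -> List[int]:
    """Pull the next n records from the iterator."""
    return [next(it) for _ in range(n)]
```

Saving `state_dict()` with the model checkpoint and passing it back to the constructor lets a restarted worker continue from the same record instead of replaying or skipping data. A production read path would also have to handle shuffling, prefetch buffers, and changes in worker count, which this sketch ignores.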

Skills

Required

  • Python
  • distributed systems
  • data loading
  • storage infrastructure
  • large-scale training infrastructure
  • API design
  • debugging
  • performance optimization
  • correctness
  • stateful iterators
  • checkpoint/restart semantics
  • caching
  • remote services
  • high-throughput storage reads
  • multimodal data pipelines
  • video data pipelines
  • reinforcement learning data pipelines
  • pretraining data pipelines

Nice to have

  • Rust
  • C++

What the JD emphasized

  • deeply hands-on
  • primary technical owner
  • deceptively hard at frontier scale
  • make enormous, heterogeneous datasets easy to consume, correct across distributed workers, observable when something goes wrong, and flexible enough to support pretraining, reinforcement learning, and multimodal training
  • own fast, correct, scalable, and reliable in-cluster data movement for training
  • After ramping on datasets, this role will expand to TLM ownership for broader data movement systems
  • lead through code and technical judgment before a team exists, and can later manage engineers without losing the hands-on edge

Other signals

  • distributed systems
  • large-scale training
  • data infrastructure
  • reliability engineering