Software Engineer, Data Infrastructure - Research

OpenAI · AI Frontier · San Francisco, CA · Scaling

Software Engineer focused on building and scaling dataset infrastructure for LLM training and inference, including multimodal data, at OpenAI.

What you'd actually do

  1. Design and maintain standardized dataset APIs, including for multimodal (MM) data too large to fit in memory.
  2. Build proactive testing and scale-validation pipelines that exercise dataset loading at GPU scale.
  3. Collaborate with teammates to integrate datasets seamlessly into training and inference pipelines, ensuring smooth adoption and a great user experience.
  4. Document and maintain dataset interfaces so they are discoverable, consistent, and easy for other teams to adopt.
  5. Establish safeguards and validation systems to ensure datasets remain reproducible and unchanged once standardized.
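Two of the duties above, a dataset API for data that cannot fit in memory (item 1) and safeguards that keep a standardized dataset unchanged (item 5), can be sketched together. This is a minimal illustration, not OpenAI's actual interface: the class name, shard representation, and fingerprinting scheme are all hypothetical. The idea is to stream shards rather than materialize the dataset, and to expose an order-sensitive content hash so any mutation or reshuffling is detectable.

```python
import hashlib
from typing import Iterable, Iterator


class ShardedDataset:
    """Hypothetical streaming dataset API.

    Shards are yielded one at a time, so the full dataset never has to
    fit in memory; a content fingerprint supports reproducibility checks.
    """

    def __init__(self, shards: Iterable[bytes]):
        self.shards = list(shards)

    def __iter__(self) -> Iterator[bytes]:
        # Stream shards lazily instead of materializing everything.
        yield from self.shards

    def fingerprint(self) -> str:
        # Order-sensitive hash over per-shard digests: changing, adding,
        # removing, or reordering any shard changes the fingerprint,
        # flagging a break in dataset reproducibility.
        h = hashlib.sha256()
        for shard in self.shards:
            h.update(hashlib.sha256(shard).digest())
        return h.hexdigest()


ds = ShardedDataset([b"img:cat.png", b"txt:hello world"])
baseline = ds.fingerprint()

# Identical content reproduces the fingerprint; reordering does not.
assert ShardedDataset([b"img:cat.png", b"txt:hello world"]).fingerprint() == baseline
assert ShardedDataset([b"txt:hello world", b"img:cat.png"]).fingerprint() != baseline
```

In a real pipeline the baseline fingerprint would be recorded when a dataset is standardized, and re-checked before each training run so silent drift is caught early.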

Skills

Required

  • distributed systems
  • data pipelines
  • infrastructure
  • API design
  • scalable abstractions
  • debugging performance bottlenecks
  • operating large fleets of machines

Nice to have

  • mathematics for data systems
  • probability
  • distributed data theory
  • GPU-scale distributed systems
  • dataset scaling for real-time data

What the JD emphasized

  • multimodal data
  • GPU scale
  • distributed dataset loading

Other signals

  • dataset infrastructure
  • LLM training
  • multimodal data
  • GPU scale