Machine Learning Engineer, Distributed Data Systems - Robotics

OpenAI · AI Frontier · San Francisco, CA · Research

Machine Learning Engineer role focused on designing and scaling the distributed data infrastructure behind large-scale multimodal training and evaluation in robotics, keeping it reliable and efficient enough to support rapid iteration cycles.

What you'd actually do

  1. Design, build, and maintain data infrastructure systems such as distributed compute, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, while ensuring scalability, reliability, and security.
  2. Ensure our data platform can scale by orders of magnitude while remaining reliable and efficient.
  3. Partner with researchers to deeply understand requirements and translate them into production-ready systems.
  4. Harden, optimize, and maintain critical data infrastructure systems that power multimodal training and evaluation.

Skills

Required

  • distributed systems
  • large-scale infrastructure
  • data infrastructure
  • data orchestration
  • distributed storage
  • streaming infrastructure
  • machine learning infrastructure
  • scalability
  • reliability
  • security
  • software engineering fundamentals
  • organizational skills

Nice to have

  • robotics
  • multimodal training
  • model evaluation
  • comfort with ambiguity
  • adapting to rapid change

What the JD emphasized

  • strong experience with distributed systems
  • large-scale infrastructure
  • detail-oriented
  • building reliable infrastructure in high-stakes environments
  • rigor in building and maintaining reliable systems

Other signals

  • design and scale infrastructure for multimodal training and evaluation
  • manage distributed data pipelines
  • translate requirements into robust systems
  • harden pipelines for rapid iteration cycles