Software Engineer, Hardware Health

OpenAI · AI Frontier · San Francisco, CA · Scaling

This Software Engineer role focuses on building and maintaining the infrastructure behind large-scale AI model training supercomputers. Responsibilities include owning system health checks, leading deep dives into hardware failures, and building automation that keeps training runs reliable and efficient. The role supports AI research by keeping the underlying systems stable.

What you'd actually do

  1. Own and improve the system health checks that keep our hyperscale supercomputers stable during model training.
  2. Lead deep dives into hardware failures and system-level bugs to understand how things break at scale.
  3. Build automation that monitors and fixes issues across thousands of machines, so researchers can keep moving without interruption.

Skills

Required

  • Python
  • shell scripting
  • SQL
  • PromQL
  • Pandas

Nice to have

  • low-level details of hardware components
  • protocols
  • Linux tooling
  • PCIe
  • InfiniBand
  • networking
  • power management
  • kernel performance tuning
  • visualization of large data centers and networks
  • network operations and tooling
  • power management and stabilization

What the JD emphasized

  • 7+ years of industry experience in software engineering