Software Engineer, Frontier Systems - Power Management

OpenAI · AI Frontier · San Francisco, CA · Scaling

Software Engineer focused on power management for large-scale supercomputers used for AI model training. The role involves developing and implementing system-level and software-level solutions to optimize power usage, building automation for monitoring and stabilization, and collaborating with researchers and hardware teams. It requires strong software engineering experience, Python proficiency, and knowledge of distributed systems and electrical engineering concepts.

What you'd actually do

  1. Develop and implement system-level and software-level solutions to optimize power usage in large-scale supercomputers, ensuring efficient and reliable operations.
  2. Build automation to monitor power consumption patterns during training workloads and design algorithms to stabilize fluctuations in that consumption, preventing issues with grid reliability.
  3. Work with researchers and engineers to design tools for real-time monitoring, detection, and remediation of power-related hardware and system faults.
  4. Collaborate cross-functionally to translate complex electrical system requirements into code, while driving continuous improvements in power management solutions.
  5. Drive the development of power throttling mechanisms at the IT system level to dynamically adjust power usage based on workload demands and infrastructure limitations.

Skills

Required

  • Python
  • automation and scripting tools
  • distributed systems
  • electrical engineering concepts
  • system-level investigations
  • automated solutions
  • power management
  • fault detection
  • remediation
  • analytical skills
  • SQL
  • PromQL
  • Pandas
  • hardware and software teams

Nice to have

  • Deep expertise with the power characteristics of synchronous workloads
  • power control requirements in IT hardware design
  • control system fundamentals

What the JD emphasized

  • 7+ years of software engineering experience
  • large-scale, system-level challenges
  • power management
  • fault detection
  • remediation