Datacenter Hardware Operations Technician Lead, Industrial Compute

OpenAI · AI Frontier · United States · Remote · Scaling

This role supports the operation and maintenance of AI compute infrastructure, focusing on datacenter hardware operations in collaboration with partners such as Oracle. The technician lead coordinates physical hardware activities, ensures maintenance and repairs align with OpenAI's compute needs, and helps develop operational standards for future projects.

What you'd actually do

  1. Serve as OpenAI’s primary on-site hardware contact, collaborating with Oracle teams and vendors to plan and coordinate maintenance, repairs, and lifecycle activities.
  2. Share technical requirements and verify that work performed supports OpenAI’s compute needs and agreed quality targets.
  3. Coordinate schedules, spare-parts planning, and issue escalation with partner teams to minimize downtime and keep operations running smoothly.
  4. Work with OpenAI fleet-health engineers to translate software-detected issues into on-site hardware actions in partnership with Oracle.
  5. Track hardware trends and provide joint recommendations with partner teams for design or operational improvements.

Skills

Required

  • Datacenter hardware operations
  • Hardware engineering
  • Large-scale server maintenance
  • High-density server hardware
  • x86 platforms
  • GPUs
  • Storage devices
  • Power/cooling systems
  • Diagnosing hardware issues
  • Coordinating complex repairs
  • Building strong working relationships
  • Setting technical expectations
  • Validating outcomes through collaboration
  • Adapting to changing operational conditions
  • Solving problems
  • Clear communication
  • Building trust
  • Full-time on-site presence

Nice to have

  • Large-scale cluster management or monitoring tools (IPMI, BMC, Prometheus, Nagios)
  • GPU-accelerated compute clusters
  • High-performance computing hardware
  • Linux/Unix system administration
  • Command-line diagnostic tools
  • Industry certifications (CompTIA Server+, OEM hardware certifications)
  • Environmental Health and Safety best practices

What the JD emphasized

  • 7+ years of experience in datacenter hardware operations
  • at least 2 years in a senior or lead technician capacity
  • GPU-accelerated compute clusters