Manager, Operations

xAI · AI Frontier · Memphis, TN · Data Center

This role manages operations and power generation for hyperscale AI compute facilities, focusing on infrastructure reliability and uptime for AI training. It involves leading teams responsible for power, cooling, networking, and overall facility performance.

What you'd actually do

  1. Lead and scale the facilities operations and power generation teams responsible for the reliable operation, maintenance, monitoring, and optimization of critical infrastructure including on-site power generation assets, electrical systems, mechanical/HVAC, liquid cooling, power distribution, UPS, generators, and building management systems.
  2. Direct the fiber teams overseeing the design, deployment, maintenance, and expansion of high-speed fiber optic networks, dark fiber, and connectivity infrastructure supporting AI compute clusters and data center interconnects.
  3. Own key performance metrics such as uptime (targeting 99.999%+), mean time to detect/repair (MTTD/MTTR), power usage effectiveness (PUE), water usage effectiveness (WUE), power generation efficiency, and overall infrastructure availability.
  4. Develop and enforce standard operating procedures (SOPs), preventive maintenance programs, incident response protocols, and continuous improvement processes for both facilities and power generation assets to minimize downtime and maximize efficiency.
  5. Build, mentor, and grow multidisciplinary teams of operations technicians, power generation engineers, and controls specialists while fostering a culture of ownership, safety, and excellence.
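
For a sense of scale, the metrics named in item 3 can be worked through with simple arithmetic. The sketch below is illustrative only: the function names and the sample meter readings are hypothetical and do not come from the posting. It uses the commonly cited definitions of PUE (total facility energy divided by IT equipment energy) and WUE (site water use in liters divided by IT equipment energy in kWh).

```python
# Illustrative arithmetic for the metrics listed above.
# All function names and sample readings are hypothetical.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Downtime budget per year implied by an availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT energy."""
    return total_facility_kwh / it_kwh


def wue(water_liters: float, it_kwh: float) -> float:
    """Water usage effectiveness: liters of water per kWh of IT energy."""
    return water_liters / it_kwh


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability implied by mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # A 99.999% target leaves roughly 5.26 minutes of downtime per year.
    print(round(allowed_downtime_minutes(99.999), 2))

    # Hypothetical monthly readings, chosen only to show the ratios.
    print(pue(total_facility_kwh=1_200_000, it_kwh=1_000_000))
    print(wue(water_liters=500_000, it_kwh=1_000_000))
    print(steady_state_availability(mtbf_hours=10_000, mttr_hours=1))
```

The downtime figure shows why detection and repair times sit alongside uptime in the same list: at five nines, a single incident lasting more than a few minutes uses up the whole annual budget.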

Skills

Required

  • 5+ years of progressive experience in data center facilities operations, power generation operations, hyperscale infrastructure management, or mission-critical industrial operations, with at least 2 years in a management or supervisory role.
  • Proven track record leading large-scale operations teams supporting high-density compute environments with significant on-site or dedicated power generation (AI, HPC, or hyperscaler data centers strongly preferred).
  • Strong experience managing fiber optic networks, dark fiber deployments, or high-bandwidth connectivity infrastructure in large-scale technical environments.
  • Deep knowledge of power generation systems (gas turbines, reciprocating engines, cogeneration, etc.), MEP (mechanical, electrical, plumbing) systems, BMS/SCADA, liquid cooling, power redundancy topologies, and 24/7 operations best practices.
  • Demonstrated success delivering high reliability, rapid incident resolution, and operational excellence under aggressive scaling timelines.
  • Hands-on leadership style with the ability to roll up sleeves while effectively managing teams, budgets, and cross-functional stakeholders.
  • Proficiency with operations tools, CMMS (computerized maintenance management systems), monitoring platforms, and data-driven decision making.

Nice to have

  • Direct background in AI or hyperscale data center operations, including liquid cooling systems, high-power GPU/accelerator environments, and on-site power generation.
  • Experience building or scaling fiber infrastructure for low-latency, high-bandwidth interconnects between compute clusters or sites.
  • Familiarity with Uptime Institute Tier standards, ASHRAE guidelines, power generation standards (e.g., IEEE, NFPA), OSHA/EPA compliance, and sustainability practices in critical facilities.
  • Bachelor’s or Master’s degree in Electrical or Mechanical Engineering, Power Systems, Facilities Management, or a related field; relevant certifications (CDCP, CDCS, or equivalent) a plus.

What the JD emphasized

  • AI training at unprecedented scale
  • extreme power and cooling demands of next-generation AI systems
  • 99.999%+