HPC Operations Lead

Jump Trading · Quant · Chicago, New York City · IT Infrastructure + WCW

Lead HPC infrastructure operations, with a focus on reliability, standards, and day-to-day operational excellence in data centers. The role covers team leadership, developing operational standards for power, cooling, and hardware, managing preventative maintenance, and serving as a subject matter expert on critical facility systems. It also includes owning the monitoring and incident response strategy, using AI tools for telemetry analysis and issue prediction, and managing hardware break-fix, inventory, and vendor relationships. Strong networking and Linux knowledge is required.

What you'd actually do

  1. Lead and manage data center site leads and their teams across multiple HPC facilities; site leads report directly to this role.
  2. Develop, document, and enforce operational standards and procedures for Jump's HPC data centers covering power, cooling, cabling, and hardware lifecycle.
  3. Own the HPC data center monitoring strategy end-to-end: define what is monitored, set alerting thresholds, and ensure comprehensive visibility into facility and hardware health.
  4. Own the overall hardware break-fix function across all HPC sites, ensuring rapid diagnosis and resolution for servers, GPUs, network equipment, storage, and facility infrastructure.
  5. Conduct capacity planning for space, power, cooling, and cabling to stay ahead of growth.

Skills

Required

  • Team leadership
  • Data center operations
  • HPC infrastructure
  • Power distribution
  • Cooling systems (air and liquid)
  • Environmental monitoring
  • Hardware lifecycle management
  • Server hardware architecture (multi-socket, GPU, BMC/IPMI)
  • Network switch hardware (Arista, Cisco)
  • Hardware break-fix diagnosis
  • Inventory management
  • Capacity planning
  • Vendor management
  • Budget management
  • Networking concepts (L2/L3, BGP, OSPF, high-performance fabrics)
  • Linux
  • AI tools for operational analysis

Nice to have

  • Experience with AI-driven approaches for automation
  • Experience with Arista and Cisco platforms

What the JD emphasized

  • On-site 5 days/week
  • Regular travel to HPC data center sites required
  • Heavy, daily use of AI tools is expected in this role