HPC Data Center Production Engineer

Jump Trading · Quant · Chicago, New York City · IT Infrastructure + WCW

This role focuses on building and owning automation and tooling for HPC data center operations, including hardware onboarding, capacity planning, outage simulation, and monitoring. It calls for heavy daily use of AI tools to accelerate development and for identifying ways to apply AI in data center operations. The role emphasizes production engineering, infrastructure automation, and reliability in large-scale environments.

What you'd actually do

  1. Design, develop, and maintain automation to onboard new hardware devices into Jump's HPC data centers, including servers, network switches, rack PDUs, CDUs, and environmental sensors.
  2. Develop tools for power and cooling capacity planning—enabling the operations and planning teams to model current utilization, forecast growth, and identify constraints before they become problems.
  3. Build and maintain monitoring integrations for HPC data center infrastructure—pulling telemetry from servers, switches, PDUs, CDUs, environmental sensors, and facility systems into centralized observability platforms.
  4. Work very closely with the HPC Planning, Engineering, and Operations leads to understand tooling and monitoring needs and bring their vision to fruition.
  5. Own the reliability and lifecycle of all systems and tools you develop—monitor for failures, respond to issues, and iterate based on operational feedback.

Skills

Required

  • 5+ years of professional experience in production engineering, infrastructure automation, or site reliability engineering, preferably in HPC or large-scale data center environments.
  • Experience automating hardware provisioning and lifecycle management (servers, network devices, power/cooling infrastructure).
  • Strong understanding of data center infrastructure: power distribution, cooling systems (air and liquid), environmental monitoring, and structured cabling.
  • Experience integrating with hardware management interfaces (IPMI/BMC/Redfish, SNMP, vendor APIs) for discovery, configuration, and telemetry collection.
  • High proficiency in Golang and at least one additional language (e.g., Python).
  • Strong Linux systems knowledge; you should live in Linux. Proficient with system administration, networking, storage, process management, and log analysis.

Nice to have

  • Use AI tools daily across all aspects of the role: writing and reviewing code, analyzing data, debugging, generating documentation, and accelerating development velocity.
  • Identify opportunities to apply AI to data center operations problems—anomaly detection, predictive capacity planning, intelligent alerting, and beyond.

What the JD emphasized

  • Proven track record of building and shipping production automation and tooling—not just scripts, but maintained, reliable systems.