HPC Operations Engineer

Jump Trading · Quant · Chicago, IL · IT Infrastructure + WCW

This role is for an HPC Operations Engineer responsible for managing Linux HPC compute, storage, and interconnects in a 24/7 environment. The primary function is operational support, including responding to alerts, participating in maintenance, and collaborating on infrastructure projects. The role requires strong Linux and HPC experience, proficiency in programming/scripting languages, and the ability to perform root cause analysis and manage complex projects.

What you'd actually do

  1. Provide front-line operational support for 24/7 Linux HPC compute, storage, and interconnects. The environment involves RDMA fabrics, parallel filesystems, HPC batch schedulers, FUSE filesystems, internal Jump software, and multi-vendor hardware, along with cybersecurity requirements, a challenging and unpredictable client workload, and high user expectations
  2. Resolve problem reports and answer questions from members of Jump's research community, escalating as needed and managing the entire problem lifecycle
  3. Respond to alerts in a timely fashion
  4. Participate in large, coordinated maintenance operations, including during evenings and weekends
  5. Work on global projects across a wide range of infrastructure

Skills

Required

  • Linux systems
  • high-performance computing (HPC)
  • parallel filesystems
  • batch systems
  • high-performance network interconnects
  • programming/scripting language (e.g., Go, Python, C)
  • root cause analysis
  • verbal and written communication skills
  • collaboration skills
  • complex project management
  • sense of urgency
  • operational maintenance work during evenings and weekends

Nice to have

  • RDMA fabrics
  • FUSE filesystems
  • internal Jump software
  • multi-vendor hardware
  • cybersecurity requirements
  • Slurm
  • Grid Engine

What the JD emphasized

  • operational work as primary job function
  • 2+ years of professional experience with Linux systems
  • 2+ years of professional experience with high-performance computing (HPC)