HPC Production Engineer

Jump Trading · Quant · Chicago, IL · IT Infrastructure + WCW

Jump Trading Group is seeking a Production Engineer for its High Performance Computing team in Chicago. The role involves designing, implementing, and maintaining large-scale compute, storage, and network systems to support quantitative research. It requires strong Linux administration skills, a software development background, and experience with HPC environments.

What you'd actually do

  1. Design, implement, maintain, and support high performance compute and storage systems
  2. Implement and support performance monitoring and fault monitoring systems
  3. Monitor systems and storage performance, up to and including network components
  4. Build tooling to compile, package, install, and upgrade software and operating system components at scale
  5. Collaborate with team members and across teams to write code and test infrastructure spanning both new and existing codebases in multiple programming languages

Skills

Required

  • 5+ years of professional experience in high performance computing (HPC), including parallel filesystems (e.g., Lustre, GPFS) and batch systems (e.g., Slurm, Grid Engine); experience with high-performance network interconnects is a plus, but not required
  • 5+ years of experience with Linux systems administration
  • High proficiency with at least one programming/scripting language (e.g., Go, Python, C)
  • Extensive experience designing, building, and maintaining complicated, interdependent, and distributed systems
  • Extensive experience profiling and debugging application stacks (debuggers and profilers)
  • Experience with system configuration management tools (SaltStack, Ansible, Puppet, etc.)
  • A compulsion to perform root cause analysis
  • Reliable and predictable availability

Nice to have

  • high-performance network interconnects

What the JD emphasized

  • high performance computing
  • Linux systems administration
  • programming/scripting language
  • designing, building, and maintaining complicated, interdependent, and distributed systems
  • profiling and debugging application stacks