HPC Production Engineer

Jump Trading · Quant · Sydney, Australia · IT Infrastructure + WCW

Jump Trading is seeking an HPC Production Engineer to design, implement, maintain, and support high-performance compute and storage systems. The role involves building tooling for software and OS component management, collaborating with researchers to optimize HPC infrastructure, and providing operational support. It requires 5+ years of experience in both HPC and Linux systems administration, plus high proficiency in at least one programming/scripting language such as Go, Python, or C.

What you'd actually do

  1. Design, implement, maintain, and support high performance compute and storage systems
  2. Implement and support performance monitoring and fault monitoring systems
  3. Monitor systems and storage performance, up to and including network components
  4. Build tooling to compile, package, install, and upgrade software and operating system components at scale
  5. Collaborate with team members and across teams to write code and testing infrastructures spanning both new and existing codebases in multiple programming languages

Skills

Required

  • 5+ years of professional experience in high performance computing (HPC)
  • 5+ years of experience with Linux systems administration
  • High proficiency with at least one programming/scripting language (e.g., Go, Python, C)
  • Extensive experience designing, building, and maintaining complicated, interdependent, and distributed systems
  • Extensive experience profiling and debugging application stacks (debuggers and profilers)
  • Experience with system configuration management tools (SaltStack, Ansible, Puppet, etc.)

Nice to have

  • parallel filesystems (e.g., Lustre, GPFS)
  • batch systems (e.g., Slurm, Grid Engine)
  • high-performance network interconnects

What the JD emphasized

  • high performance computing (HPC)
  • Linux systems administration
  • programming/scripting language (e.g., Go, Python, C)
  • designing, building, and maintaining complicated, interdependent, and distributed systems
  • profiling and debugging application stacks