Lead Site Reliability Engineer | Production Infrastructure

Jump Trading · Quant · Chicago, IL · IT Infrastructure + WCW

Lead SRE role focused on operating and improving Jump Trading's production trading environment, emphasizing high-performance monitoring, alerting, automation, incident management, and debugging for ultra-low latency systems.

What you'd actually do

  1. Design & Build: Architect and implement high-performance monitoring and alerting systems, real-time packet/flow analysis tooling, and automation frameworks for managing Jump’s global production footprint.
  2. Lead Operational Maturity: Oversee and improve incident management, change management, and post-incident review processes to increase resilience and reduce downtime.
  3. Drive Efficiency: Identify and eliminate sources of operational toil through automation and tooling.
  4. Collaborate Globally: Partner with engineering, networking, and trading teams in multiple regions to align technical priorities with business objectives.
  5. Debug Deeply: Investigate low-level performance issues across complex software stacks, optimizing for ultra-low latency and high throughput.

Skills

Required

  • Proven leadership experience, including managing people across distributed teams.
  • Demonstrated history of solving reliability challenges in large-scale production environments.
  • Demonstrated strategic thinking and maturity in tackling complex problems that span people, technology, and processes.
  • Strong programming skills in Python, Go, or equivalent.

What the JD emphasized

  • ultra-low latency
  • high throughput
  • operational toil