Lead Systems Reliability Engineer (Linux & Distributed Systems)

The Trade Desk · Media · London, United Kingdom · Engineering

Lead Systems Reliability Engineer role focused on building and maintaining a data-driven platform at global scale, involving real-time processing with low latency. Responsibilities include infrastructure automation, owning operations for Linux-based systems (Aerospike, Kafka, Mongo), reviewing new use cases, and benchmarking hardware. The role involves deep performance engineering and collaboration with hardware/software vendors.

What you'd actually do

  1. Lead a team to influence, manage, and plan work streams, systems, and data structures at scale within a global ecosystem, spanning multiple infrastructure providers (cloud and traditional datacenters).
  2. Encourage, improve, and build infrastructure automation in a way that works with stateful systems at scale.
  3. Own operations for Linux-based systems running Aerospike, Kafka, and Mongo.
  4. Serve as a point of contact to review new use cases, answer questions, and participate in the on-call rotation.
  5. Learn to be a NoSQL subject-matter expert (SME). You do not need prior NoSQL experience to apply – we will train you.

Skills

Required

  • Linux operating system
  • Leadership experience and ability to mentor
  • Troubleshooting techniques: fault isolation and the scientific method
  • Identifying bottlenecks (is it CPU? I/O?)

Nice to have

  • Physical hardware (on-prem) internals, management, and operation
  • Performance testing and tuning
  • Databases (relational or NoSQL)
  • Ansible/PyInfra/Chef
  • Prometheus
  • Kubernetes
  • Python/Ruby/Rust/Bash/Golang/C#

What the JD emphasized

  • First in the Industry
  • Work on Cutting-Edge Hardware
  • Shape the Future of Infrastructure
  • Deep Performance Engineering
  • Push Hardware Endurance Limits