Sr Site Reliability Engineer

The Trade Desk · Media · Sydney, Australia · Site Reliability Engineering

Senior Site Reliability Engineer role focused on designing, building, and scaling a global network platform across physical datacenters and multi-cloud environments. The role involves network automation, troubleshooting complex issues, and improving operational excellence. Experience with Kubernetes networking, load balancers, and infrastructure-as-code is required, as are proficiency in Python or Go for automation and experience integrating AI tools into engineering processes.

What you'd actually do

  1. Design, build, and scale a global network platform spanning physical datacenters and multi-cloud environments across AWS, Azure, and Alibaba Cloud.
  2. Support thousands of hosts worldwide, engineering reliable and efficient solutions to petabyte-scale data challenges.
  3. Own troubleshooting and resolution of complex network issues, upholding high availability and performance across the entire infrastructure footprint.
  4. Lead root cause analysis and postmortems, turning incidents into actionable improvements that raise the bar for operational excellence.
  5. Eliminate toil by building tools, automating workflows, and continuously improving the processes your team depends on every day.

Skills

Required

  • 6-8 years of hands-on network automation and operational experience supporting large-scale production infrastructure
  • strong development and networking experience
  • deep expertise in TCP/IP, the OSI model, and large-scale IP networking protocols including BGP and OSPF
  • hands-on experience with Kubernetes networking technologies such as Cilium and Calico
  • solid understanding of container network interfaces (CNIs)
  • experience managing software load balancers like NGINX Ingress, Envoy, or HAProxy in large-scale production environments
  • skilled at troubleshooting and performance tuning in Kubernetes and Docker environments, with a focus on networking
  • experience operating network devices at scale using network operating systems such as SONiC, Cisco IOS, JunOS, Arista EOS, or Nokia SR Linux/SR OS
  • comfortable with monitoring and alerting systems, writing complex rules and time-series queries using tools like Prometheus and Grafana
  • practice infrastructure-as-code and apply DevOps and SRE principles
  • build robust workflows and pipelines to test and safely deploy changes to production
  • Proficient in creating automation and building tools using Python or Go
  • Experience integrating AI tools (LLMs, MCP, agentic workflows) into engineering processes to automate tasks and improve development velocity

Nice to have

  • Experience running Kubernetes clusters on bare metal
  • interest or background in platform engineering

What the JD emphasized

  • 6-8 years of hands-on network automation and operational experience supporting large-scale production infrastructure
  • deep networking expertise
  • software craftsmanship
  • scalable, maintainable solutions
  • obsessive drive to keep networks healthy, performant, and resilient
  • complex network issues
  • operational excellence
  • large scale production infrastructure
  • large-scale IP networking protocols
  • large-scale production environments
  • troubleshooting and performance tuning
  • operated network devices at scale
  • writing complex rules and time-series queries
  • infrastructure-as-code
  • DevOps and SRE principles
  • build robust workflows and pipelines
  • large-scale, distributed systems
  • resilient, always-on networks
  • evaluates ROI, implementation complexity, and customer impact
  • reduce complexity, mitigate risk
  • scaling cost-effective
  • high-impact contributions
  • long-horizon projects
  • distil complex technical topics
  • drive alignment across teams
  • broader context and motivations
  • work fluidly across engineering disciplines
  • bring people together around a shared goal