Tech Lead, Site Reliability Engineering – Global Traffic Platform

ByteDance · Big Tech · Seattle, WA · R&D

Tech Lead role in Site Reliability Engineering for the Global Traffic Platform. Responsibilities include defining and executing SLO strategy, managing release and change governance, building a global on-call model, leading incident postmortems, and driving stability programs. Requires strong experience in SRE methodology, CI/CD, resilience, observability, cloud-native infrastructure, and engineering leadership.

What you'd actually do

SLO/SLI & Error Budget: Align with business stability goals; own the overall SLO strategy and execution for the platform; build and operate an SLO/SLI & Error Budget framework covering critical user journeys/services.
Release & Change Governance: Drive end-to-end release/change management across code, configuration, network and capacity; establish standardized change reviews, canary/phased rollout strategies, rollback mechanisms, and release window governance.
Incident Management & On-call: Build a global 24x7 follow-the-sun on-call model; unify incident processes (triage, response, escalation, communication) to reduce blast radius and recovery time.
Postmortems & Stability Programs: Lead major incident postmortems; drive cross-team stability programs (e.g., chaos engineering, capacity stress testing, SPOF elimination); distill reusable best practices.
Design for Operability: Partner closely with platform engineering and network/infrastructure teams to shift operability and reliability requirements left, into architectural design and development workflows.

Skills

Required

SRE Methodology
CI/CD & Progressive Delivery
Resilience & Disaster Recovery
Observability
Cloud-native & Networking Fundamentals
Engineering & Team Leadership

Nice to have

scaling and operating a global follow-the-sun on-call and incident command process
eBPF-based observability and diagnosis toolchains
edge traffic infrastructure operations
effective communication with global teams
managing teams of 8–10 engineers

What the JD emphasized

Proven production adoption experience in large-scale online systems
Deep understanding of CI/CD pipelines and deployment patterns such as blue-green, canary, phased rollout, configuration management, and feature flags
Familiar with multi-active architectures, cross-region DR, fault domain design, and degradation strategies
Hands-on experience across metrics, logs, tracing and profiling
Understanding of Kubernetes, CNI, traffic management, and global LB/Anycast
Experience leading cross-functional stability initiatives, defining technical standards and on-call practices, and building a learning-oriented SRE team
Proven track record of scaling and operating a global follow-the-sun on-call and incident command process across regions/time zones
Depth in eBPF-based observability and diagnosis toolchains, and/or edge traffic infrastructure (global load balancing/Anycast, CDN/edge runtime) operations
Demonstrated ability to communicate effectively with global teams across multiple time zones
Experience managing teams of 8–10 engineers in a global, distributed environment


About the Team

The Global Traffic Infrastructure (GTI) team leverages unified platform capabilities to manage edge infrastructure outside China (both self-built and third-party), providing standardized, compliant, scalable, and cost-effective traffic infrastructure capabilities for edge services. Our vision is to build a global edge traffic infrastructure platform and become the long-term cornerstone of ByteDance’s global edge business in terms of scale, performance, and cost.

Responsibilities

SLO/SLI & Error Budget: Align with business stability goals; own the overall SLO strategy and execution for the platform; build and operate an SLO/SLI & Error Budget framework covering critical user journeys/services.
Release & Change Governance: Drive end-to-end release/change management across code, configuration, network and capacity; establish standardized change reviews, canary/phased rollout strategies, rollback mechanisms, and release window governance.
Incident Management & On-call: Build a global 24x7 follow-the-sun on-call model; unify incident processes (triage, response, escalation, communication) to reduce blast radius and recovery time.
Postmortems & Stability Programs: Lead major incident postmortems; drive cross-team stability programs (e.g., chaos engineering, capacity stress testing, SPOF elimination); distill reusable best practices.
Design for Operability: Partner closely with platform engineering and network/infrastructure teams to shift operability and reliability requirements left, into architectural design and development workflows.

Requirements

Minimum Qualifications

SRE Methodology: Strong grasp of SLO/SLI, Error Budget, incident management, and postmortems, with proven production adoption experience in large-scale online systems.
CI/CD & Progressive Delivery: Deep understanding of CI/CD pipelines and deployment patterns such as blue-green, canary, phased rollout, configuration management, and feature flags; able to design and promote a unified change governance system.
Resilience & Disaster Recovery: Familiar with multi-active architectures, cross-region DR, fault domain design, and degradation strategies; able to plan and execute chaos engineering exercises.
Observability: Hands-on experience across metrics, logs, tracing and profiling; familiarity with eBPF-based approaches to improve observability and troubleshooting efficiency.
Cloud-native & Networking Fundamentals: Understanding of Kubernetes, CNI, traffic management, and global LB/Anycast; practical exposure to self-built CDN/edge node runtimes.
Engineering & Team Leadership: Experience leading cross-functional stability initiatives, defining technical standards and on-call practices, and building a learning-oriented SRE team.

Preferred Qualifications

Proven track record of scaling and operating a global follow-the-sun on-call and incident command process across regions/time zones.
Depth in eBPF-based observability and diagnosis toolchains, and/or edge traffic infrastructure (global load balancing/Anycast, CDN/edge runtime) operations.
Demonstrated ability to communicate effectively with global teams across multiple time zones; experience leading complex cross-border technical collaborations.
Experience managing teams of 8–10 engineers in a global, distributed environment.
