Senior SRE - Global Traffic Infrastructure

ByteDance · Big Tech · San Jose, CA · R&D

Senior SRE role focused on building and operating a global edge traffic infrastructure platform. Responsibilities include defining and executing SLO strategy, managing release and change governance, leading incident response, driving stability programs, and ensuring operability in system design. Requires strong experience in SRE methodologies, CI/CD, observability, and cloud-native/networking fundamentals.

What you'd actually do

SLO/SLI & Error Budget: Align with business stability goals; own the overall SLO strategy and execution for the platform; build and operate an SLO/SLI & Error Budget framework covering critical user journeys/services.
Release & Change Governance: Drive end-to-end release/change management across code, configuration, network and capacity; establish standardized change reviews, canary/phased rollout strategies, rollback mechanisms, and release window governance.
Incident Management & On-call: Participate in our own on-call model; unify incident processes (triage, response, escalation, communication) to reduce blast radius and recovery time.
Postmortems & Stability Programs: Participate in major incident postmortems; drive cross-team stability programs (e.g., chaos engineering, capacity stress testing, SPOF elimination); distill reusable best practices.
Design for Operability: Partner closely with platform engineering and network/infrastructure teams to shift-left operability and reliability requirements into architectural design and development workflows.

Skills

Required

SRE Methodology (SLO/SLI, Error Budget, incident management, postmortems)
CI/CD pipelines
Progressive Delivery patterns (blue-green, canary, phased rollout)
Configuration management
Feature flags
Metrics, logs, tracing, profiling
Kubernetes
CNI
Traffic management
Global LB/Anycast
Edge node runtimes

Nice to have

eBPF-based observability and diagnosis toolchains
Edge traffic infrastructure operations
Global follow-the-sun on-call and incident command process
Cross-border technical collaborations

What the JD emphasized

3+ years of experience in SRE/DevOps/Production Engineering/Infrastructure Backend roles, supporting large-scale online systems.
SRE Methodology: Strong grasp of SLO/SLI, Error Budget, incident management, and postmortems, with proven production adoption experience in large-scale online systems.
CI/CD & Progressive Delivery: Deep understanding of CI/CD pipelines and deployment patterns such as blue-green, canary, phased rollout, configuration management, and feature flags; able to design and promote a unified change governance system.
Observability: Hands-on experience across metrics, logs, tracing and profiling; familiarity with eBPF-based approaches to improve observability and troubleshooting efficiency.
Cloud-native & Networking Fundamentals: Understanding of Kubernetes, CNI, traffic management, and global LB/Anycast; practical exposure to self-built CDN/edge node runtimes.

About the Team

The Global Traffic Infrastructure (GTI) team leverages unified platform capabilities to manage edge infrastructure outside China (both self-built and third-party), providing standardized, compliant, scalable, and cost-effective traffic infrastructure capabilities for edge services. Our vision is to build a global edge traffic infrastructure platform and become the long-term cornerstone of ByteDance’s global edge business in terms of scale, performance, and cost.

Responsibilities

SLO/SLI & Error Budget: Align with business stability goals; own the overall SLO strategy and execution for the platform; build and operate an SLO/SLI & Error Budget framework covering critical user journeys/services.
Release & Change Governance: Drive end-to-end release/change management across code, configuration, network and capacity; establish standardized change reviews, canary/phased rollout strategies, rollback mechanisms, and release window governance.
Incident Management & On-call: Participate in our own on-call model; unify incident processes (triage, response, escalation, communication) to reduce blast radius and recovery time.
Postmortems & Stability Programs: Participate in major incident postmortems; drive cross-team stability programs (e.g., chaos engineering, capacity stress testing, SPOF elimination); distill reusable best practices.
Design for Operability: Partner closely with platform engineering and network/infrastructure teams to shift-left operability and reliability requirements into architectural design and development workflows.

Requirements

Minimum Qualifications

3+ years of experience in SRE/DevOps/Production Engineering/Infrastructure Backend roles, supporting large-scale online systems.
SRE Methodology: Strong grasp of SLO/SLI, Error Budget, incident management, and postmortems, with proven production adoption experience in large-scale online systems.
CI/CD & Progressive Delivery: Deep understanding of CI/CD pipelines and deployment patterns such as blue-green, canary, phased rollout, configuration management, and feature flags; able to design and promote a unified change governance system.
Observability: Hands-on experience across metrics, logs, tracing and profiling; familiarity with eBPF-based approaches to improve observability and troubleshooting efficiency.
Cloud-native & Networking Fundamentals: Understanding of Kubernetes, CNI, traffic management, and global LB/Anycast; practical exposure to self-built CDN/edge node runtimes.

Preferred Qualifications

Proven track record of scaling and operating a global follow-the-sun on-call and incident command process across regions/time zones.
Depth in eBPF-based observability and diagnosis toolchains, and/or edge traffic infrastructure (global load balancing/Anycast, CDN/edge runtime) operations.
Demonstrated ability to communicate effectively with global teams across multiple time zones; experience with complex cross-border technical collaborations.
