Tech Lead, Site Reliability Engineering – Global Traffic Platform

ByteDance · Big Tech · Seattle, WA · R&D

Tech Lead for Site Reliability Engineering, focusing on a Global Traffic Platform. Responsibilities include defining and executing SLO strategy, managing release and change governance, building a global on-call model, leading incident postmortems, and driving stability programs. Requires strong experience in SRE methodology, CI/CD, resilience, observability, cloud-native systems, and engineering leadership.

What you'd actually do

  1. SLO/SLI & Error Budget: Align with business stability goals; own the overall SLO strategy and execution for the platform; build and operate an SLO/SLI & Error Budget framework covering critical user journeys/services.
  2. Release & Change Governance: Drive end-to-end release/change management across code, configuration, network and capacity; establish standardized change reviews, canary/phased rollout strategies, rollback mechanisms, and release window governance.
  3. Incident Management & On-call: Build a global 24x7 follow-the-sun on-call model; unify incident processes (triage, response, escalation, communication) to reduce blast radius and recovery time.
  4. Postmortems & Stability Programs: Lead major incident postmortems; drive cross-team stability programs (e.g., chaos engineering, capacity stress testing, SPOF elimination); distill reusable best practices.
  5. Design for Operability: Partner closely with platform engineering and network/infrastructure teams to shift operability and reliability requirements left into architectural design and development workflows.

Skills

Required

  • SRE Methodology
  • CI/CD & Progressive Delivery
  • Resilience & Disaster Recovery
  • Observability
  • Cloud-native & Networking Fundamentals
  • Engineering & Team Leadership

Nice to have

  • Scaling and operating a global follow-the-sun on-call and incident command process
  • eBPF-based observability and diagnosis toolchains
  • Edge traffic infrastructure operations
  • Effective communication with global teams
  • Managing teams of 8–10 engineers

What the JD emphasized

  • Proven production adoption experience in large-scale online systems
  • Deep understanding of CI/CD pipelines and deployment patterns such as blue-green, canary, phased rollout, configuration management, and feature flags
  • Familiarity with multi-active architectures, cross-region DR, fault domain design, and degradation strategies
  • Hands-on experience across metrics, logs, tracing and profiling
  • Understanding of Kubernetes, CNI, traffic management, and global LB/Anycast
  • Experience leading cross-functional stability initiatives, defining technical standards and on-call practices, and building a learning-oriented SRE team
  • Proven track record of scaling and operating a global follow-the-sun on-call and incident command process across regions/time zones
  • Depth in eBPF-based observability and diagnosis toolchains, and/or edge traffic infrastructure (global load balancing/Anycast, CDN/edge runtime) operations
  • Demonstrated ability to communicate effectively with global teams across multiple time zones
  • Experience managing teams of 8–10 engineers in a global, distributed environment