Senior SRE - Global Traffic Infrastructure

ByteDance · Big Tech · San Jose, CA · R&D

Senior SRE role focused on building and operating a global edge traffic infrastructure platform. Responsibilities include defining and executing SLO strategy, managing release and change governance, leading incident response, driving stability programs, and ensuring operability in system design. Requires strong experience in SRE methodologies, CI/CD, observability, and cloud-native/networking fundamentals.

What you'd actually do

  1. SLO/SLI & Error Budget: Align with business stability goals; own the overall SLO strategy and execution for the platform; build and operate an SLO/SLI & Error Budget framework covering critical user journeys/services.
  2. Release & Change Governance: Drive end-to-end release/change management across code, configuration, network and capacity; establish standardized change reviews, canary/phased rollout strategies, rollback mechanisms, and release window governance.
  3. Incident Management & On-call: Participate in the team's on-call model; unify incident processes (triage, response, escalation, communication) to reduce blast radius and recovery time.
  4. Postmortems & Stability Programs: Participate in major incident postmortems; drive cross-team stability programs (e.g., chaos engineering, capacity stress testing, SPOF elimination); distill reusable best practices.
  5. Design for Operability: Partner closely with platform engineering and network/infrastructure teams to shift operability and reliability requirements left, into architectural design and development workflows.

Skills

Required

  • SRE Methodology (SLO/SLI, Error Budget, incident management, postmortems)
  • CI/CD pipelines
  • Progressive Delivery patterns (blue-green, canary, phased rollout)
  • Configuration management
  • Feature flags
  • Metrics, logs, tracing, profiling
  • Kubernetes
  • CNI
  • Traffic management
  • Global LB/Anycast
  • Edge node runtimes

Nice to have

  • eBPF-based observability and diagnosis toolchains
  • Edge traffic infrastructure operations
  • Global follow-the-sun on-call and incident command process
  • Cross-border technical collaboration

What the JD emphasized

  • 3+ years of experience in SRE/DevOps/Production Engineering/Infrastructure Backend roles, supporting large-scale online systems.
  • SRE Methodology: Strong grasp of SLO/SLI, Error Budget, incident management, and postmortems, with proven production adoption experience in large-scale online systems.
  • CI/CD & Progressive Delivery: Deep understanding of CI/CD pipelines and deployment patterns such as blue-green, canary, phased rollout, configuration management, and feature flags; able to design and promote a unified change governance system.
  • Observability: Hands-on experience across metrics, logs, tracing and profiling; familiarity with eBPF-based approaches to improve observability and troubleshooting efficiency.
  • Cloud-native & Networking Fundamentals: Understanding of Kubernetes, CNI, traffic management, and global LB/Anycast; practical exposure to self-built CDN/edge node runtimes.