About the Team
The Global Traffic Infrastructure (GTI) team leverages unified platform capabilities to manage edge infrastructure outside China (both self-built and third-party), providing standardized, compliant, scalable, and cost-effective traffic infrastructure capabilities for edge services. Our vision is to build a global edge traffic infrastructure platform and become the long-term cornerstone of ByteDance’s global edge business in terms of scale, performance, and cost.
Responsibilities
- SLO/SLI & Error Budget: Align with business stability goals; own the overall SLO strategy and execution for the platform; build and operate an SLO/SLI & Error Budget framework covering critical user journeys/services.
- Release & Change Governance: Drive end-to-end release/change management across code, configuration, network and capacity; establish standardized change reviews, canary/phased rollout strategies, rollback mechanisms, and release window governance.
- Incident Management & On-call: Participate in the team's on-call rotation; unify incident processes (triage, response, escalation, communication) to reduce blast radius and recovery time.
- Postmortems & Stability Programs: Participate in major incident postmortems; drive cross-team stability programs (e.g., chaos engineering, capacity stress testing, SPOF elimination); distill reusable best practices.
- Design for Operability: Partner closely with platform engineering and network/infrastructure teams to shift operability and reliability requirements left into architectural design and development workflows.
Requirements
Minimum Qualifications
- 3+ years of experience in SRE/DevOps/Production Engineering/Infrastructure Backend roles, supporting large-scale online systems.
- SRE Methodology: Strong grasp of SLO/SLI, Error Budget, incident management, and postmortems, with proven production adoption experience in large-scale online systems.
- CI/CD & Progressive Delivery: Deep understanding of CI/CD pipelines and deployment patterns such as blue-green, canary, and phased rollout, as well as configuration management and feature flags; able to design and promote a unified change governance system.
- Observability: Hands-on experience across metrics, logs, tracing and profiling; familiarity with eBPF-based approaches to improve observability and troubleshooting efficiency.
- Cloud-native & Networking Fundamentals: Understanding of Kubernetes, CNI, traffic management, and global LB/Anycast; practical exposure to self-built CDN/edge node runtimes.
Preferred Qualifications
- Proven track record of scaling and operating a global follow-the-sun on-call and incident command process across regions/time zones.
- Depth in eBPF-based observability and diagnosis toolchains, and/or edge traffic infrastructure (global load balancing/Anycast, CDN/edge runtime) operations.
- Demonstrated ability to communicate effectively with global teams across multiple time zones; experience with complex cross-border technical collaboration.