The Global System Service team owns the infrastructure services and management solutions that power ByteDance's data centers outside of China, from day-to-day operations to long-term architecture design and maintenance. The team composes end-to-end solutions from open-source community tools and in-house products, tailored to the business requirements and operational complexities of large-scale infrastructure across ByteDance's non-China regions. Our mission is to deliver efficient infrastructure solutions and a stable, secure system environment for ByteDance's global business.
We are looking for a self-motivated systems engineer with an SRE mindset and DevOps skills. Your responsibilities will include:
- Manage and maintain large-scale host infrastructure across ByteDance's non-China data centers, covering OS lifecycle management, configuration standardization, and fleet-wide health monitoring.
- Own the reliability and availability of core data center foundational services, including DNS, NTP, DHCP, NAT, APT repository, and Kerberos authentication.
- Design and implement deployment architectures for foundational services, ensuring high availability, fault tolerance, and disaster recovery across regions.
- Develop and enforce SLOs for managed services; lead incident response, root cause analysis, and post-mortem reviews to drive continuous reliability improvements.
- Collaborate with network, security, and application teams to ensure foundational services meet the evolving demands of global business growth.
- Identify automation opportunities across host management and service operations; drive tooling and process improvements to reduce toil and increase operational efficiency.
Requirements
Minimum Qualifications:
- Bachelor’s degree or higher in Electrical Engineering, Computer Engineering, Computer Science, or a related field.
- Solid experience in large-scale Linux host management, including OS deployment, configuration management, patching, and fleet operations.
- Strong hands-on knowledge of core data center foundational services: DNS (BIND/PowerDNS), NTP, DHCP, NAT, APT repository management, and Kerberos.
- Proficiency with DevOps tooling, including configuration management tools (e.g., Ansible, Salt, Puppet) and CI/CD pipelines.
- Familiarity with SRE principles and practices, including SLO/SLI definition, error budget management, and blameless post-mortems.
- Solid understanding of high availability design patterns, active-active/active-passive architectures, and disaster recovery strategies.
- Strong troubleshooting skills across the Linux system stack and network layer.
Preferred Qualifications:
- Experience managing host fleets at scale (thousands of nodes or more) in a production environment.
- Scripting or development experience in Python, Go, or Bash for automation and tooling.
- Exposure to hybrid or multi-region data center environments.