What you'd actually do

Manage and maintain large-scale host infrastructure across ByteDance's non-China data centers, covering OS lifecycle management, configuration standardization, and fleet-wide health monitoring.

Own the reliability and availability of core data center foundational services, including DNS, NTP, DHCP, NAT, APT repository, and Kerberos authentication.

Design and implement deployment architectures for foundational services, ensuring high availability, fault tolerance, and disaster recovery across regions.

Develop and enforce SLOs for managed services; lead incident response, root cause analysis, and post-mortem reviews to drive continuous reliability improvements.

Identify automation opportunities across host management and service operations; drive tooling and process improvements to reduce toil and increase operational efficiency.

Skills

Required

large-scale Linux host management
OS deployment
configuration management
patching
fleet operations
core data center foundational services (DNS, NTP, DHCP, NAT, APT repository, Kerberos)
DevOps tooling
configuration management tools (Ansible, Salt, Puppet)
CI/CD pipelines
SRE principles and practices
SLO/SLI definition
error budget management
blameless post-mortems
high availability design patterns
active-active/active-passive architectures
disaster recovery strategies
Linux system stack troubleshooting
network layer troubleshooting

Nice to have

managing host fleets at scale (thousands of nodes or above) in a production environment
scripting or development experience in Python, Go, or Bash for automation and tooling
hybrid or multi-region data center environments

The Global System Service team owns the infrastructure services and management solutions that power ByteDance's data centers outside of China — from day-to-day operations to long-term architecture design and maintenance. The team specializes in composing end-to-end solutions by drawing on both open-source community tools and in-house developed products, tailored to both the business requirements and the operational complexities of large-scale infrastructure across ByteDance's non-China regions. Our mission is to deliver efficient infrastructure solutions and a stable, secure system environment for ByteDance's global business.

We are looking for a self-motivated systems engineer with an SRE mindset and DevOps skills. Your responsibilities will include:

Manage and maintain large-scale host infrastructure across ByteDance's non-China data centers, covering OS lifecycle management, configuration standardization, and fleet-wide health monitoring.
Own the reliability and availability of core data center foundational services, including DNS, NTP, DHCP, NAT, APT repository, and Kerberos authentication.
Design and implement deployment architectures for foundational services, ensuring high availability, fault tolerance, and disaster recovery across regions.
Develop and enforce SLOs for managed services; lead incident response, root cause analysis, and post-mortem reviews to drive continuous reliability improvements.
Collaborate with network, security, and application teams to ensure foundational services meet the evolving demands of global business growth.
Identify automation opportunities across host management and service operations; drive tooling and process improvements to reduce toil and increase operational efficiency.

Requirements

Minimum Qualifications:

Bachelor’s degree or higher in Electrical Engineering, Computer Engineering, Computer Science, or a related field.
Solid experience in large-scale Linux host management, including OS deployment, configuration management, patching, and fleet operations.
Strong hands-on knowledge of core data center foundational services: DNS (BIND/PowerDNS), NTP, DHCP, NAT, APT repository management, and Kerberos.
Proficiency with DevOps tooling, including configuration management tools (e.g., Ansible, Salt, Puppet) and CI/CD pipelines.
Familiarity with SRE principles and practices, including SLO/SLI definition, error budget management, and blameless post-mortems.
Solid understanding of high availability design patterns, active-active/active-passive architectures, and disaster recovery strategies.
Strong troubleshooting skills across the Linux system stack and network layer.

Preferred Qualifications:

Experience managing host fleets at scale (thousands of nodes or above) in a production environment.
Scripting or development experience in Python, Go, or Bash for automation and tooling.
Exposure to hybrid or multi-region data center environments.
