Site Reliability Engineer - System Service Global

ByteDance · Big Tech · San Jose, CA · Infrastructure

The Site Reliability Engineer in this role manages and maintains large-scale host infrastructure and foundational services in ByteDance's global data centers, with a focus on reliability, availability, and automation.

What you'd actually do

  1. Manage and maintain large-scale host infrastructure across ByteDance's non-China data centers, covering OS lifecycle management, configuration standardization, and fleet-wide health monitoring.
  2. Own the reliability and availability of core data center foundational services, including DNS, NTP, DHCP, NAT, APT repository, and Kerberos authentication.
  3. Design and implement deployment architectures for foundational services, ensuring high availability, fault tolerance, and disaster recovery across regions.
  4. Develop and enforce SLOs for managed services; lead incident response, root cause analysis, and post-mortem reviews to drive continuous reliability improvements.
  5. Identify automation opportunities across host management and service operations; drive tooling and process improvements to reduce toil and increase operational efficiency.

Skills

Required

  • large-scale Linux host management
  • OS deployment
  • configuration management
  • patching
  • fleet operations
  • core data center foundational services (DNS, NTP, DHCP, NAT, APT repository, Kerberos)
  • DevOps tooling
  • configuration management tools (Ansible, Salt, Puppet)
  • CI/CD pipelines
  • SRE principles and practices
  • SLO/SLI definition
  • error budget management
  • blameless post-mortems
  • high availability design patterns
  • active-active/active-passive architectures
  • disaster recovery strategies
  • Linux system stack troubleshooting
  • network layer troubleshooting

Nice to have

  • managing host fleets at scale (thousands of nodes or more) in a production environment
  • scripting or development experience in Python, Go, or Bash for automation and tooling
  • hybrid or multi-region data center environments