Tech Lead Cloud Site Reliability Engineer - DCS Cloud

ByteDance · Big Tech · San Jose, CA · R&D

Tech Lead Cloud Site Reliability Engineer responsible for building, scaling, and operating ByteDance's global hyper-scale infrastructure, including public and private clouds, server fleet management, and cloud solutions. The role involves developing automation tools, managing cloud images, ensuring compliance, and handling technical operations and incident response. Requires strong experience in Linux operations, SRE, DevOps, platform development, and system fundamentals, with preferred experience in public cloud platforms, Kubernetes, and GPU cluster operations.

What you'd actually do

Design, build, scale, and operate ByteDance’s global infrastructure, including large-scale systems spanning public and private clouds.
Develop tools, automation frameworks, visualizations, and monitoring systems to streamline operations and drive optimization of global infrastructure.
Create, manage, and standardize cloud AMIs/images for use across multiple environments, ensuring strict alignment with the company's global compliance standards.
Thrive in a fast-paced environment, engaging in technical operations and on-call rotations to address incidents related to cloud, OS, network, performance, and reliability.
Drive improvements across the entire infrastructure lifecycle, from ideation and design through development, deployment, user support, and continuous refinement.

Skills

Required

Linux operations
SRE
DevOps
Go
Python
C++
platform development
system tooling
automation
Linux OS principles
computer networks
storage systems
GPU systems
databases
troubleshooting
root-cause analysis
monitoring and alerting
capacity management
change management
incident response
postmortem processes
communication
collaboration
ownership
results-oriented mindset

Nice to have

public cloud platforms (OCI, AWS, Azure, GCP)
cloud host delivery
image/AMI systems
resource scheduling
network adaptation
virtualization technologies (KVM/QEMU)
Docker
Kubernetes
containerd
cgroups
namespaces
CUDA
MIG
topology awareness
stress testing
failure drill systems
capacity governance
change governance
observability platforms
resource cost optimization
open-source contributions
technical blogs
patents
technical sharing
large-scale production environments

What the JD emphasized

5+ years of experience in Linux operations, SRE, or DevOps.
Proficient in at least one programming language such as Go, Python, or C++, with solid engineering capabilities in platform development, system tooling, and automation.
Strong computer science fundamentals, with deep understanding of Linux OS principles, computer networks, storage systems, GPU systems, and databases, along with systematic troubleshooting and root-cause analysis skills.
Familiar with core reliability practices, including monitoring and alerting, capacity management, change management, canary/gray releases, incident response, and postmortem processes.
Strong communication and collaboration skills, with the ability to proactively identify problems, drive cross-team execution, and demonstrate strong ownership and a results-oriented mindset.
Experience maintaining GPU clusters, including drivers, CUDA, MIG, topology awareness, troubleshooting, stress testing, and GPU delivery pipelines.
Proven experience in reliability-focused initiatives such as failure drill systems, capacity governance, change governance, observability platforms, and resource cost optimization.

Full job description

Our Infrastructure Engineering team supports the company's fast growth by building and operating hyper-scale datacenters, managing the lifecycle of the server fleet, providing cloud solutions, and developing infrastructure services that are scalable and reliable.

We have three subgroups for this role:

Cloud Host Delivery, Delivery & Standardization
Cloud Host Operation, Operation Efficiency & Reliability
Cloud Management & Security

Responsibilities - What You'll Do

Design, build, scale, and operate ByteDance’s global infrastructure, including large-scale systems spanning public and private clouds.
Develop tools, automation frameworks, visualizations, and monitoring systems to streamline operations and drive optimization of global infrastructure.
Create, manage, and standardize cloud AMIs/images for use across multiple environments, ensuring strict alignment with the company's global compliance standards.
Thrive in a fast-paced environment, engaging in technical operations and on-call rotations to address incidents related to cloud, OS, network, performance, and reliability.
Drive improvements across the entire infrastructure lifecycle, from ideation and design through development, deployment, user support, and continuous refinement.

Requirements

Minimum Qualifications

Bachelor’s degree or above in Computer Science, Software Engineering, Information Security, or a related field.
5+ years of experience in Linux operations, SRE, or DevOps.
Proficient in at least one programming language such as Go, Python, or C++, with solid engineering capabilities in platform development, system tooling, and automation.
Strong computer science fundamentals, with deep understanding of Linux OS principles, computer networks, storage systems, GPU systems, and databases, along with systematic troubleshooting and root-cause analysis skills.
Familiar with core reliability practices, including monitoring and alerting, capacity management, change management, canary/gray releases, incident response, and postmortem processes.
Strong communication and collaboration skills, with the ability to proactively identify problems, drive cross-team execution, and demonstrate strong ownership and a results-oriented mindset.

Preferred Qualifications

Hands-on experience operating public cloud platforms, or deep familiarity with major cloud providers such as OCI, AWS, Azure, and GCP, including an understanding of their underlying mechanisms.
Experience with large-scale cloud host delivery, image/AMI systems, resource scheduling, network adaptation, and virtualization technologies such as KVM/QEMU.
Familiar with containers and cloud-native ecosystems, including Docker, Kubernetes, and containerd, with a solid understanding of isolation mechanisms like cgroups and namespaces.
Experience maintaining GPU clusters, including drivers, CUDA, MIG, topology awareness, troubleshooting, stress testing, and GPU delivery pipelines.
Proven experience in reliability-focused initiatives such as failure drill systems, capacity governance, change governance, observability platforms, and resource cost optimization.
Open-source contributions, technical blogs, patents, or technical sharing experience are highly preferred.
Experience operating large-scale production environments is a strong plus.
