Tech Lead Cloud Site Reliability Engineer - DCS Cloud

ByteDance · Big Tech · San Jose, CA · R&D

The Tech Lead Cloud Site Reliability Engineer is responsible for building, scaling, and operating ByteDance's global hyper-scale infrastructure, including public and private clouds, server fleet management, and cloud solutions. The role involves developing automation tools, managing cloud images, ensuring compliance, and handling technical operations and incident response. It requires strong experience in Linux operations, SRE, DevOps, platform development, and system fundamentals; experience with public cloud platforms, Kubernetes, and GPU cluster operations is preferred.

What you'd actually do

  1. Design, build, scale, and operate ByteDance’s global infrastructure, including large-scale systems spanning public and private clouds.
  2. Develop tools, automation frameworks, visualizations, and monitoring systems to streamline operations and drive optimization of global infrastructure.
  3. Create, manage, and standardize cloud AMIs/images for use across multiple environments, ensuring strict alignment with the company's global compliance standards.
  4. Thrive in a fast-paced environment, engaging in technical operations and on-call rotations to address incidents related to cloud, OS, network, performance, and reliability.
  5. Drive improvements across the entire infrastructure lifecycle, from ideation and design through development, deployment, user support, and continuous refinement.

Skills

Required

  • Linux operations
  • SRE
  • DevOps
  • Go
  • Python
  • C++
  • platform development
  • system tooling
  • automation
  • Linux OS principles
  • computer networks
  • storage systems
  • GPU systems
  • databases
  • troubleshooting
  • root-cause analysis
  • monitoring and alerting
  • capacity management
  • change management
  • incident response
  • postmortem processes
  • communication
  • collaboration
  • ownership
  • results-oriented mindset

Nice to have

  • public cloud platforms (OCI, AWS, Azure, GCP)
  • cloud host delivery
  • image/AMI systems
  • resource scheduling
  • network adaptation
  • virtualization technologies (KVM/QEMU)
  • Docker
  • Kubernetes
  • containerd
  • cgroups
  • namespaces
  • CUDA
  • MIG
  • topology awareness
  • stress testing
  • failure drill systems
  • capacity governance
  • change governance
  • observability platforms
  • resource cost optimization
  • open-source contributions
  • technical blogs
  • patents
  • technical sharing
  • large-scale production environments

What the JD emphasized

  • 5+ years of experience in Linux operations, SRE, or DevOps.
  • Proficient in at least one programming language such as Go, Python, or C++, with solid engineering capabilities in platform development, system tooling, and automation.
  • Strong computer science fundamentals, with deep understanding of Linux OS principles, computer networks, storage systems, GPU systems, and databases, along with systematic troubleshooting and root-cause analysis skills.
  • Familiar with core reliability practices, including monitoring and alerting, capacity management, change management, canary/gray releases, incident response, and postmortem processes.
  • Strong communication and collaboration skills, with the ability to proactively identify problems, drive cross-team execution, and demonstrate strong ownership and a results-oriented mindset.
  • Experience maintaining GPU clusters, including drivers, CUDA, MIG, topology awareness, troubleshooting, stress testing, and GPU delivery pipelines.
  • Proven experience in reliability-focused initiatives such as failure drill systems, capacity governance, change governance, observability platforms, and resource cost optimization.