Tech Lead Cloud Site Reliability Engineer - Dcs Cloud

ByteDance · Big Tech · Seattle, WA · R&D

Tech Lead Cloud Site Reliability Engineer responsible for building, scaling, and operating ByteDance's global hyper-scale infrastructure, including public and private clouds. The role involves developing automation tools, managing cloud AMIs, and engaging in technical operations and on-call rotations. Requires strong experience in Linux operations, SRE, or DevOps, proficiency in programming languages like Go or Python, and deep understanding of computer science fundamentals, reliability practices, and cloud platforms.

What you'd actually do

  1. Design, build, scale, and operate ByteDance’s global infrastructure, including large-scale systems spanning public and private clouds.
  2. Develop tools, automation frameworks, visualizations, and monitoring systems to streamline operations and drive optimization of global infrastructure.
  3. Create, manage, and standardize cloud AMIs/images for use across multiple environments, ensuring strict alignment with the company's global compliance standards.
  4. Thrive in a fast-paced environment, engaging in technical operations and on-call rotations to address incidents related to cloud, OS, network, performance, and reliability.
  5. Drive improvements across the entire infrastructure lifecycle, from ideation and design through development, deployment, user support, and continuous refinement.

Skills

Required

  • Linux operations
  • SRE
  • DevOps
  • Go
  • Python
  • C++
  • platform development
  • system tooling
  • automation
  • Linux OS principles
  • computer networks
  • storage systems
  • GPU systems
  • databases
  • troubleshooting
  • root-cause analysis
  • monitoring and alerting
  • capacity management
  • change management
  • canary/gray releases
  • incident response
  • postmortem processes
  • communication
  • collaboration
  • ownership
  • results-oriented mindset

Nice to have

  • operating public cloud platforms
  • OCI
  • AWS
  • Azure
  • GCP
  • cloud host delivery
  • image/AMI systems
  • resource scheduling
  • network adaptation
  • virtualization technologies
  • KVM/QEMU
  • containers
  • Kubernetes
  • containerd
  • cgroups
  • namespaces
  • GPU clusters
  • drivers
  • CUDA
  • MIG
  • topology awareness
  • stress testing
  • GPU delivery pipelines
  • failure drill systems
  • capacity governance
  • change governance
  • observability platforms
  • resource cost optimization
  • open-source contributions
  • technical blogs
  • patents
  • technical sharing
  • large-scale production environments

What the JD emphasized

  • 5+ years of experience in Linux operations, SRE, or DevOps
  • Proficient in at least one programming language such as Go, Python, or C++
  • Strong computer science fundamentals, with deep understanding of Linux OS principles, computer networks, storage systems, GPU systems, and databases
  • Familiar with core reliability practices, including monitoring and alerting, capacity management, change management, canary/gray releases, incident response, and postmortem processes
  • Strong communication and collaboration skills, with the ability to proactively identify problems, drive cross-team execution, and demonstrate strong ownership and a results-oriented mindset