Cloud Site Reliability Engineer - DCS Cloud

ByteDance · Big Tech · San Jose, CA · R&D

Seeking a Cloud Site Reliability Engineer to build and operate hyper-scale datacenters, manage server fleet lifecycle, provide cloud solutions, and develop scalable and reliable infrastructure services. Responsibilities include designing, building, scaling, and operating global infrastructure, developing automation tools, creating and managing cloud AMIs, engaging in technical operations and on-call rotations, and driving improvements across the infrastructure lifecycle. Requires a Bachelor's degree, 2+ years of experience in Linux operations/SRE/DevOps, proficiency in Go/Python/C++, strong CS fundamentals, and familiarity with reliability practices.

What you'd actually do

  1. Design, build, scale, and operate ByteDance’s global infrastructure, including large-scale systems spanning public and private clouds.
  2. Develop tools, automation frameworks, visualizations, and monitoring systems to streamline operations and drive optimization of global infrastructure.
  3. Create, manage, and standardize cloud AMIs/images for use across multiple environments, ensuring strict alignment with the company's global compliance standards.
  4. Thrive in a fast-paced environment, engaging in technical operations and on-call rotations to address incidents related to cloud, OS, network, performance, and reliability.
  5. Drive improvements across the entire infrastructure lifecycle, from ideation and design through development, deployment, user support, and continuous refinement.

Skills

Required

  • Linux operations
  • SRE
  • DevOps
  • Go
  • Python
  • C++
  • platform development
  • system tooling
  • automation
  • Linux OS principles
  • computer networks
  • storage systems
  • GPU systems
  • databases
  • troubleshooting
  • root-cause analysis
  • monitoring and alerting
  • capacity management
  • change management
  • canary/gray releases
  • incident response
  • postmortem processes
  • communication
  • collaboration
  • ownership
  • results-oriented mindset

Nice to have

  • public cloud platforms
  • OCI
  • AWS
  • Azure
  • GCP
  • large-scale cloud host delivery
  • image/AMI systems
  • resource scheduling
  • network adaptation
  • virtualization technologies
  • KVM/QEMU
  • containers
  • Kubernetes
  • containerd
  • cgroups
  • namespaces
  • GPU clusters
  • drivers
  • CUDA
  • MIG
  • topology awareness
  • stress testing
  • GPU delivery pipelines
  • failure drill systems
  • capacity governance
  • change governance
  • observability platforms
  • resource cost optimization
  • open-source contributions
  • technical blogs
  • patents
  • technical sharing
  • large-scale production environments

What the JD emphasized

  • strict alignment with the company's global compliance standards