Senior Site Reliability Engineer - Foundational Storage, ByteStore

ByteDance · Big Tech · Seattle, WA · Infrastructure

This is a Senior Site Reliability Engineer role on ByteStore, ByteDance's self-developed distributed storage system. The role focuses on reliability engineering, operational automation, system performance, capacity planning, and incident response to ensure high availability, durability, and low latency for various storage and computing products. Responsibilities include designing SRE tooling, managing deployments, researching new reliability technologies, partnering with business teams, participating in on-call duties, and troubleshooting complex cross-layer issues.

What you'd actually do

  1. Design, develop, and maintain SRE tooling – Build automation for deployment, configuration management, failover, disaster recovery drills, and capacity planning to reduce operational toil.
  2. Participate in deployment and release processes – Lead safe change management (canary, gradual rollout, rollback) for ByteStore components across production environments.
  3. Research cutting‑edge storage & SRE technologies – Explore new reliability patterns, chaos engineering, observability techniques, and cost‑efficient storage hardware/software.
  4. Partner with business teams – Continuously improve the stability, functionality, and performance of ByteStore from an operational perspective, ensuring downstream products (databases, queues, object storage) have reliable storage.
  5. Participate in online operational changes, on‑call duties, and issue investigation – Join on‑call rotations, respond to incidents, drive root cause analysis, and lead post‑mortems to prevent recurrence.

Skills

Required

  • C++
  • distributed storage systems
  • troubleshooting
  • problem-solving
  • teamwork
  • communication
  • customer service

Nice to have

  • NVMe
  • SPDK/DPDK
  • metadata subsystems
  • unit/functional/system testing
  • Linux performance tuning
  • production SRE/operations for large-scale distributed storage systems
  • distributed systems
  • HDFS
  • Ceph
  • GlusterFS
  • Prometheus
  • Grafana
  • ELK
  • chaos engineering
  • capacity planning

What the JD emphasized

  • Strong coding skills in C++, with solid awareness of code quality (essential for writing reliable automation and debugging core storage components).