Site Reliability Engineer - Foundational Storage, ByteStore

ByteDance · Big Tech · Seattle, WA · Infrastructure

Site Reliability Engineer for ByteStore, an in-house distributed storage system that supports various storage and computing products. The role covers reliability engineering, operational automation, system performance profiling, capacity planning, and incident response to ensure strict SLOs are met. Responsibilities include designing SRE tooling, participating in deployment processes, researching new reliability technologies, partnering with business teams, taking part in operational changes and on-call duties, and troubleshooting complex cross-layer issues.

What you'd actually do

  1. Design, develop, and maintain SRE tooling – Build automation for deployment, configuration management, failover, disaster recovery drills, and capacity planning to reduce operational toil.
  2. Participate in deployment and release processes – Lead safe change management (canary, gradual rollout, rollback) for ByteStore components across production environments.
  3. Research cutting‑edge storage & SRE technologies – Explore new reliability patterns, chaos engineering, observability techniques, and cost‑efficient storage hardware/software.
  4. Partner with business teams – Continuously improve the stability, functionality, and performance of ByteStore from an operational perspective, ensuring downstream products (databases, queues, object storage) have reliable storage.
  5. Participate in online operational changes, on‑call duties, and issue investigation – Join on‑call rotations, respond to incidents, drive root cause analysis, and lead post‑mortems to prevent recurrence.

Skills

Required

  • C++
  • distributed storage systems
  • troubleshooting
  • problem-solving
  • teamwork
  • communication

Nice to have

  • NVMe
  • SPDK/DPDK
  • metadata subsystems
  • unit/functional/system testing
  • Linux performance tuning
  • production SRE/operations for large-scale distributed storage systems
  • distributed systems
  • HDFS
  • Ceph
  • GlusterFS
  • observability tools (Prometheus, Grafana, ELK)
  • chaos engineering
  • capacity planning

What the JD emphasized

  • Strong coding skills in C++, with solid awareness of code quality (essential for writing reliable automation and debugging core storage components).