Tech Lead - Data Infrastructure Site Reliability

ByteDance · Big Tech · Seattle, WA · R&D

Tech Lead for Site Reliability Engineering (SRE) focused on building and operating large-scale data infrastructure with high reliability and efficiency. The role involves deep technical leadership, architectural improvements, and collaboration with cross-functional teams to deliver resilient, scalable platforms. Responsibilities include designing, developing, and operating cloud infrastructure, automating operations, troubleshooting complex issues, and promoting best practices in system design and observability.

What you'd actually do

  1. Apply strong hands-on skills to design, develop, and operate large-scale cloud infrastructure and distributed systems.
  2. Collaborate with cross-functional teams (e.g., Advertising, Machine Learning, E-commerce, and Core Infra) to drive system reliability, performance, and scalability.
  3. Lead initiatives to automate operations, eliminate toil, and improve overall system efficiency.
  4. Troubleshoot complex production issues, perform root-cause analysis, and drive long-term reliability improvements.
  5. Promote best practices in system design, observability, performance optimization, and cost efficiency.

Skills

Required

  • Site Reliability Engineering
  • Software Development
  • Cloud-based systems design, building, scaling, and operation
  • Databases (SQL/NoSQL)
  • Kubernetes or container orchestration
  • Big Data processing and storage systems (streaming and batch)
  • System architecture
  • Distributed systems
  • Performance bottleneck analysis
  • Communication skills
  • Collaboration skills

Nice to have

  • Automation
  • Tooling
  • Process improvements
  • Cost optimization
  • Performance tuning
  • Data-driven decision making
  • Adopting new technologies
  • Improving operational practices
  • Influencing system design

What the JD emphasized

  • 5+ years of experience in Site Reliability Engineering, Software Development, or related fields, with a strong focus on designing, building, scaling, and operating cloud-based systems.
  • Deep hands-on expertise in at least one of the following areas: databases (SQL/NoSQL), Kubernetes or container orchestration, or Big Data processing and storage systems (streaming and batch).
  • Strong knowledge of system architecture, distributed systems, and performance bottlenecks.