Tech Lead - Data Infrastructure Site Reliability

ByteDance · Big Tech · San Jose, CA · R&D

This role is for a Technical Lead in Site Reliability Engineering (SRE) who builds and operates large-scale data infrastructure with high reliability and efficiency. It involves deep technical leadership, architectural improvements, and collaboration across engineering, product, data, and infrastructure teams to deliver resilient, scalable platforms. Key responsibilities include designing, developing, and operating cloud infrastructure, automating operations, troubleshooting production issues, and promoting best practices in system design and observability. The role requires deep hands-on expertise in at least one of the following areas: databases, Kubernetes, or Big Data systems.

What you'd actually do

  1. Apply strong hands-on skills to the design, development, and operation of large-scale cloud infrastructure and distributed systems.
  2. Collaborate with cross-functional teams (e.g., Advertising, Machine Learning, E-commerce, and Core Infra) to drive system reliability, performance, and scalability.
  3. Lead initiatives to automate operations, eliminate toil, and improve overall system efficiency.
  4. Troubleshoot complex production issues, perform root-cause analysis, and drive long-term reliability improvements.
  5. Promote best practices in system design, observability, performance optimization, and cost efficiency.

Skills

Required

  • Site Reliability Engineering
  • Software Development
  • Cloud-based systems
  • Databases (SQL/NoSQL)
  • Kubernetes
  • Container orchestration
  • Big Data processing
  • Big Data storage systems
  • System architecture
  • Distributed systems
  • Performance bottlenecks
  • Communication skills
  • Collaboration skills

Nice to have

  • Automation
  • Tooling
  • Process improvements
  • Cost optimization
  • Performance tuning
  • Data-driven decision making
  • Thought leadership
  • Adopting new technologies
  • Improving operational practices
  • Influencing system design

What the JD emphasized

  • 5+ years of experience in Site Reliability Engineering, Software Development, or related fields, with a strong focus on designing, building, scaling, and operating cloud-based systems.
  • Deep hands-on expertise in at least one of the following areas: databases (SQL/NoSQL); Kubernetes or container orchestration; Big Data processing and storage systems (streaming and batch).
  • Strong knowledge of system architecture, distributed systems, and performance bottlenecks.