Senior Site Reliability Engineer - Data Infrastructure (San Jose)

ByteDance · Big Tech · San Jose, CA · Infrastructure

The Senior Site Reliability Engineer focuses on the reliability, scalability, and efficiency of core data services in a massive distributed environment. Responsibilities include incident response, SLO management, capacity optimization, automation, and AI infrastructure maintenance. This role is primarily focused on operational work and platform resilience, not feature development.

What you'd actually do

  1. Incident response and postmortems: Act as an incident commander for critical production issues, guiding the team through triage and resolution. Drive deep, blameless post-incident reviews and ensure that follow-up actions are implemented to prevent recurrence.
  2. SLO/SLA and error budgets: Define, negotiate, and maintain Service Level Objectives (SLOs) for critical data services. Champion the use of error budgets to balance reliability work with feature development.
  3. Capacity and cost optimization: Lead initiatives in capacity planning, performance tuning, and resource management. Develop strategies and automation to ensure our infrastructure scales efficiently and stays within budget.
  4. Pragmatic automation and AI orchestration: Design and build automation, and use AI agents, to eliminate operational toil, improve deployment safety, and raise overall operational efficiency. Focus on creating maintainable, robust tools and intelligent workflows that make the entire team more effective.
  5. Operational excellence and change management: Uphold and improve our standards for production operations, including runbooks, monitoring, and alerting. Vet complex changes and deployments to ensure they meet our bar for production readiness.
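
For readers unfamiliar with the error-budget idea in item 2, here is a minimal Python sketch of the usual arithmetic: an SLO target implies an allowed number of failures over a window, and the budget is whatever part of that allowance is still unspent. The function name, parameters, and sample numbers are illustrative assumptions, not anything taken from the posting or from ByteDance's tooling.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget for one window.

    Hypothetical helper for illustration only.
    1.0 means the budget is untouched, 0.0 means it is exactly spent,
    and a negative value means the SLO has been violated.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if failed_requests < 0:
        raise ValueError("failed_requests must not be negative")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Assumed example: a 99.9% availability SLO over 1,000,000 requests
# tolerates about 1,000 failures; 250 failures spends about a quarter of it.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.2%} of the error budget remains")
```

A team might treat a low or negative result as a signal to pause risky changes and prioritize reliability work, which is the trade-off item 2 describes.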

Skills

Required

  • Site Reliability Engineering
  • Production Engineering
  • Linux/Unix operating systems
  • Networking fundamentals (TCP/IP, DNS)
  • Distributed systems
  • Go
  • Python
  • Bash

Nice to have

  • Large-scale data infrastructure (e.g., MySQL, Redis, Kafka, Flink)
  • Kubernetes
  • Troubleshooting large-scale distributed systems
  • Data center operations

What the JD emphasized

  • critical production issues
  • production incidents
  • production readiness
  • AI infrastructure