Site Reliability Engineer - Data (Seattle)

ByteDance · Big Tech · Seattle, WA · R&D

Site Reliability Engineer role focused on data infrastructure, covering applied machine learning (AML) support, data center, and recommendation systems. Responsibilities include service lifecycle management, designing monitoring frameworks, developing cloud-managed data infrastructure components (Kubernetes, Redis, MySQL, Flink), and scaling systems through automation. Requires a Bachelor's degree, 3+ years of programming experience, and familiarity with Unix/Linux, networking, and distributed systems.

What you'd actually do

  1. Participate in and enhance the complete service lifecycle, from inception and design, through development, capacity planning, launch reviews, deployment, operation, and refinement.
  2. Design and implement software platforms and monitoring frameworks to govern service-oriented architecture (SOA) efficiently, automatically, and intelligently.
  3. Develop and manage components of cloud-managed data infrastructure, encompassing technologies such as Kubernetes, Redis, MySQL, Flink, and more.
  4. Establish sustainable mechanisms for scaling systems, such as automation, to drive enhancements in reliability, efficiency, and velocity.
  5. Provide sustainable user support, manage incident responses, and conduct blameless postmortems as part of our ongoing efforts to improve our systems.

Skills

Required

  • Bachelor's degree in Computer Science or a related technical field
  • 3+ years of experience programming in C, C++, Java, Python, Go, or Rust
  • Familiarity with Unix/Linux system internals
  • Familiarity with networking
  • Familiarity with distributed systems

Nice to have

  • MySQL
  • Redis
  • Nginx
  • Kubernetes
  • Docker
  • OpenStack
  • Hadoop
  • Spark
  • Flink
  • Designing large-scale distributed systems
  • Analyzing large-scale distributed systems
  • Problem solving
  • Communication