Site Reliability Engineer - Video Infrastructure

ByteDance · Big Tech · Seattle, WA · R&D

Site Reliability Engineer for ByteDance's Video Cloud Infra team, focusing on building and managing global infrastructure for multimedia transport, storage, and processing. Responsibilities include system management, automation, and ensuring reliability and efficiency of large-scale distributed systems.

What you'd actually do

Build global infrastructure for multimedia transport, storage, and processing, serving billions of users worldwide.
Engage in global production system management such as monitoring, emergency response, capacity planning and optimization.
Build tools, automations, visualizations and monitors to facilitate the operation and optimization of the global infrastructure.
Engage in and improve the whole service lifecycle, from inception and design, through deployment, operation and refinement.
Scale up systems sustainably through mechanisms like automation, and initiate changes that improve system reliability and processing speed.

Skills

Required

Bachelor's degree in Computer Science or a related technical background involving software/system engineering, or equivalent working experience.
Extensive knowledge of SRE responsibilities, such as monitoring, incident handling, capacity management and disaster recovery.
Extensive knowledge of networking, operating systems, database systems, and container technology.
Good understanding of every aspect of microservice architecture, and hands-on experience troubleshooting large-scale distributed systems.

Nice to have

Good programming experience with at least one of the following languages: C, C++, Java, Python, or Go.
Hands-on experience with common open-source systems such as Linux, MySQL, MongoDB, Redis, and ELK; experience building solutions with AWS, Google Cloud, Azure, and other cloud services is a plus.
Passionate and self-motivated, with strong ownership and good teamwork skills.


Team Introduction

The Video Cloud Infra team focuses on business experience and cost. It builds a competitive video transmission network and multimedia processing platform, develops data foundations and analysis capabilities, drives refined product operations, and works to reduce costs and increase efficiency.

