What you'd actually do

Apply strong hands-on skills to the design, development, and operation of large-scale cloud infrastructure and distributed systems.

Collaborate with cross-functional teams (e.g., Advertising, Machine Learning, E-commerce, and Core Infra) to drive system reliability, performance, and scalability.

Lead initiatives to automate operations, eliminate toil, and improve overall system efficiency.

Troubleshoot complex production issues, perform root-cause analysis, and drive long-term reliability improvements.

Promote best practices in system design, observability, performance optimization, and cost efficiency.

Skills

Required

Site Reliability Engineering
Software Development
Cloud-based systems design, building, scaling, and operation
Databases (SQL/NoSQL)
Kubernetes or container orchestration
Big Data processing and storage systems (streaming and batch)
System architecture
Distributed systems
Performance bottleneck analysis
Communication skills
Collaboration skills

Nice to have

Automation
Tooling
Process improvements
Cost optimization
Performance tuning
Data-driven decision making
Adopting new technologies
Improving operational practices
Influencing system design

What the JD emphasized

5+ years of experience in Site Reliability Engineering, Software Development, or related fields, with a strong focus on designing, building, scaling, and operating cloud-based systems.

Deep hands-on expertise in at least one of the following areas: Databases (SQL/NoSQL), Kubernetes or container orchestration, or Big Data processing and storage systems (streaming and batch).

Strong knowledge of system architecture, distributed systems, and performance bottlenecks.

Team Introduction: Our Site Reliability Engineering (SRE) team blends software and systems engineering to build and operate large-scale data infrastructure with high reliability and efficiency. We provide a dependable cloud environment that powers our global business. In this role, you will leverage your expertise in data center architecture, data infrastructure services, and systems and tools development to solve complex scaling and reliability challenges.

We’re looking for a Technical Lead (SRE) who can provide deep technical leadership, drive architectural improvements, and collaborate effectively across multiple organizations. You’ll partner with engineering, product, data, and infrastructure teams to deliver resilient, scalable platforms. This is a highly technical, hands-on role that requires strong problem-solving ability, clear communication, and the ability to influence without formal authority.

Responsibilities

Apply strong hands-on skills to the design, development, and operation of large-scale cloud infrastructure and distributed systems.
Collaborate with cross-functional teams (e.g., Advertising, Machine Learning, E-commerce, and Core Infra) to drive system reliability, performance, and scalability.
Lead initiatives to automate operations, eliminate toil, and improve overall system efficiency.
Troubleshoot complex production issues, perform root-cause analysis, and drive long-term reliability improvements.
Promote best practices in system design, observability, performance optimization, and cost efficiency.
Communicate complex technical concepts effectively to both technical and non-technical stakeholders.

Requirements

Minimum Qualifications

5+ years of experience in Site Reliability Engineering, Software Development, or related fields, with a strong focus on designing, building, scaling, and operating cloud-based systems.
Deep hands-on expertise in at least one of the following areas: Databases (SQL/NoSQL), Kubernetes or container orchestration, or Big Data processing and storage systems (streaming and batch).
Strong knowledge of system architecture, distributed systems, and performance bottlenecks.
Excellent communication and collaboration skills, with experience working across engineering, product, and data science teams.

Preferred Qualifications

Proven track record of driving automation, tooling, and process improvements that enhance reliability and efficiency.
Experience in cost optimization and performance tuning at scale, backed by data-driven decision making.
Thought leadership in adopting new technologies, improving operational practices, and influencing system design.
