What you'd actually do

Lead and own the full lifecycle of services—from architecture and design through deployment, operations, and continuous optimization—ensuring scalability, reliability, and alignment with business objectives.

Analyze platform-level ITSM performance and proactively establish feedback loops with engineering teams, influencing roadmap prioritization to address systemic gaps and improve resiliency.

Define and drive production readiness standards, including operational design reviews, capacity planning, and launch governance, ensuring services meet reliability and scalability benchmarks before go-live.

Define and evolve monitoring frameworks for availability, latency, and system health, leveraging metrics and telemetry to proactively prevent incidents and improve service performance.

Champion automation-first principles to scale systems efficiently, reducing manual toil while improving deployment velocity and overall system reliability.

Skills

Required

site reliability engineering
infrastructure
DevOps
Linux/UNIX systems
operating systems
database environments (Oracle/SQL, DBA)
observability and monitoring tools (Splunk, Dynatrace)
DevOps and CI/CD practices
programming or scripting languages (Python, Java, Go, C/C++, Perl, or Ruby)
Security and/or Enterprise Monitoring environments
coding and system-level design
designing, analyzing, and troubleshooting large-scale distributed systems
program management capabilities
leading large-scale, cross-functional initiatives
working across development, operations, and product teams
cloud platforms (AWS)
cloud-native architectures
operational best practices

Nice to have

computer science
Engineering
Physics
Mathematics
equivalent practical experience

What the JD emphasized

production readiness steward

developer run ownership

operational design

automation

capacity planning

monitoring

fault-tolerant

scalable products

agile and learning culture

triage

root cause

shift left

proactive

risk management

compliance

risk mitigation

streamlining

standardizing

centralizing points of interaction

communicating effectively

align Product and Customer Focused priorities with Operational needs

run state

feedback loop

customer experience

scalability

reliability

alignment with business objectives

platform-level ITSM performance

feedback loops with engineering teams

roadmap prioritization

systemic gaps

resiliency

production readiness standards

operational design reviews

launch governance

reliability and scalability benchmarks

monitoring frameworks

availability

latency

system health

metrics and telemetry

proactively prevent incidents

service performance

automation-first principles

scale systems efficiently

reducing manual toil

deployment velocity

system reliability

CI/CD pipelines

robust validation

operational gates

best practices

consistency

quality

speed across environments

incident response practices

rapid mitigation

stakeholder communication

blameless postmortems

continuous improvement

resilience

holistic, system-wide approach

critical incidents

collaborate effectively

distributed, global teams

alignment

continuity

high performance

technical leader

mentor

developing junior engineers

promoting best practices

raising the overall bar for engineering excellence

Our Purpose

Mastercard powers economies and empowers people in 200+ countries and territories worldwide. Together with our customers, we’re helping build a sustainable economy where everyone can prosper. We support a wide range of digital payments choices, making transactions secure, simple, smart and accessible. Our technology and innovation, partnerships and networks combine to deliver a unique set of products and services that help people, businesses and governments realize their greatest potential.

Title and Summary

Lead Site Reliability Engineer

Overview: The Business Operations (Biz Ops) organization is the production readiness steward for Mastercard products. As a Business Operations team, we are responsible for ensuring that our platform is stable and healthy. We break down barriers to running our products by fostering developer run ownership and empowering developers to build resilient products. During the application build phase, we support our developers in software run principles, including operational design, automation, capacity planning, and monitoring, that lead to fault-tolerant, scalable products. We see the big picture and help create and enforce operations standards while facilitating an agile and learning culture. We accomplish this transformation by supporting daily operations with a hyper focus on triage and then root cause, grounded in an understanding of the business impact of our products. The goal of every Biz Ops team is to shift left, becoming more proactive and involved earlier in the development process, and to proactively manage production and change activities to maximize customer experience and increase the overall value of supported applications. Biz Ops teams also focus on risk management by tying all our activities together with an overarching responsibility for compliance and risk mitigation across all our environments. Biz Ops also focuses on streamlining and standardizing traditional application-specific support activities and on centralizing points of interaction for both internal and external partners by communicating effectively with all key stakeholders.

Ultimately, the role of Biz Ops is to align Product and Customer Focused priorities with Operational needs. We regularly review our run state not only from an internal perspective but also by understanding and providing the feedback loop to our development partners on how we can improve the customer experience of our applications.

Key Responsibilities

• Lead and own the full lifecycle of services, from architecture and design through deployment, operations, and continuous optimization, ensuring scalability, reliability, and alignment with business objectives.
• Analyze platform-level ITSM performance and proactively establish feedback loops with engineering teams, influencing roadmap prioritization to address systemic gaps and improve resiliency.
• Define and drive production readiness standards, including operational design reviews, capacity planning, and launch governance, ensuring services meet reliability and scalability benchmarks before go-live.
• Define and evolve monitoring frameworks for availability, latency, and system health, leveraging metrics and telemetry to proactively prevent incidents and improve service performance.
• Champion automation-first principles to scale systems efficiently, reducing manual toil while improving deployment velocity and overall system reliability.
• Lead the design and governance of CI/CD pipelines, implementing robust validation, operational gates, and best practices to drive consistency, quality, and speed across environments.
• Drive best-in-class incident response practices, including rapid mitigation, stakeholder communication, and blameless postmortems, ensuring continuous improvement and resilience.
• Take a holistic, system-wide approach during critical incidents, connect
• Collaborate effectively across distributed, global teams, ensuring alignment, continuity, and high performance across time zones and technology hubs.
• Act as a technical leader and mentor, developing junior engineers, promoting best practices, and raising the overall bar for engineering excellence within the organization.

All about you

• Bachelor’s degree in Computer Science, Engineering, or a related technical field (e.g., Physics, Mathematics), or equivalent practical experience.
• 8–15 years of relevant experience in Site Reliability Engineering, Infrastructure, or DevOps roles, with a combination of hands-on technical expertise and early leadership responsibilities.
• Strong technical foundation across enterprise platforms, Linux/UNIX systems, operating systems, and database environments (Oracle/SQL, DBA), with the ability to provide technical guidance and support to the team.
• Experience with observability and monitoring tools (e.g., Splunk, Dynatrace), driving improved system visibility, performance, and reliability.
• Solid experience in DevOps and CI/CD practices, with the ability to support and guide automation, deployment pipelines, and operational improvements.
• Proficiency in one or more programming or scripting languages such as Python, Java, Go, C/C++, Perl, or Ruby, with practical application in automation or system
• Strong foundation in Security and/or Enterprise Monitoring environments, with exposure to coding and system-level design.
• Experience designing, analyzing, and troubleshooting large-scale distributed systems, with a strong focus on reliability, scalability, and performance optimization.
• Strong program management capabilities, with a track record of successfully leading large-scale, cross-functional initiatives from concept through execution.
• Extensive experience working across development, operations, and product teams to prioritize initiatives, build strong partnerships, and deliver end-to-end solutions.
• Practical knowledge of cloud platforms, preferably AWS, with familiarity in cloud-native architectures and operational best practices.
• Ability to critically assess existing processes and challenge the status quo, identifying opportunities to improve efficiency, scalability, and overall business impact.

We are seeking site reliability engineers who have an appetite for change and can push the boundaries of what can be accomplished through automation, while managing service levels for some of Mastercard’s most critical security services.

Corporate Security Responsibility

All activities involving access to Mastercard assets, information, and networks come with an inherent risk to the organization and, therefore, it is expected that every person working for, or on behalf of, Mastercard is responsible for information security and must:

Abide by Mastercard’s security policies and practices;
Ensure the confidentiality and integrity of the information being accessed;
Report any suspected information security violation or breach; and
Complete all periodic mandatory security trainings in accordance with Mastercard’s guidelines.
