Lead Site Reliability Engineer

Mastercard · Fintech · Mexico City, Mexico · Engineering

The Lead Site Reliability Engineer is responsible for the production readiness, stability, and health of Mastercard products. The role focuses on fostering developer ownership and on applying operational design, automation, capacity planning, and monitoring to build fault-tolerant, scalable, and resilient products. The team aims to shift left in the development process, manage production proactively, and mitigate risk while maintaining compliance. Key responsibilities include leading the full lifecycle of services, analyzing ITSM performance, defining production readiness standards, evolving monitoring frameworks, championing automation, driving CI/CD pipelines, leading incident response, and mentoring junior engineers.

What you'd actually do

  1. Lead and own the full lifecycle of services—from architecture and design through deployment, operations, and continuous optimization—ensuring scalability, reliability, and alignment with business objectives.
  2. Analyze platform-level ITSM performance and proactively establish feedback loops with engineering teams, influencing roadmap prioritization to address systemic gaps and improve resiliency.
  3. Define and drive production readiness standards, including operational design reviews, capacity planning, and launch governance, ensuring services meet reliability and scalability benchmarks before go-live.
  4. Define and evolve monitoring frameworks for availability, latency, and system health, leveraging metrics and telemetry to proactively prevent incidents and improve service performance.
  5. Champion automation-first principles to scale systems efficiently, reducing manual toil while improving deployment velocity and overall system reliability.

Skills

Required

  • site reliability engineering
  • infrastructure
  • DevOps
  • Linux/UNIX systems
  • operating systems
  • database environments (Oracle/SQL, DBA)
  • observability and monitoring tools (Splunk, Dynatrace)
  • DevOps and CI/CD practices
  • programming or scripting languages (Python, Java, Go, C/C++, Perl, or Ruby)
  • Security and/or Enterprise Monitoring environments
  • coding and system-level design
  • designing, analyzing, and troubleshooting large-scale distributed systems
  • program management capabilities
  • leading large-scale, cross-functional initiatives
  • working across development, operations, and product teams
  • cloud platforms (AWS)
  • cloud-native architectures
  • operational best practices

Nice to have

  • Computer Science
  • Engineering
  • Physics
  • Mathematics
  • equivalent practical experience

What the JD emphasized

  • production readiness steward
  • developer run ownership
  • operational design
  • automation
  • capacity planning
  • monitoring
  • fault-tolerant
  • scalable products
  • agile and learning culture
  • triage
  • root cause
  • shift left
  • proactive
  • risk management
  • compliance
  • risk mitigation
  • streamlining
  • standardizing
  • centralizing points of interaction
  • communicating effectively
  • align Product and Customer Focused priorities with Operational needs
  • run state
  • feedback loop
  • customer experience
  • scalability
  • reliability
  • alignment with business objectives
  • platform-level ITSM performance
  • feedback loops with engineering teams
  • roadmap prioritization
  • systemic gaps
  • resiliency
  • production readiness standards
  • operational design reviews
  • launch governance
  • reliability and scalability benchmarks
  • monitoring frameworks
  • availability
  • latency
  • system health
  • metrics and telemetry
  • proactively prevent incidents
  • service performance
  • automation-first principles
  • scale systems efficiently
  • reducing manual toil
  • deployment velocity
  • system reliability
  • CI/CD pipelines
  • robust validation
  • operational gates
  • best practices
  • consistency
  • quality
  • speed across environments
  • incident response practices
  • rapid mitigation
  • stakeholder communication
  • blameless postmortems
  • continuous improvement
  • resilience
  • holistic, system-wide approach
  • critical incidents
  • collaborate effectively
  • distributed, global teams
  • alignment
  • continuity
  • high performance
  • technical leader
  • mentor
  • developing junior engineers
  • promoting best practices
  • raising the overall bar for engineering excellence