Lead Site Reliability Engineer

Mastercard · Fintech · O'Fallon, MO +1 · Engineering

Lead Site Reliability Engineer responsible for driving SRE and DevOps maturity, shaping reliability strategy, defining standards, and elevating operational excellence across critical platforms. Focuses on resilience, scalability, and customer trust, partnering with engineering, architecture, and security teams. Ensures platforms are highly available, observable, self-healing, secure, compliant, and operated through repeatable processes. Influences architecture, design, and delivery to embed reliability and operability, with a shift-left operational mindset. Owns availability, latency, performance, and reliability objectives, leads incident response, and champions blameless postmortems. Drives CI/CD strategy, automation adoption, and defines standards for monitoring and alerting. Partners with security and compliance teams to embed controls and ensure regulatory requirements are met. Mentors engineers and contributes to best practices.

What you'd actually do

  1. Act as a Lead-level technical authority for reliability, operability, and production readiness across multiple platforms or programs.
  2. Own and evolve availability, latency, performance, and reliability objectives for critical systems.
  3. Provide leadership for CI/CD strategy, ensuring pipelines support automated validation, risk-based gating, and safe, repeatable deployments.
  4. Define and promote standards for monitoring, alerting, SLOs, and telemetry.
  5. Partner with security, risk, and compliance teams to embed controls, auditability, and regulatory requirements into platform design and operations.

Skills

Required

  • distributed systems
  • reliability engineering
  • production operations
  • algorithms
  • data structures
  • system design
  • automation
  • troubleshooting
  • Python
  • Go
  • Bash
  • DevOps tooling
  • observability tooling
  • CI/CD pipelines

Nice to have

  • Certificate Management
  • PKI
  • Authentication

What the JD emphasized

  • operational risk
  • reliability
  • scalability
  • customer trust
  • resilience
  • automation
  • observability
  • compliance
  • auditable
  • risk management
  • controls
  • regulatory requirements