Lead Site Reliability Engineer

Mastercard · Fintech · O'Fallon, MO +1 · Engineering

Lead Site Reliability Engineer responsible for ensuring the reliability, scalability, and performance of Java and Spring Framework applications. Focuses on production readiness, operational design, automation, capacity planning, monitoring, triage, root cause analysis, and compliance. Fosters developer ownership and resilience while aligning product priorities with operational needs.

What you'd actually do

  1. Ensure application health, performance, and capacity; support pre-launch activities.
  2. Design resilient infrastructure, perform root cause analysis, and automate alerts.
  3. Improve service lifecycle, support CI/CD pipelines, and reduce manual intervention.
  4. Analyze platform activities and provide feedback to development teams.

Skills

Required

  • Java
  • Spring Framework
  • Python
  • Go
  • DevOps
  • configuration management
  • distributed systems
  • automation
  • problem-solving
  • communication
  • leadership
  • observability
  • continuous improvement

What the JD emphasized

  • production readiness steward
  • ensuring that our platform is stable and healthy
  • fostering developer run ownership
  • empowering developers to build resilient products
  • operational design, automation, capacity planning, monitoring
  • fault-tolerant, scalable products
  • agile and learning culture
  • ensuring the reliability, scalability, and performance of our Java, Spring Framework applications
  • triage, root cause by understanding the business impact
  • blameless post-mortems
  • proactive and upfront in the development process
  • proactively manage production and change activities
  • risk management by tying all our activities together with an overarching responsibility for compliance and risk mitigation
  • align Product and Customer Focused priorities with Operational needs
  • Operational Readiness Architect
  • Site Reliability Engineering
  • DevOps/Automation
  • ITSM Practices
  • distributed systems and automation
  • observability and continuous improvement