Site Reliability Engineer II

Mastercard · Fintech · O'Fallon, MO +1 · Engineering

Mastercard is seeking a Site Reliability Engineer II to ensure the reliability, scalability, and performance of the applications that support its global operations. The role involves acting as a production readiness steward, fostering developer ownership, and supporting the build phase with operational design, automation, capacity planning, and monitoring. Responsibilities include daily operations with a focus on triage, root cause analysis, blameless post-mortems, risk management, and compliance, as well as aligning product priorities with operational needs. The engineer will work independently on projects, support high-availability systems, assist in evaluating operational needs, contribute to automation, troubleshoot system issues, document procedures, and participate in quality checks.

What you'd actually do

  1. Work independently on elements of projects/processes within the Site Reliability Engineering area by applying intermediate/practical knowledge and area best practices to meet organizational standards of quality and excellence.
  2. Support the implementation and maintenance of high-availability systems to ensure operational stability.
  3. Assist in evaluating operational needs and developing technical solutions under guidance.
  4. Contribute to automation and scripting projects to streamline routine operational tasks.
  5. Troubleshoot and resolve basic to moderate system issues, escalating more complex problems as needed.
  6. Document operational procedures and share knowledge with team members.
  7. Participate in quality checks and reviews to ensure system stability and reliability.
  8. Utilize experience and a comprehensive understanding of area processes and tools to make minor adjustments or enhancements to resolve identifiable issues. May manage smaller projects/initiatives as an experienced individual contributor with specialized knowledge within the Site Reliability Engineering area.

Skills

Required

  • Observability - Ability to use scripting and tooling to implement observability solutions, enabling the collection, analysis, and visualization of metrics, logs, and traces to support incident detection, diagnosis, and continuous service improvement.
  • Programming and Scripting - Ability to write and maintain code and scripts to automate tasks, build operational tools, and support monitoring, deployment, and incident response using languages such as Python, Go, Bash, or similar.
  • Systems and Network Administration - Ability to configure, operate, and troubleshoot Linux/Unix systems and network components, applying knowledge of networking concepts, protocols, security, and system reliability.
  • Cloud Computing and Infrastructure - Ability to design, deploy, and manage applications and infrastructure on cloud platforms (e.g., AWS, Azure, GCP), ensuring scalability, security, availability, and operational efficiency.
  • Reliability and Scalability - Ability to design and operate systems for high availability, fault tolerance, and disaster recovery, while ensuring systems can scale to meet current and future demand.
  • DevOps Practices