Senior Site Reliability Engineer

Mastercard · Fintech · Mexico City, Mexico · Engineering

Senior Site Reliability Engineer for the Xborder team, focusing on solving problems, implementing automation, and leveraging best practices to ensure the reliability, scalability, and performance of the platform. This role involves engaging in the full lifecycle of services, from design to operation, and embedding SRE principles into the delivery process.

What you'd actually do

  1. Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation, and refinement.
  2. Analyze the platform's ITSM activities and provide a feedback loop to development teams on operational gaps or resiliency concerns.
  3. Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.
  4. Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  5. Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.

Skills

Required

  • UNIX/Linux systems
  • scripting and automation
  • DevOps practices
  • CI/CD pipelines
  • operating systems
  • platforms
  • infrastructure components
  • ITSM processes (Change and Problem Management)
  • observability and monitoring tools (Splunk, Dynatrace)
  • analytical skills
  • problem-solving skills
  • planning skills
  • communication skills
  • ability to work independently
  • collaboration skills
  • relationship-building skills
  • customer service skills

Nice to have

  • C
  • C++
  • Java
  • Python
  • Go
  • Perl
  • Ruby
  • Knowledge of artificial intelligence use cases and implementation

What the JD emphasized

  • production readiness owner
  • reliability
  • scalability
  • performance
  • availability
  • capacity
  • observability
  • self-healing
  • deployment automation
  • operational excellence
  • production event
  • mean time to recover
  • production readiness
  • operational gaps
  • resiliency concerns
  • system design consulting
  • capacity planning
  • launch reviews
  • system health
  • scale systems sustainably
  • evolve systems
  • velocity
  • incident response
  • blameless postmortems
  • holistic approach
  • connecting the dots
  • technology stack
  • optimize mean time to recover
  • global team
  • tech hubs
  • multiple geographies
  • time zones
  • share knowledge
  • mentor junior resources
  • hands-on experience
  • UNIX/Linux systems
  • scripting and automation
  • DevOps practices
  • CI/CD pipelines
  • operating systems
  • platforms
  • infrastructure components
  • ITSM processes
  • Change and Problem Management
  • observability and monitoring tools
  • analytical
  • problem-solving
  • planning skills
  • manage multiple priorities
  • work effectively under pressure
  • communication skills
  • work independently
  • minimal supervision
  • collaboration
  • relationship-building
  • customer service skills