(USA) Site Reliability Operations III

Walmart · Retail · Bentonville, AR

This role focuses on ensuring the reliability, availability, and performance of mission-critical applications and platforms within Walmart's Supply Chain & Transportation. Responsibilities include incident management, monitoring and alerting, troubleshooting, and applying DevOps best practices. The ideal candidate will partner with engineering and business stakeholders to drive continuous improvement.

What you'd actually do

  1. Lead and contribute to incident triage, diagnosis, and restoration within defined SLAs.
  2. Define and monitor SLIs, SLOs, and KPIs such as availability, latency, MTBF, and MTTR.
  3. Independently troubleshoot application performance and availability issues.
  4. Execute complex application maintenance, including corrective, adaptive, and reengineering activities.
  5. Build strong partnerships with engineering, operations, and business stakeholders.
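
For readers less familiar with the metrics named in item 2, the sketch below shows one common way to derive availability, MTTR, and MTBF from a list of outages in a reporting window. It is an illustration only, not Walmart's method: the `Incident` type, the `reliability_metrics` function, and the sample numbers are all hypothetical, and it assumes MTBF is measured as uptime divided by the number of failures.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical outage record; times are minutes since the window start."""
    start_min: float  # when the outage began
    end_min: float    # when service was restored


def reliability_metrics(window_min, incidents):
    """Compute availability, MTTR, and MTBF for one reporting window.

    Assumes incidents do not overlap and fall entirely inside the window.
    """
    downtime = sum(i.end_min - i.start_min for i in incidents)
    uptime = window_min - downtime
    count = len(incidents)
    return {
        # Fraction of the window during which the service was up.
        "availability": uptime / window_min,
        # Mean time to restore: average outage duration.
        "mttr_min": downtime / count if count else 0.0,
        # Mean time between failures: uptime per failure (assumed definition).
        "mtbf_min": uptime / count if count else float("inf"),
    }


# Made-up example: a 30-day window with a 30-minute and a 90-minute outage.
window = 30 * 24 * 60
m = reliability_metrics(window, [Incident(1000, 1030), Incident(20000, 20090)])
print(m)
```

An SLO is then a target on such a value, for example availability of at least 99.9% over the window; the two sample outages above would miss that target.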

Skills

Required

  • incident management
  • monitoring
  • production support
  • alerting
  • debugging
  • application performance analysis
  • RCA/RCCA
  • DevOps mindset

Nice to have

  • SRE certification

What the JD emphasized

  • incident management
  • monitoring
  • production support
  • alerting
  • debugging
  • application performance analysis
  • RCA/RCCA
  • DevOps mindset