(USA) Distinguished, Software Engineer-AI/ML Engineer - Agentic Systems & Site Reliability Engineering

Walmart · Retail · Sunnyvale, CA

This Distinguished Software Engineer (AI/ML) role leads the technical development of next-generation agentic AI systems and intelligent automation solutions for Walmart's Site Reliability Engineering organization. It involves architecting and implementing ML platforms and autonomous agents to ensure reliability, scalability, and operational excellence across Walmart's technology ecosystem, with work spanning predictive failure analysis, self-healing infrastructure, and automated incident response.

What you'd actually do

  1. Architect and develop advanced agentic AI systems that can autonomously handle complex reliability engineering workflows, predictive failure analysis, and self-optimization across all Walmart technology systems.
  2. Design and implement multi-agent orchestration platforms that coordinate between different AI agents for automated incident response, capacity planning, and performance optimization across e-commerce, supply chain, and in-store systems.
  3. Build intelligent observability and monitoring systems using ML-driven anomaly detection, predictive analytics, and autonomous incident resolution capabilities that span all of Walmart's technology ecosystem.
  4. Develop self-healing infrastructure platforms that leverage AI to predict, prevent, and automatically resolve system issues before they impact customers, associates, or business operations across any Walmart system.
  5. Design, write and build advanced tools to improve reliability, latency, availability, and scalability of all Walmart Tech systems, including:
     1) Engineer reliability and availability starting with metrics and measurements across all domains.
     2) Enable scaling by providing technical solutions, developing automation and/or optimizing processes for all engineering teams.
     3) Build tools/automate to prevent re-occurrence of problems across all mission critical Walmart services.
     4) Augment existing instrumentation to build a cohesive picture of system characteristics across the entire Walmart technology landscape with special attention to points of failure.

Skills

Required

  • AI/ML
  • Agentic systems
  • Autonomous systems
  • Site Reliability Engineering (SRE)
  • Machine learning platforms
  • Orchestration platforms
  • Observability and monitoring systems
  • Anomaly detection
  • Predictive analytics
  • Self-healing infrastructure
  • Reliability, latency, availability, and scalability engineering
  • Fault-tolerant systems design
  • Distributed systems
  • Troubleshooting and analysis
  • Machine learning
  • Natural language processing
  • Computer vision

Nice to have

  • LLM
  • Large-scale distributed systems
  • Hybrid cloud infrastructure

What the JD emphasized

  • lead the technical development
  • ensure mission-critical reliability, scalability, and operational excellence
  • architect and implement cutting-edge machine learning platforms and autonomous agents
  • revolutionize how we monitor, predict, and automatically resolve issues
  • technical ownership for reliability, scalability, automation, and mission-critical issues
  • drive the transformation of traditional SRE practices into AI-powered, self-healing, and autonomous systems
  • designing and building Tier 0 high-availability, resilient agentic platforms
  • define and implement unified, intelligent, operationally robust technical solutions
  • ensure the reliability, availability, and performance of all systems
  • building autonomous systems that can predict, prevent, and resolve issues
  • ensure that every system meets the highest standards of reliability, scalability, and performance
  • building a robust, intelligent, and highly automated infrastructure
  • Architect and develop advanced agentic AI systems
  • Design and implement multi-agent orchestration platforms
  • Build intelligent observability and monitoring systems
  • Develop self-healing infrastructure platforms
  • Design, write and build advanced tools to improve reliability, latency, availability, and scalability
  • Engineer reliability and availability starting with metrics and measurements
  • Enable scaling by providing technical solutions, developing automation and/or optimizing processes
  • Build tools/automate to prevent re-occurrence of problems
  • Augment existing instrumentation to build a cohesive picture of system characteristics
  • Architect and implement fault-tolerant systems and services
  • autonomous recovery and intelligent failure prediction
  • Collaborate with engineering teams and leadership
  • establish technical strategies and solutions to improve mean time to detect (MTTD) and mean time to restore (MTTR)
  • intelligent automation and predictive capabilities
  • define SLOs and build SLIs
  • Perform complex troubleshooting and analysis of large-scale distributed systems
  • Partner closely with all engineering organizations
  • deliver autonomous reliability solutions
  • advanced machine learning, natural language processing, and computer vision technologies

Other signals

  • AI/ML for SRE
  • Agentic systems
  • Autonomous systems
  • Reliability engineering