Distinguished Software Engineer, AI/ML Engineer – Agentic Systems

Walmart · Retail · Sunnyvale, CA

Walmart is seeking a Distinguished AI/ML Engineer to lead the technical development of next-generation agentic AI systems and intelligent automation solutions for its reliability engineering organization. The role involves architecting and implementing ML platforms and autonomous agents that manage change, performance, monitoring, prediction, and issue resolution across Walmart's technology ecosystem, with the goal of self-healing, autonomous systems.

What you'd actually do

  1. Architect and develop advanced agentic AI systems that autonomously manage complex reliability engineering workflows, predictive failure analysis, and self-optimization across Walmart’s technology ecosystem.
  2. Design and implement multi-agent orchestration platforms that coordinate autonomous agents for change management, capacity planning, and performance optimization across e-commerce, supply chain, and in-store systems.
  3. Build intelligent observability and monitoring platforms using ML-driven anomaly detection, predictive analytics, and autonomous resolution across Walmart’s entire technology landscape.
  4. Develop self-healing infrastructure platforms that leverage AI to predict, prevent, and automatically remediate system issues before they impact customers, associates, or business operations.
  5. Design, write, and build advanced tools to improve latency, availability, scalability and change management across Walmart Technology systems, including:
     • Engineering reliability using metrics and measurements across all domains
     • Enabling system scaling through technical solutions, automation, and process optimization
     • Building tools and automation to prevent recurrence of failures across mission-critical services
     • Enhancing instrumentation to create a cohesive, end-to-end view of system health with particular focus on failure points

Skills

Required

  • AI/ML
  • Agentic Systems
  • Autonomous Systems
  • Machine Learning Platforms
  • Orchestration
  • Observability
  • Monitoring
  • Anomaly Detection
  • Predictive Analytics
  • Self-healing Infrastructure
  • Reliability Engineering
  • Scalability
  • Automation
  • Distributed Systems
  • Fault-tolerant Systems
  • CI/CD
  • Large Language Models
  • Reinforcement Learning
  • Multi-modal AI
  • Federated Learning

Nice to have

  • NLP
  • Computer Vision

What the JD emphasized

  • agentic AI systems
  • autonomous systems
  • multi-agent orchestration platforms
  • ML-driven anomaly detection
  • predictive analytics
  • autonomous resolution
  • self-healing infrastructure platforms
  • mean time to detect (MTTD)
  • mean time to restore (MTTR)
  • autonomous reliability solutions
  • MLOps and AIOps platforms
  • agentic AI technologies

Other signals

  • AI/ML for reliability engineering
  • agentic AI systems
  • autonomous systems
  • ML-driven anomaly detection
  • predictive analytics
  • self-healing infrastructure