Principal, Software Engineer (USA)

Walmart · Retail · Sunnyvale, CA

Walmart is hiring a Principal Engineer to architect and lead the development of intelligent, self-healing systems that use LLM-based agents for anomaly detection, reasoning across observability data, and automated remediation. The role centers on building agentic systems for performance and resiliency at enterprise scale, shipping them to production, and integrating with observability and vector database stacks.
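The detect → reason → remediate loop the posting describes could be sketched roughly as below. This is an illustrative toy only, not Walmart's system; every name (`RemediationAgent`, `playbook`, the metric names) is hypothetical, and the dictionary lookup stands in for a real RAG query against a vector database.

```python
from dataclasses import dataclass, field

@dataclass
class Anomaly:
    metric: str
    value: float
    threshold: float

@dataclass
class RemediationAgent:
    """Toy sketch of an agentic remediation loop: detect an anomaly,
    ground a decision in retrieved context, then act autonomously."""
    playbook: dict = field(default_factory=dict)   # stand-in for a vector-DB runbook lookup
    actions_taken: list = field(default_factory=list)

    def detect(self, metric: str, value: float, threshold: float):
        # Trivial threshold check in place of a real anomaly detector.
        return Anomaly(metric, value, threshold) if value > threshold else None

    def ground(self, anomaly: Anomaly) -> str:
        # A production system would retrieve runbooks/observability context
        # and let an LLM reason over it; here we just look up a playbook.
        return self.playbook.get(anomaly.metric, "escalate_to_human")

    def remediate(self, metric: str, value: float, threshold: float) -> str:
        anomaly = self.detect(metric, value, threshold)
        if anomaly is None:
            return "healthy"
        action = self.ground(anomaly)
        self.actions_taken.append(action)  # audit trail of autonomous actions
        return action

agent = RemediationAgent(playbook={"p99_latency_ms": "scale_out_pods"})
print(agent.remediate("p99_latency_ms", 1200.0, 500.0))  # scale_out_pods
print(agent.remediate("p99_latency_ms", 200.0, 500.0))   # healthy
```

The key design point the role emphasizes is the middle step: grounding the agent's decision in retrieved context before it acts, rather than letting the model free-associate a remediation.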

What you'd actually do

  1. Architect production multi-agent pipelines — from RAG-based knowledge grounding to LLM-driven decision-making and autonomous remediation — operating across 10,500 stores and 240M weekly customers
  2. Own LLM evaluation standards for production: factuality, consistency, safety guardrails, and failure modes; set the bar that other teams adopt
  3. Optimize LLM inference at scale through prompt caching, quantization, and retrieval filtering — measurable latency and cost impact, not theoretical gains
  4. Integrate vector databases and observability stacks to build context-aware systems that act on live signals without human intervention
  5. Build the AI/ML layer that moves Walmart from reactive incident response to predictive, self-correcting infrastructure — cutting mean time to recovery across critical systems
  6. Set the architectural direction for the org's agentic AI platform — from initial design through production deployment — and own the decisions that follow
  7. Close the gap between experimentation and production: move ML models from notebooks into reliable, monitored systems that hold up under Black Friday-scale traffic
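One of the inference-optimization levers listed above, prompt caching, can be sketched in a few lines: identical prompts skip the expensive model call entirely, which is where the "measurable latency and cost impact" comes from. A minimal sketch, assuming a generic `model_fn` callable; `fake_model` and `PromptCache` are illustrative names, not any real library's API.

```python
import hashlib

class PromptCache:
    """Toy exact-match prompt cache keyed on a hash of the prompt text.
    Real systems add TTLs, semantic (embedding-based) matching, and
    cache invalidation, none of which is modeled here."""

    def __init__(self, model_fn):
        self.model_fn = model_fn
        self.store = {}
        self.hits = 0
        self.misses = 0

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.store:
            self.hits += 1            # cache hit: no model call at all
            return self.store[key]
        self.misses += 1
        result = self.model_fn(prompt)  # the expensive path
        self.store[key] = result
        return result

def fake_model(prompt: str) -> str:
    return prompt.upper()  # stand-in for real LLM inference

cache = PromptCache(fake_model)
cache.complete("summarize incident 42")
cache.complete("summarize incident 42")
print(cache.hits, cache.misses)  # 1 1
```

The hit/miss counters matter: the JD's framing ("measurable ... not theoretical gains") implies instrumenting exactly this kind of ratio so the cost savings can be reported, not asserted.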

Skills

Required

  • 10+ years of experience building and operating distributed systems at scale
  • Proven, hands-on production experience with LLMs, agentic frameworks, or RAG-based systems
  • Deep background in performance engineering, chaos engineering, or SRE — with real ownership of SLOs and incident response
  • Strong programming skills in Python and/or Java; comfort working across the full ML stack

Nice to have

  • Familiarity with ML frameworks: PyTorch, TensorFlow, Hugging Face Transformers
  • Hands-on with cloud-native infrastructure: GCP, Azure, Kubernetes, Docker
  • MLOps experience: CI/CD for ML, drift detection, model monitoring
  • Experimentation background: A/B testing, causal inference, multi-armed bandits
  • Excellent communication skills — able to align technical and non-technical stakeholders on complex architectural decisions

What the JD emphasized

  • ship to production
  • production multi-agent pipelines
  • LLM evaluation standards for production
  • Optimize LLM inference at scale
  • Build the AI/ML layer
  • move ML models from notebooks into reliable, monitored systems

Other signals

  • building agentic systems
  • LLM-based agents
  • detect anomalies
  • reason across observability data
  • trigger automated remediation
  • without waiting for a human in the loop
  • production multi-agent pipelines
  • RAG-based knowledge grounding
  • LLM-driven decision-making
  • autonomous remediation
  • LLM evaluation standards
  • optimize LLM inference at scale
  • vector databases
  • observability stacks
  • context-aware systems
  • predictive, self-correcting infrastructure
  • move ML models from notebooks into reliable, monitored systems