Principal, Software Engineer (USA)

Walmart · Retail · Sunnyvale, CA

Walmart is hiring a Principal Engineer to architect and lead the development of intelligent, self-healing systems that use LLM-based agents for anomaly detection, reasoning across observability data, and automated remediation. The role centers on building agentic systems for performance and resiliency at enterprise scale, shipping them to production, and integrating with observability and vector database stacks.
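The detect → reason → remediate loop the posting describes could be sketched roughly as below. This is an illustrative toy only, not Walmart's system; every name (`RemediationAgent`, `playbook`, the metric names) is hypothetical, and the dictionary lookup stands in for a real RAG query against a vector database.

```python
from dataclasses import dataclass, field

@dataclass
class Anomaly:
    metric: str
    value: float
    threshold: float

@dataclass
class RemediationAgent:
    """Toy sketch of an agentic remediation loop: detect an anomaly,
    ground a decision in retrieved context, then act autonomously."""
    playbook: dict = field(default_factory=dict)   # stand-in for a vector-DB runbook lookup
    actions_taken: list = field(default_factory=list)

    def detect(self, metric: str, value: float, threshold: float):
        # Trivial threshold check in place of a real anomaly detector.
        return Anomaly(metric, value, threshold) if value > threshold else None

    def ground(self, anomaly: Anomaly) -> str:
        # A production system would retrieve runbooks/observability context
        # and let an LLM reason over it; here we just look up a playbook.
        return self.playbook.get(anomaly.metric, "escalate_to_human")

    def remediate(self, metric: str, value: float, threshold: float) -> str:
        anomaly = self.detect(metric, value, threshold)
        if anomaly is None:
            return "healthy"
        action = self.ground(anomaly)
        self.actions_taken.append(action)  # audit trail of autonomous actions
        return action

agent = RemediationAgent(playbook={"p99_latency_ms": "scale_out_pods"})
print(agent.remediate("p99_latency_ms", 1200.0, 500.0))  # scale_out_pods
print(agent.remediate("p99_latency_ms", 200.0, 500.0))   # healthy
```

The key design point the role emphasizes is the middle step: grounding the agent's decision in retrieved context before it acts, rather than letting the model free-associate a remediation.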

What you'd actually do

  1. Architect production multi-agent pipelines — from RAG-based knowledge grounding to LLM-driven decision-making and autonomous remediation — operating across 10,500 stores and 240M weekly customers
  2. Own LLM evaluation standards for production: factuality, consistency, safety guardrails, and failure modes; set the bar that other teams adopt
  3. Optimize LLM inference at scale through prompt caching, quantization, and retrieval filtering — measurable latency and cost impact, not theoretical gains
  4. Integrate vector databases and observability stacks to build context-aware systems that act on live signals without human intervention
  5. Build the AI/ML layer that moves Walmart from reactive incident response to predictive, self-correcting infrastructure — cutting mean time to recovery across critical systems
  6. Set the architectural direction for the org's agentic AI platform — from initial design through production deployment — and own the decisions that follow
  7. Close the gap between experimentation and production: move ML models from notebooks into reliable, monitored systems that hold up under Black Friday-scale traffic
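One of the inference-optimization levers listed above, prompt caching, can be sketched in a few lines: identical prompts skip the expensive model call entirely, which is where the "measurable latency and cost impact" comes from. A minimal sketch, assuming a generic `model_fn` callable; `fake_model` and `PromptCache` are illustrative names, not any real library's API.

```python
import hashlib

class PromptCache:
    """Toy exact-match prompt cache keyed on a hash of the prompt text.
    Real systems add TTLs, semantic (embedding-based) matching, and
    cache invalidation, none of which is modeled here."""

    def __init__(self, model_fn):
        self.model_fn = model_fn
        self.store = {}
        self.hits = 0
        self.misses = 0

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.store:
            self.hits += 1            # cache hit: no model call at all
            return self.store[key]
        self.misses += 1
        result = self.model_fn(prompt)  # the expensive path
        self.store[key] = result
        return result

def fake_model(prompt: str) -> str:
    return prompt.upper()  # stand-in for real LLM inference

cache = PromptCache(fake_model)
cache.complete("summarize incident 42")
cache.complete("summarize incident 42")
print(cache.hits, cache.misses)  # 1 1
```

The hit/miss counters matter: the JD's framing ("measurable ... not theoretical gains") implies instrumenting exactly this kind of ratio so the cost savings can be reported, not asserted.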

Skills

Required

  • 10+ years of experience building and operating distributed systems at scale
  • Proven, hands-on production experience with LLMs, agentic frameworks, or RAG-based systems
  • Deep background in performance engineering, chaos engineering, or SRE — with real ownership of SLOs and incident response
  • Strong programming skills in Python and/or Java; comfort working across the full ML stack

Nice to have

  • Familiarity with ML frameworks: PyTorch, TensorFlow, Hugging Face Transformers
  • Hands-on with cloud-native infrastructure: GCP, Azure, Kubernetes, Docker
  • MLOps experience: CI/CD for ML, drift detection, model monitoring
  • Experimentation background: A/B testing, causal inference, multi-armed bandits
  • Excellent communication skills — able to align technical and non-technical stakeholders on complex architectural decisions

What the JD emphasized

  • ship to production
  • production multi-agent pipelines
  • LLM evaluation standards for production
  • Optimize LLM inference at scale
  • Build the AI/ML layer
  • move ML models from notebooks into reliable, monitored systems

Other signals

  • building agentic systems
  • LLM-based agents
  • detect anomalies
  • reason across observability data
  • trigger automated remediation
  • without waiting for a human in the loop
  • production multi-agent pipelines
  • RAG-based knowledge grounding
  • LLM-driven decision-making
  • autonomous remediation
  • LLM evaluation standards
  • optimize LLM inference at scale
  • vector databases
  • observability stacks
  • context-aware systems
  • predictive, self-correcting infrastructure
  • move ML models from notebooks into reliable, monitored systems