Position Summary...

Building the right technology foundation for Infrastructure & platforms is vital to success at the scale of Walmart. Our team builds and maintains the foundational technologies that support the tech organization. Included in this are data platforms, enterprise architecture, DevOps, cloud computing, and infrastructure. All of these products and services are supported by scalable and powerful infrastructure, ensuring a secure and seamless employee and customer experience across stores, digital channels, and distribution centers.

What you'll do...

Walmart Global Tech's Site Reliability Engineering organization is built with hybrid systems and software engineers who take technical ownership for reliability, scalability, automation, and mission-critical issues related to the uptime, availability, and fast rate of improvement of Walmart's e-commerce, stores, and omni-channel platform. As a technical expert in this domain, you'll drive the transformation of traditional SRE practices into AI-powered, self-healing, and autonomous systems built on modern tech stacks with intelligent capacity management and predictive performance optimization. You'll be responsible for designing and building Tier 0 high-availability, resilient agentic platforms that serve as the backbone for reliability engineering across all of Walmart's systems, stores, and facilities in the US and international markets. You'll also define and implement unified, intelligent, operationally robust technical solutions and tools for all Walmart Technology organizations across all channels and geographies.

What you'll do:

Design, write, and build tools to improve the reliability, latency, availability, and scalability of the Walmart Tech stack, including: 1) Engender reliability and availability, starting with metrics and measurements. 2) Enable scaling by providing tools, developing training, and/or augmenting processes. 3) Build tools and automation to prevent recurrence of problems in mission-critical products and services. 4) Augment existing instrumentation to build a cohesive picture of the characteristics of our systems, with special attention to points of failure.
Drive the team to build and scale fault-tolerant systems and services in our hybrid cloud infrastructure.
Partner with leadership across the organization to establish strategic plans and objectives to improve the mean time to detect and mean time to restore.
Collaborate with service owners to define the SLOs and build SLIs to ensure systems are meeting the SLAs.
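
To make the reliability measures named above concrete, here is a minimal Python sketch of how an availability SLI, the remaining error budget against an SLO, and mean time to detect and restore might be computed. It is illustrative only and does not describe Walmart's tooling: the `Incident` record, the function names, and the choice to measure time to restore from the start of the fault are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record used only for this example."""
    started: datetime    # when the fault began
    detected: datetime   # when an alert or a person noticed it
    restored: datetime   # when service returned to normal


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    return good_requests / total_requests if total_requests else 1.0


def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget left; negative means the SLO is breached.

    Assumes slo < 1.0, since a 100% target leaves no budget to divide by.
    """
    allowed_failure = 1.0 - slo
    observed_failure = 1.0 - sli
    return 1.0 - observed_failure / allowed_failure


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    return mean_delta([i.detected - i.started for i in incidents])


def mean_time_to_restore(incidents: list[Incident]) -> timedelta:
    # Assumption: measured from the start of the fault, not from detection.
    return mean_delta([i.restored - i.started for i in incidents])
```

With a 99.9% SLO, a measured SLI of 99.95% leaves about half of the error budget unspent. Teams differ on whether time to restore starts at the fault or at detection, so the definition should be agreed with service owners before the numbers are compared.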

What you'll bring: As a Director on the Reliability Engineering and Operations team, you will be responsible for ensuring that critical parts of Walmart’s business are prepared for known events and for any contingency. You’ll have the opportunity to manage the complex challenges of microservices and scale that are unique to Walmart’s e-commerce, stores, and omni-channel platform, while using your expertise in coding, algorithms, complex triaging and analysis, and large-scale system design. You’ll excel if you have the enthusiasm to dig deep and a flair for sharp technical communication, organization, and prioritization for uptime and availability. To do so, you will need strong skills in the following areas:

Expert-level AI/ML engineering experience with deep expertise in machine learning algorithms, deep learning frameworks (TensorFlow, PyTorch), and production ML system deployment at scale.
Advanced experience with agentic AI systems including multi-agent frameworks, autonomous decision-making systems, LLM-based agents, and agent orchestration platforms.
Comprehensive Site Reliability Engineering expertise including hands-on experience with Service Management (Incident, Problem & Change Management), Performance and Capacity Engineering for AI/ML systems.
Expert-level cloud engineering experience (Azure, GCP, AWS) with deep knowledge of cloud-native AI/ML services, containerization (Kubernetes, Docker), and serverless architectures.
Deep observability and monitoring expertise with hands-on experience in:
- Distributed tracing (Jaeger, Zipkin, OpenTelemetry) for AI/ML pipelines
- Metrics collection and alerting (Prometheus, Grafana, DataDog) with ML-specific dashboards
- Log aggregation and analysis (ELK stack, Splunk, Fluentd) for model and system monitoring
- APM tools and performance monitoring for AI/ML workloads
- AI-driven anomaly detection and predictive monitoring systems
Platform Engineering experience including:
- Building developer platforms and internal tooling for AI/ML teams
- Infrastructure as Code (Terraform, CloudFormation, Pulumi)
- Service mesh architectures (Istio, Linkerd) for AI/ML services
- API gateway and microservices platform development
- Self-service ML deployment platforms and developer productivity tools
Industry & Domain Experience including:
- Experience in large-scale retail, e-commerce, or high-traffic consumer-facing systems with strict availability and performance requirements (strongly preferred).
- Experience with mission-critical distributed systems serving millions of concurrent users across multiple domains (e-commerce, payments, inventory, supply chain, etc.).
- Experience with enterprise-scale SRE implementations supporting diverse technology stacks and business-critical applications across multiple organizational domains.
- Experience with complex multi-cloud and hybrid cloud environments supporting diverse workloads with varying reliability and performance requirements.
Technical Leadership & Collaboration Skills:
- Technical thought leadership and influence in AI/ML architecture decisions, SRE methodologies, and platform engineering strategies across all Walmart technology domains.
- Strong cross-functional collaboration experience working with diverse engineering teams across E-commerce, Supply Chain, Store Technology, Fintech, Security, and Platform Engineering to deliver enterprise-wide reliability solutions.
- Excellent technical communication skills with ability to articulate complex SRE and AI/ML concepts to diverse engineering audiences and influence technical decisions across multiple organizations.
- Mentorship and knowledge sharing experience, providing technical guidance on SRE best practices, AI/ML for reliability, and platform engineering through code reviews, technical discussions, and documentation.
- High degree of technical ownership and accountability for complex, mission-critical reliability systems with ability to work independently on high-impact projects that span multiple engineering domains.

About Walmart Global Tech

Imagine working in an environment where one line of code can make life easier for hundreds of millions of people. That’s what we do at Walmart Global Tech. We’re a team of software engineers, data scientists, cybersecurity experts, and service professionals within the world’s leading retailer who make an epic impact and are at the forefront of the next retail disruption. People are why we innovate, and people power our innovations. We are people-led and tech-empowered. We train our team in the skillsets of the future and bring in experts like you to help us grow. We have roles for those chasing their first opportunity as well as those looking for the opportunity that will define their career. Here, you can kickstart a great career in tech, gain new skills and experience for virtually every industry, or leverage your expertise to innovate at scale, impact millions, and reimagine the future of retail. Walmart’s culture is a competitive advantage, and it’s fostered by being together. Working together in person allows us to collaborate, align quickly, and innovate with greater speed. We use our campuses to create purposeful connection rooted in deepening understanding and investing in the development of our associates.

Our hubs: Walmart is a global company with offices across the United States and around the world. Our global headquarters is in Bentonville, Arkansas, with primary hubs in the San Francisco Bay area and New York/New Jersey.

Benefits: Beyond our great compensation package, you can receive incentive awards for your performance. Other great perks include 401(k) match, stock purchase plan, paid maternity and parental leave, PTO, multiple health plans, and much more.

Equal Opportunity Employer: Walmart, Inc. is an Equal Opportunity Employer – By Choice. We believe we are best equipped to help our associates, customers, and the communities we serve live better when we really know them. That means understanding, respecting, and valuing unique styles, experiences, identities, ideas, and opinions – while being inclusive of all people.

The above information has been designed to indicate the general nature and level of work performed in the role. It is not designed to contain or be interpreted as a comprehensive inventory of all responsibilities and qualifications required of employees assigned to this job. The full Job Description can be made available as part of the hiring process.

At Walmart, we offer competitive pay as well as performance-based bonus awards and other great benefits for a happier mind, body, and wallet. Health benefits include medical, vision, and dental coverage. Financial benefits include 401(k), stock purchase, and company-paid life insurance. Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.

You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws, where applicable. For information about PTO, see https://one.walmart.com/notices.

Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart.

Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms. For information about benefits and eligibility, see One.Walmart.

The annual salary range for this position is $169,000.00 - $338,000.00. Additional compensation includes annual or quarterly performance bonuses. Additional compensation for certain positions may also include:

Stock

Minimum Qualifications...

Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.

Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 6 years’ experience in software engineering or related area. Option 2: 8 years’ experience in software engineering or related area. 3 years' supervisory experience.

Preferred Qualifications...

Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.

Master’s degree in computer science, computer engineering, computer information systems, software engineering, or related area and 4 years' experience in software engineering or related area

Primary Location...

1345 Crossman Ave, Sunnyvale, CA 94089-1114, United States of America

Walmart and its subsidiaries are committed to maintaining a drug-free workplace and have a no-tolerance policy regarding the use of illegal drugs and alcohol on the job. This policy applies to all employees and aims to create a safe and productive work environment.


Director, Software Engineering
