(USA) Principal, Data Scientist

Walmart · Retail · Sunnyvale, CA

Walmart is hiring a Principal ML Engineer to architect, build, and operate production-grade ML systems for its enterprise and e-commerce platforms. The focus is on intelligent automation, predictive scaling, anomaly detection, and capacity optimization, influencing runtime behavior at massive scale. The role requires strong system design skills, end-to-end ownership, and experience with ML lifecycle management, serving infrastructure, and observability in distributed, high-availability environments.
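To make the anomaly-detection focus concrete, here is a minimal illustrative sketch of the kind of problem involved: flagging outliers in a metric stream with a rolling z-score. All names and thresholds are hypothetical; the posting does not prescribe any particular algorithm.

```python
# Hypothetical sketch: flag anomalies in a metric stream via rolling z-score.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(values, window=10, threshold=3.0):
    """Return indices whose value deviates more than `threshold`
    standard deviations from the trailing window's mean."""
    history = deque(maxlen=window)
    anomalies = []
    for i, v in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            # With sigma == 0 this flags any nonzero deviation.
            if abs(v - mu) > threshold * sigma:
                anomalies.append(i)
        history.append(v)
    return anomalies

# A steady signal with one spike at index 15.
series = [100.0] * 15 + [500.0] + [100.0] * 10
print(detect_anomalies(series))  # → [15]
```

A production version would of course account for seasonality and trend (per the JD's "seasonality-aware forecasting"), but the core idea of comparing live metrics against a learned baseline is the same.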

What you'd actually do

  1. Architect and implement end-to-end ML systems (data pipelines, feature engineering, model training, deployment, and monitoring).
  2. Design scalable, low-latency model serving infrastructure integrated with Kubernetes and cloud-native systems.
  3. Build intelligent automation solutions including predictive autoscaling, anomaly detection, seasonality-aware forecasting, and capacity optimization.
  4. Engineer safe and reliable ML-driven automation that operates in high-availability environments.
  5. Own model lifecycle management, including validation, experiment tracking, model registry, monitoring, drift detection, and rollback strategies.
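The lifecycle-management responsibility in item 5 can be sketched with a toy rollback-capable model registry. This is a hypothetical illustration only; in practice a team in this role would more likely use a platform such as MLflow's model registry rather than hand-rolling one.

```python
# Hypothetical sketch of a rollback-capable model registry (item 5 above).

class ModelRegistry:
    """Track model versions and support promoting / rolling back production."""

    def __init__(self):
        self._versions = {}    # version -> artifact (e.g. a storage URI)
        self._production = []  # promotion history; last entry is live

    def register(self, version, artifact):
        self._versions[version] = artifact

    def promote(self, version):
        if version not in self._versions:
            raise KeyError(f"unknown version {version!r}")
        self._production.append(version)

    def rollback(self):
        """Revert production to the previously promoted version."""
        if len(self._production) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._production.pop()
        return self._production[-1]

    @property
    def live(self):
        return self._production[-1] if self._production else None

registry = ModelRegistry()
registry.register("v1", "s3://models/v1")  # hypothetical artifact URIs
registry.register("v2", "s3://models/v2")
registry.promote("v1")
registry.promote("v2")
registry.rollback()  # e.g. after monitoring detects drift in v2
print(registry.live)  # → v1
```

Keeping a promotion history rather than a single pointer is what makes fast, safe rollback possible when monitoring or drift detection flags a bad release.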

Skills

Required

  • Python
  • Go
  • Java
  • Scikit-learn
  • PyTorch
  • TensorFlow
  • SQL
  • MLflow
  • Ray
  • Kubeflow
  • Airflow
  • Docker
  • Kubernetes
  • AWS
  • GCP
  • Azure
  • distributed systems
  • system design
  • ML pipelines
  • model serving
  • experiment tracking
  • model registry
  • CI/CD for ML
  • feature management
  • automated retraining
  • evaluation frameworks
  • observability
  • time-series analysis

Nice to have

  • predictive autoscaling
  • AIOps
  • demand prediction
  • anomaly detection
  • incident prediction
  • streaming ML
  • online learning
  • agentic systems
  • orchestration
  • tool use
  • agent evaluation
  • agent safety
  • distributed training
  • large-scale data processing

What the JD emphasized

  • production-grade ML systems
  • end-to-end ML systems
  • low-latency model serving infrastructure
  • ML-driven automation
  • model lifecycle management
  • ML system reliability
  • ML system observability
  • ML system performance
  • applied machine learning
  • building and operating ML systems in production
  • owning systems end-to-end
  • distributed, high-availability environments
  • leading technical initiatives
  • driving architectural decisions
  • training, validating, and deploying machine learning models in production
  • building and maintaining end-to-end ML pipelines
  • model serving architectures
  • ML lifecycle platforms
  • experiment tracking
  • model registry
  • CI/CD for ML
  • feature management
  • automated retraining workflows
  • robust evaluation frameworks
  • observability data
  • time-series analysis
  • deploying and operating ML systems on Kubernetes
  • containerization using Docker
  • major cloud platforms
  • cloud-native services
  • distributed systems behavior
  • design ML systems that balance accuracy, latency, reliability, and safety
  • designing fault-tolerant, observable, and scalable ML-driven automation systems
  • cloud-native architecture
  • infrastructure patterns
