(USA) Principal, Data Scientist

Walmart · Retail · Sunnyvale, CA

Walmart is hiring a Principal ML Engineer to architect, build, and operate production-grade ML systems for its enterprise and e-commerce platforms. The focus is on intelligent automation, predictive scaling, anomaly detection, and capacity optimization, influencing runtime behavior at massive scale. The role requires strong system design skills, end-to-end ownership, and experience with ML lifecycle management, serving infrastructure, and observability in distributed, high-availability environments.
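To make the anomaly-detection focus concrete, here is a minimal illustrative sketch of the kind of problem involved: flagging outliers in a metric stream with a rolling z-score. All names and thresholds are hypothetical; the posting does not prescribe any particular algorithm.

```python
# Hypothetical sketch: flag anomalies in a metric stream via rolling z-score.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(values, window=10, threshold=3.0):
    """Return indices whose value deviates more than `threshold`
    standard deviations from the trailing window's mean."""
    history = deque(maxlen=window)
    anomalies = []
    for i, v in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            # With sigma == 0 this flags any nonzero deviation.
            if abs(v - mu) > threshold * sigma:
                anomalies.append(i)
        history.append(v)
    return anomalies

# A steady signal with one spike at index 15.
series = [100.0] * 15 + [500.0] + [100.0] * 10
print(detect_anomalies(series))  # → [15]
```

A production version would of course account for seasonality and trend (per the JD's "seasonality-aware forecasting"), but the core idea of comparing live metrics against a learned baseline is the same.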

What you'd actually do

  1. Architect and implement end-to-end ML systems (data pipelines, feature engineering, model training, deployment, and monitoring).
  2. Design scalable, low-latency model serving infrastructure integrated with Kubernetes and cloud-native systems.
  3. Build intelligent automation solutions including predictive autoscaling, anomaly detection, seasonality-aware forecasting, and capacity optimization.
  4. Engineer safe and reliable ML-driven automation that operates in high-availability environments.
  5. Own model lifecycle management, including validation, experiment tracking, model registry, monitoring, drift detection, and rollback strategies.
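The lifecycle-management responsibility in item 5 can be sketched with a toy rollback-capable model registry. This is a hypothetical illustration only; in practice a team in this role would more likely use a platform such as MLflow's model registry rather than hand-rolling one.

```python
# Hypothetical sketch of a rollback-capable model registry (item 5 above).

class ModelRegistry:
    """Track model versions and support promoting / rolling back production."""

    def __init__(self):
        self._versions = {}    # version -> artifact (e.g. a storage URI)
        self._production = []  # promotion history; last entry is live

    def register(self, version, artifact):
        self._versions[version] = artifact

    def promote(self, version):
        if version not in self._versions:
            raise KeyError(f"unknown version {version!r}")
        self._production.append(version)

    def rollback(self):
        """Revert production to the previously promoted version."""
        if len(self._production) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._production.pop()
        return self._production[-1]

    @property
    def live(self):
        return self._production[-1] if self._production else None

registry = ModelRegistry()
registry.register("v1", "s3://models/v1")  # hypothetical artifact URIs
registry.register("v2", "s3://models/v2")
registry.promote("v1")
registry.promote("v2")
registry.rollback()  # e.g. after monitoring detects drift in v2
print(registry.live)  # → v1
```

Keeping a promotion history rather than a single pointer is what makes fast, safe rollback possible when monitoring or drift detection flags a bad release.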

Skills

Required

  • Python
  • Go
  • Java
  • Scikit-learn
  • PyTorch
  • TensorFlow
  • SQL
  • MLflow
  • Ray
  • Kubeflow
  • Airflow
  • Docker
  • Kubernetes
  • AWS
  • GCP
  • Azure
  • distributed systems
  • system design
  • ML pipelines
  • model serving
  • experiment tracking
  • model registry
  • CI/CD for ML
  • feature management
  • automated retraining
  • evaluation frameworks
  • observability
  • time-series analysis

Nice to have

  • predictive autoscaling
  • AIOps
  • demand prediction
  • anomaly detection
  • incident prediction
  • streaming ML
  • online learning
  • agentic systems
  • orchestration
  • tool use
  • agent evaluation
  • agent safety
  • distributed training
  • large-scale data processing

What the JD emphasized

  • production-grade ML systems
  • end-to-end ML systems
  • low-latency model serving infrastructure
  • ML-driven automation
  • model lifecycle management
  • ML system reliability
  • ML system observability
  • ML system performance
  • applied machine learning
  • building and operating ML systems in production
  • owning systems end-to-end
  • distributed, high-availability environments
  • leading technical initiatives
  • driving architectural decisions
  • training, validating, and deploying machine learning models in production
  • building and maintaining end-to-end ML pipelines
  • model serving architectures
  • ML lifecycle platforms
  • experiment tracking
  • model registry
  • CI/CD for ML
  • feature management
  • automated retraining workflows
  • robust evaluation frameworks
  • observability data
  • time-series analysis
  • deploying and operating ML systems on Kubernetes
  • containerization using Docker
  • major cloud platforms
  • cloud-native services
  • distributed systems behavior
  • design ML systems that balance accuracy, latency, reliability, and safety
  • designing fault-tolerant, observable, and scalable ML-driven automation systems
  • cloud-native architecture
  • infrastructure patterns
