Principal, Data Scientist, Experimentation Sciences

Walmart · Retail · Sunnyvale, CA +1

This role defines and executes the data science roadmap for Walmart's experimentation platform, focusing on AI evaluation, LLM evals, and measurement systems. It combines hands-on technical leadership in statistical frameworks and data pipelines with setting the strategy for AI-native experimentation.

What you'd actually do

  1. Define the multi-year data science roadmap for experimentation reporting, dashboards, and measurement services, identifying the highest-leverage investments in methodology, automation, and self-service.
  2. Lead the design of scalable statistical frameworks for online experiments across product, business, and operational use cases, including guardrails, heterogeneity analysis, sequential decisioning, variance reduction, and quasi-experimental methods when randomized tests are not feasible.
  3. Partner with data engineering to design robust SQL and PySpark data models, pipelines, and observability standards that improve correctness, speed, and reusability of experimentation data assets.
  4. Establish and govern canonical experiment metrics, scorecards, and reporting standards across channels, regions, and surfaces.
  5. Define the strategy for AI-native experimentation and evaluation, including LLM eval frameworks, prompt evaluation, golden datasets, rubric design, human-in-the-loop review, LLM-as-a-judge calibration, and ongoing regression monitoring.

Skills

Required

  • Deep expertise in experimentation, causal inference, and statistical decision-making
  • Expert-level SQL and PySpark
  • Strong Python skills
  • Hands-on experience working with high-volume, distributed data pipelines in production environments
  • Experience building or materially improving experimentation platforms, measurement systems, or internal science tooling
  • Strong understanding of metric design, guardrails, data quality, and observability for experimentation systems
  • Working knowledge of modern AI evaluation methods, including LLM evals, prompt experimentation, model or prompt regression testing, and hybrid human-plus-automated quality frameworks

Nice to have

  • Experience in e-commerce, retail, marketplace, logistics, last-mile delivery, or other high-scale consumer platforms with complex operational feedback loops
  • Self-starter mindset, with the ability to work through ambiguity, define a roadmap, and independently drive ideas from concept to execution

What the JD emphasized

  • AI evaluation
  • LLM evals
  • experimentation platform
  • measurement systems
  • statistical tooling
  • AI-native experimentation
