Senior Data Scientist

Walmart · Retail · Sunnyvale, CA

The role focuses on designing and building multi-modal evaluation frameworks and AI-as-a-judge systems, primarily for generative AI content in Extended Reality experiences. It involves fine-tuning Vision-Language Models and Reward Models, leading experimentation with causal inference, orchestrating RLHF strategies, and collaborating on model launch criteria. The role also involves staying current with research in perceptual quality and implementing techniques to detect issues in generated content.

What you'd actually do

  1. Design Multi-Modal Evaluation Frameworks
  2. Build "AI-as-a-Judge" Systems
  3. Lead Experimentation & Causal Inference
  4. Orchestrate Human-in-the-Loop (RLHF) Strategy
  5. Strategic Cross-Functional Partnership
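In practice, the "AI-as-a-judge" and human-in-the-loop work above often reduces to aggregating pairwise preference verdicts (from LLM judges or human raters) into a model ranking. A minimal sketch using the standard Bradley-Terry MM update; the function name and the `wins` matrix layout are illustrative assumptions, not anything specified in the posting:

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i beat model j in judge verdicts.
    Returns strengths normalized to sum to 1 (higher = preferred).
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()  # total wins for model i
            # MM denominator: comparisons against each opponent j
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            if den > 0:
                p[i] = num / den
        p /= p.sum()
    return p
```

For example, a model that wins 9 of 10 judged comparisons against a baseline converges to roughly a 0.9 / 0.1 strength split.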

Skills

Required

  • Master's degree in Computer Science with a specialization in Computer Vision, Machine Learning, or equivalent practical experience
  • 3+ years of experience with machine learning algorithms and tools
  • Strong foundation in statistical analysis, experimental design (A/B testing), and causal inference
  • Hands-on experience with Generative AI evaluation (e.g., using LLMs/VLMs for evaluation, computing FID/IS/CLIP scores, or designing perceptual studies)
  • Proficiency in Python and deep learning frameworks (PyTorch, TensorFlow) for analyzing model outputs and building evaluation pipelines
  • Experience processing unstructured data (image, video, 3D meshes) for analytical purposes
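Of the required skills, CLIP score is the most mechanical to illustrate: it is a scaled, clipped cosine similarity between an image embedding and a caption embedding. A minimal sketch assuming the embeddings were already extracted with a CLIP model (the weight w = 2.5 follows the common CLIPScore convention; the function name is illustrative):

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray,
               w: float = 2.5) -> float:
    """CLIPScore-style metric: w * max(cos(image, text), 0).

    Assumes both embeddings come from the same CLIP model; higher
    means the caption better matches the image.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_emb / np.linalg.norm(text_emb)
    return float(w * max(np.dot(img, txt), 0.0))
```

Identical embeddings score the maximum (2.5 with the default weight); orthogonal ones score 0.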

Nice to have

  • PhD in Machine Learning, Computer Science, or a related technical field
  • Experience designing Reward Models for RLHF pipelines
  • Deep understanding of 3D geometry processing (meshes, point clouds) and how to mathematically quantify "3D quality" (e.g., mesh manifoldness, texture resolution)
  • Experience with crowdsourcing platforms and designing instructions for subjective human evaluation
  • Publication record or practical experience in Computational Photography, Computer Vision Quality Assessment, or Psychophysics
  • Experience with Big Data tools (Spark, SQL, BigQuery) for analyzing large-scale experiment results
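Crowdsourced subjective evaluation like the kind listed above usually starts with an inter-rater agreement check before labels feed a reward model. A minimal Cohen's kappa sketch for two raters (the function name is an illustrative assumption):

```python
from collections import Counter

def cohens_kappa(r1: list, r2: list) -> float:
    """Cohen's kappa for two raters labeling the same items.

    Corrects observed agreement (po) for chance agreement (pe);
    1.0 = perfect agreement, 0.0 = no better than chance.
    """
    assert len(r1) == len(r2) and r1
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[lbl] * c2[lbl] for lbl in set(r1) | set(r2)) / (n * n)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```

Low kappa on a pilot batch is a signal to tighten the rating instructions before scaling the study.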

What the JD emphasized

  • non-deterministic outputs
  • automated evaluators
  • downstream business impact
  • human evaluation
  • model launch criteria
  • hallucinations, artifacts, or bias

Other signals

  • designing evaluation metrics
  • building AI-as-a-judge systems
  • leading experimentation
  • orchestrating RLHF strategy
  • establishing model launch criteria