Principal, Data Scientist

Walmart · Retail · Bentonville, AR

This Principal Data Scientist leads the vision, architecture, and implementation of Walmart's next-generation AI infrastructure, GenAI platforms, and engineering capabilities. The role drives end-to-end design of high-performance AI systems: model training and serving infrastructure, distributed inference pipelines, data and feature platforms, retrieval/grounding services, and automation frameworks for AI agents and real-time decisioning at global scale. Responsibilities span architecting distributed training, vector retrieval systems, LLM inference optimization, and scalable agent backends; developing GenAI platform components such as RAG pipelines, vector search, and model serving; optimizing AI workloads; leading operational excellence; and defining engineering standards for AI observability and responsible AI.

What you'd actually do

  1. Serve as Walmart’s principal architect for enterprise AI infrastructure, defining patterns for distributed training, vector retrieval systems, LLM inference optimization, and scalable agent backends.
  2. Lead design of GenAI platform components (see the RAG sketch after this list), including:
       ◦ Orchestration and lifecycle management for LLMs and multimodal models
       ◦ Retrieval-augmented generation (RAG) pipelines
       ◦ Vector search and embedding pipelines
       ◦ Model serving, autoscaling, and high-availability inference systems
  3. Build and evolve core GenAI platform capabilities used across eCommerce Analytics (see the tool-routing sketch after this list):
       ◦ Agent orchestration and backend frameworks
       ◦ Tool/action routing services
       ◦ System prompts, grounding, and context management
       ◦ Observability, evaluation, safety, and feedback loops
  4. Architect and optimize high-scale data and inference pipelines running on Dataproc, Kubernetes, Airflow, and cloud-native distributed systems (see the Airflow sketch after this list).
  5. Define engineering standards for AI observability, versioning, deployment, model telemetry, and responsible AI integrations (see the telemetry sketch after this list).
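
To make item 2 concrete, here is a minimal, self-contained sketch of a RAG pipeline: embed a query, retrieve the nearest documents from an in-memory vector index by cosine similarity, and assemble a grounded prompt for the model. `embed`, `VectorIndex`, and the stubbed `call_llm` are hypothetical placeholders, not Walmart's actual stack; a production system would swap in a real embedding model, a managed vector database, and a served LLM endpoint.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in for a real embedding model: hash tokens into a unit vector."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class VectorIndex:
    """Minimal in-memory vector index; a real system would use a vector database."""
    def __init__(self):
        self.docs: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, doc: str) -> None:
        self.docs.append(doc)
        self.vectors.append(embed(doc))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        scores = [float(v @ q) for v in self.vectors]  # cosine similarity (unit vectors)
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [self.docs[i] for i in top]

def call_llm(prompt: str) -> str:
    """Stub for a served LLM; replace with a real inference endpoint."""
    return f"[LLM answer grounded in prompt of {len(prompt)} chars]"

def rag_answer(index: VectorIndex, question: str) -> str:
    # Ground the model by injecting retrieved context into the prompt.
    context = "\n".join(index.search(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

if __name__ == "__main__":
    idx = VectorIndex()
    idx.add("Store 42 reorders milk every Tuesday.")
    idx.add("Distribution center DC-7 serves the Bentonville region.")
    print(rag_answer(idx, "When does store 42 reorder milk?"))
```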
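
Item 3's tool/action routing can be sketched as a dispatch registry: handlers register under a tool name, and the agent backend routes a model-requested action to the matching handler. The tool names (`inventory.lookup`, `price.check`) and their handlers are invented for illustration.

```python
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Decorator that registers a handler in the routing table."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register

@tool("inventory.lookup")
def inventory_lookup(sku: str) -> str:
    return f"SKU {sku}: 128 units on hand"  # stubbed data source

@tool("price.check")
def price_check(sku: str) -> str:
    return f"SKU {sku}: $4.98"  # stubbed data source

def route(action: dict) -> str:
    """Dispatch a model-requested action {'tool': ..., 'args': {...}} to its handler."""
    handler = TOOLS.get(action["tool"])
    if handler is None:
        raise ValueError(f"Unknown tool: {action['tool']}")
    return handler(**action["args"])

if __name__ == "__main__":
    print(route({"tool": "inventory.lookup", "args": {"sku": "000123"}}))
```

Keeping the registry declarative makes it straightforward to audit which actions an agent is allowed to take.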
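
For item 4, a sketch of how a batch inference pipeline might be expressed as an Airflow DAG (assuming Airflow 2.4+ for the `schedule` parameter). The DAG id and task bodies are placeholders, not a real Dataproc or model-serving integration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    print("pull features from the feature platform")  # placeholder

def run_batch_inference():
    print("score the batch against the served model")  # placeholder

def publish_scores():
    print("write scores to the downstream decisioning store")  # placeholder

with DAG(
    dag_id="batch_inference_pipeline",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    infer = PythonOperator(task_id="run_batch_inference", python_callable=run_batch_inference)
    publish = PythonOperator(task_id="publish_scores", python_callable=publish_scores)

    extract >> infer >> publish  # linear dependency chain
```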
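
Finally, for item 5, one way to standardize model telemetry is a decorator that emits one structured log record per inference call (model name, version, latency, success). The field names and the `demand_forecaster` model are illustrative, not a mandated schema.

```python
import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("model_telemetry")

def telemetry(model_name: str, model_version: str):
    """Wrap an inference function and emit one structured telemetry record per call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            ok = True
            try:
                return fn(*args, **kwargs)
            except Exception:
                ok = False
                raise
            finally:
                log.info(json.dumps({
                    "model": model_name,
                    "version": model_version,
                    "latency_ms": round((time.perf_counter() - start) * 1000, 2),
                    "success": ok,
                }))
        return inner
    return wrap

@telemetry(model_name="demand_forecaster", model_version="1.3.0")  # hypothetical model
def predict(features: list[float]) -> float:
    return sum(features) / len(features)  # stand-in for real inference

if __name__ == "__main__":
    print(predict([0.2, 0.4, 0.9]))
```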

Skills

Required

  • Python
  • SQL
  • AI Infrastructure & Distributed Systems Expertise
  • Large-scale compute, networking, cluster orchestration, cloud infrastructure, and performance engineering
  • LLM inference systems
  • Embedding & vector search pipelines
  • Model serving & batch/streaming inference
  • Real-time AI agent backends
  • Vertex AI
  • Google Cloud AI infrastructure
  • BigQuery
  • Graph DB technologies
  • ML engineering
  • Model evaluation frameworks
  • CI/CD systems for AI workloads
  • Cloud-native observability frameworks

Nice to have

  • Scala
  • Java
  • R
  • Bash
  • Microsoft Fabric
  • Microsoft Foundry AI services

What the JD emphasized

  • enterprise AI infrastructure
  • GenAI platforms
  • AI agents
  • real-time decisioning
  • model training and serving infrastructure
  • distributed inference pipelines
  • retrieval/grounding services
  • automation frameworks
  • LLM inference optimization
  • scalable agent backends
  • RAG pipelines
  • vector search and embedding pipelines
  • model serving, autoscaling, and high-availability inference systems
  • agent orchestration and backend frameworks
  • tool/action routing services
  • system prompts, grounding, and context management
  • observability, evaluation, safety, and feedback loops
  • high-scale data and inference pipelines
  • performance of AI workloads
  • distributed training optimization
  • token-level and batch-level inference acceleration
  • memory and compute efficiency strategies for LLMs
  • operational excellence across AI systems
  • engineering standards for AI observability, versioning, deployment, model telemetry, and responsible AI integrations
