Senior Software Engineer

Walmart · Retail · Sunnyvale, CA

Senior Software Engineer on the Search PTE-DevOps team at Walmart, supporting systems and services for high availability and reliability. The role embraces AI-augmented workflows, supports AI model deployments, and works with GenAI and LLMOps pipelines. Key responsibilities include building and operating tools for developing, scaling, and monitoring technology, triaging technical issues, and ensuring five 9's reliability at the intersection of AI and platform engineering. The engineer will manage QE & Release Automation frameworks and Kubernetes-based containerization (including GPU workloads), investigate incidents, build monitoring for applications and AI model performance, maintain CI/CD and MLOps pipelines, integrate AI coding assistants, design AI-powered observability, and collaborate with AI/ML teams on LLM-based features, prompt pipelines, and vector search infrastructure. The role also involves driving projects, analyzing and building frameworks with AI tools, providing architectural guidance, and performing quality assurance for AI-powered features and inference pipelines.

What you'd actually do

  1. Build, manage, and evolve QE & Release Automation frameworks, incorporating AI-assisted test generation and self-healing test capabilities
  2. Build and support Kubernetes-based containerization in production, including GPU-backed workloads for AI/ML inference
  3. Independently lead the investigation and resolution of high-impact search system and AI service incidents
  4. Build, manage, and support comprehensive monitoring and observability for applications and AI model performance (drift, latency, accuracy)
  5. Maintain and improve automation pipelines supporting application build, release, and AI model deployment cycles (CI/CD + MLOps/LLMOps)

Skills

Required

  • Kubernetes
  • Python
  • Go
  • Java
  • Shell scripting
  • CI/CD platforms
  • GitOps workflows
  • AI/ML workflows
  • model serving
  • inference optimization
  • LLM deployment pipelines
  • observability stacks
  • OpenTelemetry
  • distributed tracing
  • log aggregation
  • AI-assisted anomaly detection

Nice to have

  • prompt engineering
  • RAG pipelines
  • vector databases
  • LLM evaluation frameworks
  • AI coding assistants
  • AI-augmented DevOps tooling
  • WCNP
  • GCP
  • Azure
  • eBPF-based observability tools
  • advanced networking concepts
  • GPU infrastructure management
  • MLflow
  • Kubeflow
  • Ray
  • MLOps platforms

What the JD emphasized

  • five 9’s reliability
  • AI and platform engineering
  • AI model release cycles
  • AI-assisted test generation
  • GPU-backed workloads for AI/ML inference
  • AI service incidents
  • AI model performance
  • AI model deployment cycles
  • AI coding assistants
  • GenAI tooling
  • AI-powered observability solutions
  • LLM-based features
  • prompt pipeline management
  • vector search infrastructure
  • AI/ML platform initiatives
  • AI tools
  • AI integration patterns
  • AI model validation
  • LLM/AI inference pipelines
  • AI/ML tooling
  • LLMOps
  • GenAI platforms
  • RAG pipelines
  • vector databases
  • LLM evaluation frameworks
  • AI-augmented DevOps tooling
  • GPU infrastructure management for AI workloads
  • MLOps platforms

Other signals

  • AI-augmented workflows
  • AI model deployments
  • GenAI and LLMOps pipelines
  • AI and platform engineering
  • AI model release cycles
  • AI-assisted test generation
  • GPU-backed workloads for AI/ML inference
  • AI service incidents
  • AI model performance
  • AI model deployment cycles
  • AI coding assistants
  • GenAI tooling
  • AI-powered observability solutions
  • AI/ML teams
  • LLM-based features
  • prompt pipeline management
  • vector search infrastructure
  • AI/ML platform initiatives
  • AI tools
  • AI integration patterns
  • AI-driven requirements
  • AI model validation
  • AI-powered features
  • LLM/AI inference pipelines
  • AI/ML tooling
  • LLMOps
  • GenAI platforms
  • RAG pipelines
  • vector databases
  • LLM evaluation frameworks
  • AI-augmented DevOps tooling
  • GPU infrastructure management for AI workloads
  • MLOps platforms