(USA) Senior, Software Engineer

Walmart · Retail · Sunnyvale, CA

Senior Software Engineer role in DevOps & AI Platform at Walmart, focused on keeping systems and services highly available and reliable. The role embraces AI-augmented workflows and involves working with developers and AI/ML engineers to support application features, AI model deployments, and service launches. Key responsibilities include designing, building, and operating tools for developing, scaling, and monitoring GenAI and LLMOps pipelines; triaging technical issues; and ensuring five 9’s (99.999%) reliability at the intersection of AI and platform engineering. The role requires expertise in CI/CD, containerized infrastructure, and AI-assisted development practices, and it is critical to the search application and AI model release cycles.

What you'd actually do

  1. Build, manage, and evolve QE & Release Automation frameworks, incorporating AI-assisted test generation and self-healing test capabilities
  2. Build and support Kubernetes-based containerization in production, including GPU-backed workloads for AI/ML inference
  3. Independently lead the investigation and resolution of high-impact search system and AI service incidents
  4. Build, manage, and support comprehensive monitoring and observability for applications and AI model performance (drift, latency, accuracy)
  5. Maintain and improve automation pipelines supporting application build, release, and AI model deployment cycles (CI/CD + MLOps/LLMOps)

Skills

Required

  • Kubernetes (including multi-cluster and GPU-node management)
  • Python
  • Go
  • Java
  • Shell scripting
  • REST API frameworks
  • gRPC API frameworks
  • Concord
  • GitHub Actions
  • Looper
  • GitOps workflows
  • ArgoCD
  • Flux
  • AI/ML workflows
  • model serving
  • inference optimization
  • LLM deployment pipelines
  • OpenTelemetry
  • distributed tracing
  • log aggregation
  • Splunk
  • OpenObserve
  • AI-assisted anomaly detection

Nice to have

  • Prompt engineering
  • RAG pipelines
  • vector databases
  • Pinecone
  • Weaviate
  • Elasticsearch KNN
  • LLM evaluation frameworks
  • Wibey
  • GitHub Copilot
  • WCNP (Walmart Cloud Native Platform)
  • GCP
  • Azure
  • eBPF-based observability tools
  • Cilium
  • Pixie
  • advanced networking concepts
  • VIP
  • TCP
  • Envoy/Istio service mesh
  • CUDA
  • NVIDIA device plugins for Kubernetes
  • MLflow
  • Kubeflow
  • Ray
  • MLOps platforms
  • experiment tracking
  • model lifecycle management

What the JD emphasized

  • five 9’s reliability
  • AI and platform engineering
  • AI/ML tooling
  • LLMOps
  • GenAI platforms
  • AI/ML workflows
  • model serving
  • inference optimization
  • LLM deployment pipelines
  • observability stacks
  • AI-assisted anomaly detection
  • RAG pipelines
  • vector databases
  • LLM evaluation frameworks
  • AI coding assistants
  • GPU infrastructure management for AI workloads

Other signals

  • AI-augmented workflows
  • AI model deployments
  • GenAI and LLMOps pipelines
  • five 9’s reliability
  • AI and platform engineering
  • AI-assisted test generation
  • GPU-backed workloads for AI/ML inference
  • AI service incidents
  • AI model performance
  • AI model deployment cycles
  • AI coding assistants
  • AI-powered observability solutions
  • operationalize LLM-based features
  • vector search infrastructure
  • AI/ML platform initiatives
  • AI tools
  • AI integration patterns
  • AI model validation
  • AI-powered features
  • LLM/AI inference pipelines