Director, Software Engineering

Walmart · Retail · Sunnyvale, CA

Director of Software Engineering for Site Reliability Engineering, focusing on building AI-powered, self-healing, and autonomous systems for Walmart's infrastructure and platforms. The role involves designing and building resilient agentic platforms, driving the transformation of SRE practices using AI/ML, and ensuring the reliability, scalability, and availability of critical systems. Requires expertise in AI/ML engineering, agentic AI systems, SRE, cloud engineering, observability, and platform engineering.

What you'd actually do

  1. Design, write, and build tools to improve the reliability, latency, availability, and scalability of the Walmart Tech stack, including: (a) establishing reliability and availability, starting with metrics and measurements; (b) enabling scaling by providing tools, developing training, and/or augmenting processes; (c) building tools and automation to prevent recurrence of problems in mission-critical products and services; (d) augmenting existing instrumentation to build a cohesive picture of the characteristics of our systems, with special attention to points of failure.
  2. Drive the team to build and scale fault-tolerant systems and services in our hybrid cloud infrastructure.
  3. Partner with leadership across the organization to establish strategic plans and objectives that improve mean time to detect and mean time to restore.
  4. Collaborate with service owners to define SLOs and build SLIs to ensure systems meet their SLAs.

Skills

Required

  • AI/ML engineering
  • machine learning algorithms
  • deep learning frameworks (TensorFlow, PyTorch)
  • production ML system deployment at scale
  • agentic AI systems
  • multi-agent frameworks
  • autonomous decision-making systems
  • LLM-based agents
  • agent orchestration platforms
  • Site Reliability Engineering (SRE)
  • Service Management (Incident, Problem & Change Management)
  • Performance and Capacity Engineering for AI/ML systems
  • cloud engineering (Azure, GCP, AWS)
  • cloud-native AI/ML services
  • containerization (Kubernetes, Docker)
  • serverless architectures
  • distributed tracing (Jaeger, Zipkin, OpenTelemetry)
  • metrics collection and alerting (Prometheus, Grafana, Datadog)
  • log aggregation and analysis (ELK stack, Splunk, Fluentd)
  • APM tools
  • AI-driven anomaly detection
  • predictive monitoring systems
  • developer platforms and internal tooling for AI/ML teams
  • Infrastructure as Code (Terraform, CloudFormation, Pulumi)
  • service mesh architectures (Istio, Linkerd)
  • API gateway and microservices platform development
  • self-service ML deployment platforms
  • developer productivity tools
  • large-scale retail, e-commerce, or high-traffic consumer-facing systems

Nice to have

  • AI-powered, self-healing, and autonomous systems
  • intelligent capacity management
  • predictive performance optimization
  • ML-specific dashboards
  • model and system monitoring
  • performance monitoring for AI/ML workloads

What the JD emphasized

  • Expert-level AI/ML engineering experience
  • Advanced experience with agentic AI systems
  • Comprehensive Site Reliability Engineering expertise
  • Expert-level cloud engineering experience
  • Deep observability and monitoring expertise
  • Platform Engineering experience

Other signals

  • AI-powered, self-healing, and autonomous systems
  • agentic platforms
  • intelligent capacity management
  • predictive performance optimization
  • AI/ML engineering experience
  • agentic AI systems
  • LLM-based agents
  • agent orchestration platforms
  • AI/ML systems deployment at scale
  • AI/ML services
  • AI-driven anomaly detection
  • Platform Engineering experience ... Building developer platforms and internal tooling for AI/ML teams