Staff Software Engineer

Walmart · Retail · Sunnyvale, CA

Staff Software Engineer (Machine Learning Engineering) at Walmart, focused on designing and building large-scale, production-grade ML systems for Driver Search and Matching. The role covers end-to-end ML system architecture, model serving, real-time decisioning, scalability, reliability, and observability, with a focus on serving millions of real-time requests per second.

What you'd actually do

  1. Architect and build scalable, low-latency ML systems that handle millions of real-time driver search and matching requests per second.
  2. Lead the end-to-end productionization of ML models, including serving, feature pipelines, and online inference systems.
  3. Design and optimize distributed systems for high-throughput, real-time decision-making.
  4. Collaborate with Applied Scientists to translate ML models (e.g., ranking, optimization, RL) into production-ready systems.
  5. Own system reliability, monitoring, and observability, ensuring high availability and performance at scale.

Skills

Required

  • Ph.D. or Master's degree in Computer Science or a related field.
  • 5+ years of experience in software engineering, with a strong focus on backend systems and/or ML engineering.
  • Proven experience building and scaling real-time, distributed systems in production environments.
  • Strong experience with ML production systems, including model serving, feature stores, and inference pipelines.
  • Proficiency in programming languages such as Java, Python, or C++.
  • Experience with large-scale data and streaming systems (e.g., Kafka, Spark, Flink) and cloud platforms (GCP/Azure).
  • Deep understanding of system design, scalability, and performance optimization.

Nice to have

  • Experience with ranking systems, recommendation systems, or marketplace optimization.

What the JD emphasized

  • large-scale, production-grade ML systems
  • Driver Search and Matching
  • end-to-end ML system architecture
  • model serving
  • real-time decisioning
  • system scalability
  • reliability
  • observability
  • millions of real-time driver search and matching requests per second
  • productionization of ML models
  • online inference systems
  • high-throughput, real-time decision-making
  • production-ready systems
  • system reliability, monitoring, and observability
  • high availability and performance at scale

Other signals

  • production ML systems
  • real-time decisioning
  • scalability
  • reliability
  • observability