Staff, Software Engineer

Walmart · Retail · Sunnyvale, CA

Walmart is seeking a Staff Software Engineer to design and build enterprise-scale Marketplace platforms supporting Seller Risk. The role is backend-heavy, with strong expectations for Java-based distributed systems, and it also carries technical leadership for React-based front-end applications. You will operate as a senior technical leader, driving architecture, system design, and engineering excellence across the full stack. Responsibilities include designing and building highly scalable backend microservices using Java and Spring Boot, architecting and implementing real-time event-driven systems using Apache Kafka, and developing and optimizing large-scale batch and streaming data pipelines using Apache Spark. The role requires strong ownership, deep systems thinking, and the ability to design for high throughput, low latency, and extreme reliability.

What you'd actually do

  1. Design and build highly scalable backend microservices using Java and Spring Boot.
  2. Architect and implement real-time event-driven systems using Apache Kafka.
  3. Develop and optimize large-scale batch and streaming data pipelines using Apache Spark.
  4. Drive architecture decisions around scalability, resiliency, observability, and cost efficiency.
  5. Lead system design reviews and define engineering best practices for distributed systems.

Skills

Required

  • 12+ years of experience in backend and distributed systems engineering.
  • Strong hands-on experience in Java and Spring Boot for building production-grade microservices.
  • Deep expertise in Apache Kafka: topic design and partitioning; consumer group scaling and offset management; delivery semantics (at-least-once / exactly-once); stream processing patterns and performance tuning.
  • Strong hands-on experience with Apache Spark: batch and Structured Streaming workloads; job optimization (shuffle tuning, memory tuning, skew handling); working with large-scale datasets.
  • Proven experience building systems operating at large scale (millions–billions of events / high TPS platforms).
  • Experience designing event-driven microservices architectures.
  • Strong understanding of distributed systems fundamentals: fault tolerance, back-pressure, idempotency, and consistency trade-offs.
  • Experience with cloud-native deployments (Kubernetes, Docker, AWS/GCP/Azure).
  • Experience with NoSQL / analytical data stores such as Cassandra, BigQuery, HBase, or similar.
  • Strong production debugging and performance tuning skills.

Nice to have

  • Direct experience building or deeply customizing platforms like Temporal.io, Cadence, Apache Airflow, or Argo Workflows.
  • Distributed state management and durable execution:
      • Deep state knowledge: Experience managing the state of long-running processes that must survive infrastructure failures, network partitions, and deployments.
      • Event sourcing and CQRS: Familiarity with using event-sourcing patterns to rebuild the state of a workflow by replaying history.
      • Transactions: Understanding of the Saga pattern for managing distributed transactions and implementing compensations (rollbacks) across microservices.
  • Fault tolerance and high availability:
      • Idempotency mastery: Expertise in designing systems where tasks can be retried any number of times without duplicating side effects, a critical requirement for any orchestration engine.
      • Advanced retry policies: Knowledge of jitter, exponential backoff, and circuit breakers to prevent "thundering herd" problems when a downstream service fails.
      • Rate limiting and quotas: Experience building multi-tenant throttling mechanisms to ensure one massive workflow doesn't starve others of resources.
  • DSL Design: Experience designing Domain-Specific Languages (YAML, JSON, or Python-based) that allow users to define complex logic simply.
  • SDK Development: Ability to build client-side libraries that abstract away the complexity of the underlying orchestration engine for other developers.
  • Message Brokers: Professional experience with Kafka, Pulsar, or RabbitMQ specifically used as a task distribution layer.
  • Priority Queuing: Implementing logic to handle "hot" tasks vs. background tasks efficiently.
  • Hands-on experience with existing orchestrators such as Temporal.io, Cadence, Apache Airflow, Argo Workflows, or AWS Step Functions.
  • An understanding of why these tools succeed (or fail) in specific use cases.
  • Experience in retail, supply chain, pricing, ads, or e-commerce platforms.
  • Exposure to real-time analytics, recommendation engines, or fraud detection systems.

What the JD emphasized

  • Java and Spring Boot
  • Apache Kafka
  • Apache Spark
  • high throughput, low latency, and extreme reliability
  • large scale (millions–billions of events / high TPS platforms)
  • event-driven microservices architectures
  • distributed systems fundamentals
  • cloud-native deployments (Kubernetes, Docker, AWS/GCP/Azure)
  • NoSQL / analytical data stores