Principal Software Engineer – Conversational AI

Walmart · Retail · Bentonville, AR

Walmart's Cortex Team is seeking a Principal Software Engineer to build and evolve its core AI conversational platform. This role involves designing and implementing NLU services, orchestrating model-serving microservices, optimizing serving for latency and cost, and potentially working on prompt engineering and agentic systems. The position requires strong software engineering fundamentals, experience with large-scale distributed systems, and a focus on scalability, performance, and cost trade-offs.

What you'd actually do

  1. Design, build, improve, and evolve our capabilities in at least some of the following areas: a service-oriented architecture that exposes our NLU capabilities at scale and enables increasingly sophisticated model orchestration.
  2. Design and build the primitives to efficiently orchestrate model-serving microservices, taking their dependencies into account and improving the combined latency and robustness of those microservices (e.g., fan a single request out to N services in parallel and reply with whichever gives the fastest answer).
  3. Bake in functionality that can drive improved machine-learning modeling and experimental design, such as A/B testing.
  4. Model serving and operations.
  5. Drive principled, scientific load-testing efforts to clearly identify the trade-offs at hand, and tune/optimize the model-serving stack.
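The parallel fan-out described in item 2 can be sketched roughly as below. This is a minimal illustration, not Walmart's actual stack: the service functions, names, and call signatures are hypothetical stand-ins.

```python
import asyncio

async def fan_out_fastest(request, services):
    # Fan a single request out to N model-serving services in parallel,
    # return whichever response arrives first, and cancel the rest.
    # (Hypothetical sketch -- real services would be RPC/HTTP clients.)
    tasks = [asyncio.create_task(svc(request)) for svc in services]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

# Toy stand-ins for two model-serving microservices with different latencies.
async def slow_service(req):
    await asyncio.sleep(0.2)
    return f"slow:{req}"

async def fast_service(req):
    await asyncio.sleep(0.01)
    return f"fast:{req}"

result = asyncio.run(fan_out_fastest("hello", [slow_service, fast_service]))
print(result)  # fast:hello
```

In practice this pattern trades extra compute (every fanned-out service does work) for lower tail latency, which is exactly the latency/cost trade-off the role calls out.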

Skills

Required

  • 8+ years of experience in software engineering or a related area
  • Solid data skills
  • Sound computer-science fundamentals
  • Strong programming experience
  • Deep hands-on technical expertise in full-stack development
  • Programming experience with at least one modern language with an efficient runtime, such as Scala, Java, C++, or C#
  • Experience with at least one relational database technology such as MySQL, PostgreSQL, Oracle, or MS SQL
  • Some level of fluency in Python
  • Understanding of the challenges of distributed data processing at scale
  • Ability to deal well with ambiguous/undefined problems
  • Ability to think abstractly
  • Ability to take a project from scoping requirements through actual launch
  • A continuous drive to explore, improve, enhance, automate, and optimize systems and tools
  • Capacity to apply scientific analysis and mathematical modeling techniques
  • Excellent oral and written communication skills
  • Bachelor's degree or certification in Computer Science, Engineering, Mathematics, or another related field

Nice to have

  • Large scale distributed systems experience, including scalability and fault tolerance
  • Experience taking a leading role in building complex data-driven software systems successfully delivered to customers
  • Relentless focus on scalability, latency, performance, robustness, and cost trade-offs, especially those present in highly virtualized, elastic, cloud-based environments
  • Exposure to cloud infrastructure such as OpenStack, Azure, GCP, or AWS, as well as infrastructure-management tech (Docker, Kubernetes)
  • Experience building/operating highly available systems of data extraction, ingestion, and massively parallel processing for large data sets
  • In particular, experience building large-scale data pipelines using big-data technologies (e.g. Spark / Kafka / Cassandra / Hadoop / Hive / BigQuery / Presto / Airflow)
  • Hands-on expertise in many disparate technologies, typically ranging from front-end user interfaces through to back-end systems and all points in between
  • Familiarity with Machine Learning

What the JD emphasized

  • core AI conversational platform
  • personal assistants
  • multi-modal experiences
  • Natural Language Understanding (NLU) services
  • orchestration
  • model-serving microservices
  • scalability and availability
  • model serving latency
  • operational costs
  • model serving
  • load-testing efforts
  • prompt engineering and agentic systems
  • reproducible workflow and models
  • continuous deployment
  • resource management capabilities
  • diagnostics for quality control
  • labeling tools
  • mission critical product
  • large scale distributed systems experience
  • scalability and fault tolerance
  • building complex data-driven software systems
  • scalability, latency, performance robustness, and cost trade-offs
  • highly virtualized, elastic, cloud-based environments
  • building/operating highly available systems
  • large scale data pipelines

Other signals

  • building and designing the next generation of Natural Language Understanding (NLU) services
  • design and build the primitives to efficiently orchestrate model-serving microservices
  • bake-in functionality which can drive improved machine learning modeling and experimental design, such as A/B testing
  • Model serving and operations
  • drive principled and scientific load-testing efforts
  • prompt engineering and agentic systems