(USA) Principal, Software Engineer

Walmart · Retail · Bentonville, AR +2

Principal Software Engineer for the Colony Platform within Walmart's AI & Data organization. The role focuses on building an agentic AI framework that lets associates develop AI-based solutions quickly, safely, and cost-effectively. The engineer will serve as a senior technical authority and hands-on architect, shaping system architecture, influencing engineering standards, and driving platform strategy for AI-powered experiences at enterprise scale.

What you'd actually do

  1. Own high-impact architectural decisions
  2. Drive scalable, resilient system design
  3. Prototype and productionize advanced AI-enabled capabilities
  4. Mentor senior engineers and act as a force multiplier
  5. Balance long-term platform sustainability with near-term business outcomes

Skills

Required

  • 12+ years of experience building highly available, distributed systems
  • Proven track record delivering complex, enterprise-scale software systems from inception to production
  • Strong proficiency in Python (building libraries/services/tools), including packaging/dependencies, logging, and performance troubleshooting
  • Working knowledge of OAuth2/OIDC authentication and scope/permission models
  • Familiarity with schema/contract frameworks (JSON Schema, OpenAPI, Pydantic, protobuf) and backward-compatible tool evolution
  • Experience with observability: structured logging, metrics, traces, and debugging distributed flows across client + gateway
  • Experience working with AI/ML ecosystems in production environments
  • Strong architectural judgment and ability to ev

What the JD emphasized

  • agentic AI framework
  • orchestrate complex, AI-driven workflows
  • multi-tenant environment
  • design agent behavior as visual workflows
  • performance, reliability, safety of execution
  • ship a great end-user experience
  • distributed systems
  • platform engineering
  • AI-enabled architectures
  • enterprise-wide
  • AI pipelines
  • platform services
  • scalability, reliability, observability, and cost efficiency
  • enterprise data and AI foundations
  • long-term platform integrity
  • short-term delivery needs
  • local client + gateway + external APIs
  • security and compliance guardrails
  • local execution
  • least privilege, secrets handling, auditing, allowlists/deny lists
  • comprehensive testing
  • unit, integration, contract tests for tool schemas
  • end-to-end tests for common workflows
  • engineering excellence
  • code reviews, design docs, mentoring, incident follow-ups
  • operational standards
  • AI-powered services
  • agent workflows
  • internal builder platforms
  • GenAI-enabled capabilities
  • secure, governed environments
  • responsible AI patterns
  • guardrails
  • human-in-the-loop design
  • core agent orchestration components
  • UI → agent core logic → tool manager → local tools
  • local tool plugins
  • file read/list/edit, command execution, integrations
  • robust tool-call validation and execution
  • schema enforcement, parameter validation, retries, error handling, idempotency, and safe defaults
  • enterprise APIs via HTTPS
  • Microsoft Graph
  • user lookup, email/calendar actions
  • productivity scenarios
  • complex, high-risk, or ambiguous initiatives
  • rigorous code reviews, operational reviews, and architectural discussions
  • CI/CD, reliability engineering, and platform observability practices
  • performance, reliability, and cost benchmarks
  • product, governance, enterprise data, and infrastructure teams
  • complex technical concepts
  • business-impact narratives
  • consensus across senior engineers and engineering leaders
  • multi-team roadmaps
  • architectural fragmentation
  • emerging technical leaders
  • architectural maturity
  • reusable frameworks, patterns, and internal documentation
  • scale impact beyond your direct team
  • highly available, distributed systems
  • enterprise-scale software systems
  • inception to production
  • Python (building libraries/services/tools)
  • packaging/dependencies, logging, and performance troubleshooting
  • OAuth2/OIDC authentication
  • scope/permission models
  • schema/contract frameworks
  • JSON Schema, OpenAPI, Pydantic, protobuf
  • backward-compatible tool evolution
  • observability
  • structured logging, metrics, traces
  • debugging distributed flows across client + gateway
  • AI/ML ecosystems in production environments
  • architectural judgment

Other signals

  • platform-first ecosystem that centralizes enterprise data, provides AI foundations, and delivers intuitive AI solutions