(ind) Staff, Software Engineer

Walmart · Retail · Chennai, India

The Staff Software Engineer role focuses on managing the availability and performance of global sites, specifically in the context of AI agentic frameworks. The role involves active, autonomous oversight of digitalization, managing 'human-in-the-loop' systems where AI agents plan and execute tasks, ensuring alignment with business goals, security, and safety. Key responsibilities include prompt engineering, setting up evaluation frameworks for AI agent performance, and incident management for critical systems.

What you'd actually do

  1. Drives the execution of multiple business plans and projects by identifying customer and operational needs; developing and communicating business plans and priorities; removing barriers and obstacles that impact performance; providing resources; identifying performance standards; measuring progress and adjusting performance accordingly; developing contingency plans; and demonstrating adaptability and supporting continuous learning.
  2. Provides supervision and development opportunities for associates by selecting and training; mentoring; assigning duties; building a team-based work environment; establishing performance expectations and conducting regular performance evaluations; providing recognition and rewards; coaching for success and improvement; and promoting a belonging mindset in the workplace.
  3. Promotes and supports company policies, procedures, mission, values, and standards of ethics and integrity by training and providing direction to others in their use and application; ensuring compliance with them; and utilizing and supporting the Open Door Policy.
  4. Ensures business needs are being met by evaluating the ongoing effectiveness of current plans, programs, and initiatives; consulting with business partners, managers, co-workers, or other key stakeholders; soliciting, evaluating, and applying suggestions for improving efficiency and cost-effectiveness; and participating in and supporting community outreach events.
  5. The TDO is responsible for the availability and performance of our global sites. The TDO will take command and control of Major Incidents, focusing on restoration by identifying and coordinating with the appropriate resources through all phases of triage, restoration, and validation.

Skills

Required

  • Incident management skills
  • Methodical and systematic problem-solving approach
  • Experience investigating, analysing, and troubleshooting large-scale enterprise systems
  • Understanding of Unix/Linux systems
  • Experience working with and developing enterprise monitoring/tooling solutions such as Grafana, Prometheus, Kibana, Splunk, Graphite, Dynatrace, and Catchpoint
  • Working knowledge of one or more cloud technologies such as Azure, GCP, and OpenStack
  • Expert verbal and written communication skills
  • Excellent judgement in decision-making
  • Strong focus on collecting and inferring metrics
  • Excellent communication skills

Nice to have

  • Prompt Engineering
  • Evaluation frameworks

What the JD emphasized

  • Agentic AI framework: requires a shift from passive monitoring to active, autonomous oversight of digitalisation.
  • The core requirement is managing "human-in-the-loop" systems where AI agents plan and execute tasks, while the TDO ensures alignment with business goals, security, and safety.
  • Prompt Engineering and Evaluation: Proficiency in designing prompts that enable agents to act reliably and setting up evaluation frameworks for monitoring performance.

Other signals

  • managing human-in-the-loop systems
  • AI agents plan and execute tasks
  • Prompt Engineering and Evaluation
  • setting up evaluation frameworks for monitoring performance