Senior Staff Software Engineer - Agentic Automation

NVIDIA · Semiconductors · Santa Clara, CA

Senior Staff Software Engineer to own engineering efforts for NVIDIA enterprise systems, transforming support into AI-infused automated resolution systems using LLM-based agents, tool calling, RAG, and orchestration frameworks. Requires full-stack experience, strong systems thinking, and incident management skills.

What you'd actually do

  1. Design and implement agentic AI workflows using LLM-based agents, tool calling, RAG patterns, and orchestration frameworks. Push the boundaries of what AI-assisted operations can achieve.
  2. Build robust integrations and automation pipelines across ServiceNow, identity management, monitoring platforms, and enterprise SaaS. Own the full stack from infrastructure to user-facing tools.
  3. Triage and resolve enterprise issues, with a focus on automation and on improving mitigation and resolution times.
  4. Manage and troubleshoot enterprise-scale collaboration, productivity, AI, and infrastructure systems.
  5. Trace and root-cause complex, multi-system failures, identify patterns in recurring tickets, and build automation or self-service solutions.

Skills

Required

  • Bachelor’s or Master’s degree in Computer Science, Engineering, IT, or related field (or equivalent experience)
  • 12+ years of overall experience in SRE, enterprise support, or DevOps
  • Experience with SaaS, hybrid cloud, and AI/ML environments
  • Experience building production-grade agentic workflows (e.g., multi-agent systems and MCP servers)
  • Software engineering fundamentals with deep experience in building products and operating large-scale systems.
  • Expertise in two or more backend languages such as Go, Python, or Java with a track record of owning complex production systems.
  • Full-stack engineering experience, including building user-facing web applications and operational dashboards using modern frontend frameworks such as React.js, along with backend APIs and data pipelines.
  • Systems thinker who naturally traces dependencies, considers second-order effects, and asks "why did this break?" not just "how do I fix it?"
  • Strong incident management skills: triage, root-cause analysis, blameless postmortems, pattern recognition
  • Expert troubleshooting across the enterprise hybrid stack, such as Jira, Microsoft, OS [Apple, Linux, and Windows], and infrastructure systems such as compute, AI, and storage.

What the JD emphasized

  • agentic AI workflows
  • LLM-based agents
  • tool calling
  • RAG patterns
  • orchestration frameworks
  • production-grade agentic workflows
  • multi-agent systems
  • Systems thinker
  • Expert troubleshooting

Other signals

  • AI-infused automated resolution systems
  • agentic AI workflows
  • LLM-based agents
  • tool calling
  • RAG patterns
  • orchestration frameworks