Principal Software Engineer, Agentic AI DevOps

Amazon · Big Tech · CA, BC +1 · Software Development

This Principal Software Engineer role focuses on building agentic AI solutions for AWS DevOps, aiming to accelerate incident response and improve operational efficiency for production systems. The role involves working with information retrieval systems, knowledge graphs, and LLMs to create a frontier agent that resolves incidents and learns from them for systemic improvements.

What you'd actually do

  1. Build agentic AI solutions that accelerate incident response and drive continuous operational improvements in production systems, for both Amazon-internal and external customers
  2. Help lead the strategy for the application map (topology) service
  3. Work on a frontier agent that uses information retrieval systems, knowledge graphs, and large language models (LLMs)
  4. Work on solutions that use generative AI technologies to improve the developer experience for builders who develop and operate applications on AWS
  5. Help builders resolve incidents quickly in live systems, then learn from those incidents to make systemic improvements

Skills

Required

  • 10+ years of software development experience
  • Experience partnering with product or program management teams
  • Experience managing multiple concurrent programs, projects and development teams in an Agile environment

Nice to have

  • Experience communicating with users, other technical teams, and senior leadership to collect requirements and to describe software product features, technical designs, and product strategy

What the JD emphasized

  • 10+ years of software development experience
  • Experience partnering with product or program management teams
  • Experience managing multiple concurrent programs, projects and development teams in an Agile environment

Other signals

  • building agentic AI solutions
  • accelerate incident response
  • drive continuous operational improvements
  • frontier agent
  • information retrieval systems
  • knowledge graphs
  • large language models (LLMs)
  • generative AI technologies
  • improve the developer experience
  • resolve incidents quickly in live systems
  • learn from those incidents to make systemic improvements