Software Engineer II

Microsoft · Big Tech · Hyderabad, TS, IN · Software Engineering

Software Engineer II role building and operating AI Agents as a Service for cloud operations, with a focus on agent capabilities, orchestration, evaluation, safety, and reliability in production environments.

What you'd actually do

  1. Take ownership of important areas of the Azure SRE Agent Platform, including agent capabilities, orchestration, evaluation, user experiences across different form factors, and supporting platform services
  2. Build and iterate on agentic systems, including tools, planning and execution loops, evaluations, and safety mechanisms
  3. Design and ship reliable capabilities that improve incident detection, diagnosis, mitigation, and operational learning
  4. Use telemetry, experiments, evaluations, and user feedback to guide iteration and investment
  5. Contribute to resilient, observable systems that operate safely and effectively in production

Skills

Required

  • Bachelor’s or Master’s degree in Computer Science, or equivalent practical experience.
  • 4+ years of experience building production software using one or more modern programming languages such as C#, C++, Go, Java, or Python.
  • Strong understanding of generative AI and software engineering fundamentals, data structures, and problem-solving.
  • Ability to learn new technologies quickly and adapt to deliver customer and business impact.

Nice to have

  • Hands-on experience building and operating LLM-powered agentic systems in production, with direct ownership of quality, reliability, and iteration
  • 3+ years of experience building and operating cloud platforms or distributed services, with depth in service architecture, deployment, and observability
  • Strong product mindset with a track record of owning ambiguous problem spaces and driving them to high-quality outcomes
  • Solid engineering fundamentals, including systems design, performance, and debugging in complex production environments
  • Track record of designing, running, and optimizing evaluations for agentic systems, including tools, prompts, and agent loops
  • Expertise with Kubernetes, container orchestration, or cloud-native infrastructure is a strong plus
  • Experience contributing to or leading open-source projects at scale is a plus

What the JD emphasized

  • production software
  • agentic systems
  • production
  • evaluations
  • production environments

Other signals

  • AI Agents as Service
  • agentic systems
  • production issues
  • virtual SRE teammates
  • recommend or perform fixes
  • quality, safety, security, enterprise scale and real-world impact
  • full lifecycle of agentic systems in production
  • core capabilities that shape agent behavior
  • tool design, planning and execution loops, orchestration, evaluation, and safety guardrails
  • operational foundations
  • observability, progressive delivery, reliability engineering, and live-site learning
  • user experience for these agents
  • full-stack Software Engineer II
  • next generation of agentic systems for cloud operations
  • product quality, end-to-end ownership
  • exciting prototype from a system people trust during critical moments
  • high autonomy in a highly agile environment
  • short cycles, thin slices, feature flags, progressive delivery, and constant learning
  • strong owner’s mindset and a strong bias for action
  • ownership of ambiguous problems
  • adopt modern science research, engineering patterns & practices
  • move quickly, learn from production
  • continuously raise the quality bar as they ship
  • Partner closely with engineers, SREs, and product counterparts to turn ambiguous problems into high-quality shipped solutions
  • Participate in debugging, live-site learning, and post-incident hardening to continuously improve system quality
  • Contribute to architecture, engineering standards, and development practices across the team