Sr. Software Engineer - Distributed Systems

Microsoft · Big Tech · Redmond, WA +1 · Software Engineering

This role is for a Sr. Software Engineer on the Azure Event Grid Engine team, focusing on designing and implementing distributed systems for a next-generation PubSub service. The role involves owning technical design, defining architectural patterns, mentoring engineers, and ensuring high availability and reliability of large-scale cloud messaging components. The team's mission is to build the data platform for the age of AI, powering data-first applications.

What you'd actually do

  1. Drives requirements and design by partnering with stakeholders (e.g., program managers, technical leads, architects) to define and refine requirements for messaging system features. Proactively leverages telemetry, customer feedback channels, and usage patterns to inform architectural decisions and shape the product roadmap. Establishes continuous feedback loops that measure customer value, reliability metrics, and operational health to guide future design iterations.
  2. Owns design and implementation of highly available, distributed messaging components in the cloud. Architects extensible, maintainable solutions that prioritize diagnosability, reliability, and resilience at scale. Champions coding best practices, design patterns, and reusable frameworks across the team. Ensures code is production-ready with minimal defects and mentors other engineers on code quality standards through hands-on guidance and thorough code reviews.
  3. Defines the test strategy for messaging system components, establishing clear quality gates and success criteria across unit, integration, and end-to-end tests. Drives test coverage improvements, removes obsolete tests, and identifies gaps in the testing framework. Leads efforts to integrate automation into CI/CD pipelines, ensuring that messaging reliability and performance are continuously validated under realistic workloads.
  4. Elevates engineering productivity by identifying tooling gaps in the development lifecycle for cloud messaging systems. Designs and builds internal tools, frameworks, and libraries that accelerate development, debugging, and operational workflows. Evaluates and advocates for open-source solutions where appropriate. Mentors the team on adopting modern tooling practices and fosters a culture of continuous improvement in developer experience.
  5. Leads incident response and operational excellence as a Designated Responsible Individual (DRI), monitoring messaging systems for degradation, downtime, or service interruptions. Drives rapid root-cause analysis and resolution for complex distributed systems issues, coordinates with cross-functional teams, and communicates status to stakeholders. Ensures response within SLA timeframes, authors post-incident reviews, and drives systemic improvements to prevent recurrence.

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Nice to have

  • Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C#/.NET or equivalent backend languages OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C#/.NET or equivalent backend languages.
  • Proven experience designing and delivering large-scale backend or distributed systems
  • Experience leading technical design for

What the JD emphasized

  • architecting distributed systems
  • large-scale services
  • highly available, distributed messaging components
  • reliability
  • resilience at scale
  • production-ready
  • quality gates
  • automation into CI/CD pipelines
  • incident response
  • operational excellence
  • complex distributed systems issues
  • SLA timeframes