Senior Site Reliability Engineer

Microsoft · Big Tech · IN · Software Engineering

Senior Site Reliability Engineer for the Azure Data engineering team, focusing on building and maintaining Microsoft's operational database systems, including Azure DocumentDB. The role involves ensuring service stability, performance, and manageability through software improvements, telemetry, alerting, and automation.

What you'd actually do

  1. Identify opportunities and drive the design and implementation of end-to-end telemetry, alerting, self-healing and automation capabilities to improve service health, manageability, and reliability.
  2. Participate in on-call rotations and own, triage, investigate and resolve service issues with an emphasis on broad communications, learning & teaching throughout the process.
  3. Interact with customers / support representatives and communicate on a deeply technical level with product engineering and product management teams to evolve services.
  4. Own availability, performance, and supportability targets for the service.
  5. Author functional and technical documentation and remain current on relevant technologies and procedures.

Skills

Required

  • Software development
  • Complexity analysis
  • Scalable system design
  • Telemetry
  • Alerting
  • Self-healing
  • Automation
  • Troubleshooting
  • Debugging
  • KQL
  • Distributed service layers
  • C++
  • C#
  • PowerShell
  • Python

Nice to have

  • Understanding of distributed systems
  • Understanding of networking

What the JD emphasized

  • 7+ years of experience writing tools and automation/scripts (PowerShell, Python, or similar), programming (C++, C#, or equivalent), and enhancing subcomponents within and around services/products to deliver and manage software in production.
  • 7+ years of troubleshooting/debugging experience: telemetry-based analysis (KQL or equivalent preferred), troubleshooting skills across network, hardware, and distributed service layers, with demonstrated ability to debug, fix, and optimize code.