Software Engineer - Observability

Microsoft · Big Tech · Dublin, D, Ireland · Software Engineering

Software Engineer on Microsoft's Azure Data team, focusing on the Observability Platform. The role involves designing, developing, and operating large-scale telemetry ingestion pipelines that handle exabytes of data daily and trillions of signals. Responsibilities include building APIs, integrating ML-based anomaly detection, ensuring reliability and scalability, and participating in on-call rotations. The platform underpins observability across Azure, Office, Windows, and Xbox.

What you'd actually do

  1. Design, develop, and operate large-scale, multi-tenant telemetry ingestion pipelines and services (real-time and batch) to handle massive data volumes.
  2. Build and enhance APIs, tools, and subsystems for telemetry collection, routing, storage, and efficient data access.
  3. Integrate advanced capabilities (e.g., machine learning–based anomaly detection and data validation) to enhance platform intelligence and insights.
  4. Own core components of the ingestion and observability platform, driving continuous improvements in reliability, scalability, performance, and data quality.
  5. Implement robust monitoring, alerting, and diagnostics, and keep production services running reliably, including by participating in on-call rotations and incident response.

Skills

Required

  • Software development experience, with a demonstrated record of shipping products or services
  • Solid understanding of data structures, algorithms, and system design fundamentals
  • Strong problem-solving and analytical skills, with a structured approach to software design
  • Ability to collaborate effectively in a cross-functional team environment
  • Strong communication skills

Nice to have

  • Experience with cloud platforms (Azure, AWS, or Google Cloud)
  • Experience building high-performance, scalable, and high-throughput systems
  • Experience with Service Fabric, AKS, or Azure DevOps
  • Experience debugging and resolving complex production issues
  • Experience working with AI/ML concepts or systems

What the JD emphasized

  • trillions of signals
  • Exabyte of data
  • global scale
  • reliability
  • scalability
  • performance
  • data quality
  • machine learning–based anomaly detection

Other signals

  • real-time insights
  • telemetry ingestion pipelines