Staff Platform Monitoring Engineer

Databricks · Data AI · Amsterdam, Netherlands · Support

Databricks is seeking a Staff Platform Monitoring Engineer to join its Platform Monitoring Team. This role focuses on platform reliability, incident response, and observability for the Databricks Data and AI infrastructure platform. Responsibilities include leading incident investigations, conducting root cause analysis, designing alerting and observability workflows, and building automation tools. The role requires experience in SRE/DevOps, cloud providers, containerization, and monitoring tools such as ELK, Prometheus, and Grafana.

What you'd actually do

  1. Lead platform incident investigation, coordinating cross-functional teams through rapid detection, mitigation, and resolution to minimize customer impact.
  2. Conduct thorough post-incident root cause analysis across infrastructure, services, and cloud providers to identify systemic patterns and prevent future occurrences.
  3. Design and implement customer-focused alerting pipelines and end-to-end observability workflows to enhance detection coverage and reduce mean time to detection.
  4. Build automation tools, establish reusable monitoring patterns, and resolve reliability gaps that directly impact customer experience.

Skills

Required

  • SRE
  • DevOps
  • Production Engineering
  • AWS
  • Azure
  • GCP
  • Docker
  • Kubernetes
  • ELK
  • Prometheus
  • Grafana
  • PagerDuty
  • Python
  • incident response
  • root cause analysis
  • observability
  • monitoring
  • logging
  • alerting

What the JD emphasized

  • Minimum of 5 years of experience as an SRE, DevOps Engineer, Production Engineer, or similar role.
  • Production-level experience with at least one major cloud provider (AWS, Azure, GCP) and proficiency in container and orchestration technologies (Docker, Kubernetes).
  • Hands-on experience with monitoring, logging, and alerting tools such as ELK, Prometheus, Grafana, and PagerDuty. Ability to architect monitoring solutions that correlate metrics, logs, and traces.
  • Strong proficiency in Python or similar languages with the ability to build production-quality automation tools.
  • Experience owning critical phases of the incident lifecycle from detection through resolution and post-mortem analysis in demanding production environments.