Staff Software Engineer - Grafana Cloud k6 | USA | Remote

Grafana Labs · Data AI · Canada, United States · Remote · R&D: Performance testing (k6)

Staff Software Engineer role focused on establishing and scaling a cross-team culture of engineering excellence by setting standards and guiding adoption of strong DevOps/SRE practices. The role will expand into broader application and product development leadership, contributing architectural and technical depth beyond operational excellence. The company encourages pragmatic AI-assisted development for prototyping, test generation, refactors, documentation, and incident follow-ups, and provides access to frontier models.

What you'd actually do

  1. Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
  2. Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
  3. Establish reliability frameworks such as SLIs/SLOs and error budgets, and use them to guide prioritization and engineering trade-offs.
  4. Provide visibility into system health through clear operational metrics and reliability reporting.
  5. Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.

Skills

Required

  • DevOps/SRE practices
  • operating and evolving production systems at scale
  • programming in a modern language (Python and Go preferred)
  • designing, building, and operating large-scale distributed systems
  • reliability engineering concepts (incident management, observability, failure modes)
  • test automation (performance and functional testing)
  • technical communication
  • interpersonal skills
  • modern software engineering processes and delivery practices
  • working autonomously and handling ambiguity

Nice to have

  • containerized and cloud-native systems (Docker, Kubernetes, AWS)
  • observability tooling and platforms (Grafana stack)
  • Python, Go, JavaScript and/or Jsonnet
  • performance testing
  • AI-assisted development

What the JD emphasized

  • strong DevOps/SRE practices
  • reliability
  • availability
  • operational excellence
  • large-scale, distributed cloud systems
  • modern AI coding assistants
  • frontier models