Staff AI Engineer - Grafana Ops, AI/ML | USA | Remote

Grafana Labs · Data AI · Canada, United States · Remote · R&D: Ops

This Staff AI Engineer role focuses on building and shipping AI-powered features for observability tools, specifically to reduce toil and assist users with incident response through LLM and agentic workflows. The role emphasizes rapid experimentation, iteration, and collaboration to deliver scalable, production-ready AI solutions.

What you'd actually do

  1. Build and deliver AI solutions: Take ownership of developing high-performance AI features to help users detect, triage, and resolve incidents using observability data and tools.
  2. Rapid experimentation and iteration: Work in a highly iterative process in which you quickly prototype, test, and validate with real users, including shipping and evolving LLM- or agent-powered workflows for incident lifecycle management and automated analysis tasks.
  3. Collaborate cross-functionally: Work with data analysts, product managers, and designers to shape AI-driven product features, including integration of agentic components with internal tools, alerting systems, runbooks, and developer workflows.
  4. Utilize AI tools effectively: Use AI and automation tools to enhance both product functionality and your own development workflows.
  5. Ownership and impact: Take full ownership of the AI solutions you develop, ensuring they are not only innovative but also scalable, maintainable, and aligned with real user workflows.

Skills

Required

  • Experience with LLMs, prompt engineering, and building applications powered by GenAI.
  • Proven track record of delivering software that made it into production and is actively used by users.
  • Strong engineering skills: Solid experience building production software systems (backend and/or full stack).
  • AI experience with a practical mindset.
  • Quick iteration and experimentation.
  • Proven initiative.
  • Collaborative attitude.
  • Exposure to working in cloud-native environments (e.g., AWS, GCP, Azure).

Nice to have

  • Experience with agentic components.
  • Experience with observability data and tools.

What the JD emphasized

  • shipping and scaling impactful features
  • shipping and evolving LLM- or agent-powered workflows
  • Proven track record of delivering software that made it into production and is actively used by users.

Other signals

  • AI-driven features
  • LLM- or agent-powered workflows
  • shipping and scaling impactful features
  • AI solutions
  • GenAI applications