Staff AI Engineer | Canada | Remote

Grafana Labs · Data AI · Canada, United States · Remote · Sales Operations

Grafana Labs is hiring a Staff AI Engineer to own the AI agent infrastructure and automation platform for its GTM teams, building multi-agent architectures, LLM integrations, and backend services. The role involves shipping production systems, defining technical direction, and partnering with Data Engineering, GTM Systems, Field Operations, and GTM leadership to create scalable, self-service automation. Responsibilities include end-to-end development of multi-agent systems, building agentic skills, implementing observability, establishing governance, integrating AI models with business systems, architecting RAG flows, and designing and deploying workflows with CI/CD standards. The role requires 8+ years of software engineering experience, 2+ years applying LLMs to production workflows, proficiency in Python and JavaScript/Node.js, hands-on experience with LLM frameworks and patterns (prompt engineering, RAG, tool use, evaluation), and experience building and operating multi-agent systems at scale. Familiarity with GCP, BigQuery, serverless/containerized services, and LLM failure modes is also required. The role emphasizes diagnosing business problems, thinking in workflows, identifying high-leverage opportunities, and using AI-assisted development tools.

What you'd actually do

Own end-to-end development of multi-agent AI systems, from architecture and implementation through testing, deployment, and ongoing operation
Build modular, composable agentic systems using orchestration frameworks (LangChain, CrewAI, Anthropic MCP, or similar) that operate 24/7 across teams
Implement observability and feedback loops including logging, performance metrics, prompt iteration, model evaluation, and cost management
Architect data flows for retrieval-augmented generation (RAG), connecting LLMs to internal knowledge bases, customer data, and real-time business context
Partner with RevOps, Demand Generation, Regional Marketing, and SDR teams to scope high-impact automation problems, identify bottlenecks, and build solutions with measurable business outcomes

Skills

Required

Python
JavaScript/Node.js
Git-based workflows
code review practices
testing discipline
LLM frameworks and patterns
prompt engineering
RAG
function calling/tool use
structured output parsing
evaluation
multi-agent systems at scale
agent decomposition
orchestration patterns
state management
production monitoring
Google Cloud Platform
BigQuery
serverless/containerized services
LLM failure modes
production mitigations
confidence thresholds
fallback logic
human escalation
cost/latency management
AI-assisted development tools

Nice to have

LangChain
CrewAI
Anthropic MCP
n8n
Workato
GCP Cloud Functions
Cloud Run
Grafana's cloud infrastructure

What the JD emphasized

own the AI agent infrastructure and automation platform
ship production systems
own the technical direction
identify the highest-leverage problems
deliver end-to-end with minimal direction
2+ years hands-on experience applying LLMs/AI to production workflows, not just prototypes
Experience building and operating multi-agent systems at scale

Other signals

build multi-agent architectures
LLM integrations
backend services that connect AI models to internal and third-party data platforms
ship production systems
own the technical direction
identify the highest-leverage problems
design the solutions
ship them
define the technical direction for the automation platform
partner with Data Engineering, GTM Systems, Field Operations, and GTM leadership to build scalable, self-service automation
eliminates manual work and drives operational efficiency
Own end-to-end development of multi-agent AI systems
architecture and implementation through testing, deployment, and ongoing operation
Build modular, composable agentic systems using orchestration frameworks
Develop reusable agentic skills that agents invoke across interfaces
Implement observability and feedback loops including logging, performance metrics, prompt iteration, model evaluation, and cost management
Establish governance and compliance standards for AI workflows including access controls, audit trails, PII handling, and human-in-the-loop escalation paths
Build MCP servers, APIs, CLIs, and microservices connecting AI models to business systems
Architect data flows for retrieval-augmented generation (RAG)
connecting LLMs to internal knowledge bases, customer data, and real-time business context
Build serverless or containerized services that scale with usage and integrate with Grafana's cloud infrastructure
Partner with RevOps, Demand Generation, Regional Marketing, and SDR teams to scope high-impact automation problems
identify bottlenecks, and build solutions with measurable business outcomes
Design and deploy workflows using orchestration tools with CI/CD, testing, and production reliability standards
Build systems designed for self-service with documentation, playbooks, and enablement materials
invest heavily in developer productivity
access to AI coding assistants
encourage pragmatic AI-assisted development paired with strong code review and quality standards
8+ years of software engineering experience with depth in backend development, systems integration, or data/analytics engineering
2+ years hands-on experience applying LLMs/AI to production workflows, not just prototypes
Strong proficiency in Python and JavaScript/Node.js with Git-based workflows, code review practices, and testing discipline
Hands-on experience with LLM frameworks and patterns including prompt engineering, RAG, function calling/tool use, structured output parsing, and evaluation
Experience building and operating multi-agent systems at scale including agent decomposition, orchestration patterns, state management, and production monitoring
You diagnose business problems before writing code
You think in workflows and outcomes, not just functions
Deep familiarity with Google Cloud Platform, BigQuery, and serverless/containerized services
Understanding of LLM failure modes and production mitigations including confidence thresholds, fallback logic, human escalation, and cost/latency management
Proven ability to identify high-leverage problems, push back on low-impact requests, and deliver end-to-end with minimal direction
Fluent with AI-assisted development tools
You use AI to build AI systems
Clear technical communicator

Grafana Labs is a remote-first, open-source powerhouse. There are more than 20M users of Grafana, the open source visualization tool, around the globe, monitoring everything from beehives to climate change in the Alps. The instantly recognizable dashboards have been spotted everywhere from a NASA launch and Minecraft HQ to Wimbledon and the Tour de France. Grafana Labs also helps more than 3,000 companies -- including Bloomberg, JPMorgan Chase, and eBay -- manage their observability strategies with the Grafana LGTM Stack, which can be run fully managed with Grafana Cloud or self-managed with the Grafana Enterprise Stack, both featuring scalable metrics (Grafana Mimir), logs (Grafana Loki), and traces (Grafana Tempo).

We’re scaling fast and staying true to what makes us different: an open-source legacy, a global collaborative culture, and a passion for meaningful work. Our team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything we do.

You may not meet every requirement, and that’s okay. If this role excites you, we’d love you to raise your hand for what could be a truly career-defining opportunity.

This is a remote opportunity, and at this time we are only considering applicants in Canadian time zones.

Staff AI Engineer

**The Opportunity:**

Grafana's Revenue Operations organization is looking for a Staff AI Engineer to own the AI agent infrastructure and automation platform that powers our GTM teams. You'll build multi-agent architectures, LLM integrations, and backend services that connect AI models to internal and third-party data platforms. You'll ship production systems that teams depend on daily.

This is a high-autonomy role where you own the technical direction. You'll identify the highest-leverage problems across Sales, Customer Success, and Marketing, design the solutions, and ship them. You'll define the technical direction for the automation platform—data models, API contracts, shared libraries, reference architectures—and partner with Data Engineering, GTM Systems, Field Operations, and GTM leadership to build scalable, self-service automation that eliminates manual work and drives operational efficiency.

What You’ll Be Doing:

Agentic Tool Development

Own end-to-end development of multi-agent AI systems, from architecture and implementation through testing, deployment, and ongoing operation
Build modular, composable agentic systems using orchestration frameworks (LangChain, CrewAI, Anthropic MCP, or similar) that operate 24/7 across teams
Develop reusable agentic skills that agents invoke across interfaces (Slack, dashboards, internal apps, CLIs)
Implement observability and feedback loops including logging, performance metrics, prompt iteration, model evaluation, and cost management
Establish governance and compliance standards for AI workflows including access controls, audit trails, PII handling, and human-in-the-loop escalation paths

Systems Integration & Backend Services

Build MCP servers, APIs, CLIs, and microservices connecting AI models to business systems (BigQuery, Slack, CRMs, email, calendars, analytics tools)
Architect data flows for retrieval-augmented generation (RAG), connecting LLMs to internal knowledge bases, customer data, and real-time business context
Build serverless or containerized services (GCP Cloud Functions, Cloud Run) that scale with usage and integrate with Grafana's cloud infrastructure

Automation & Workflow Manufacturing

Partner with RevOps, Demand Generation, Regional Marketing, and SDR teams to scope high-impact automation problems, identify bottlenecks, and build solutions with measurable business outcomes
Design and deploy workflows using orchestration tools (n8n, Workato, or custom platforms) with CI/CD, testing, and production reliability standards
Build systems designed for self-service with documentation, playbooks, and enablement materials that let partner teams operate independently

We invest heavily in developer productivity. You'll have access to AI coding assistants (Claude Code, Gemini CLI, OpenAI Codex, and others of your choice within security guidelines). We encourage pragmatic AI-assisted development paired with strong code review and quality standards.

What Makes You a Great Fit / Requirements:

8+ years of software engineering experience with depth in backend development, systems integration, or data/analytics engineering
2+ years hands-on experience applying LLMs/AI to production workflows, not just prototypes
Strong proficiency in Python and JavaScript/Node.js with Git-based workflows, code review practices, and testing discipline
Hands-on experience with LLM frameworks and patterns including prompt engineering, RAG, function calling/tool use, structured output parsing, and evaluation
Experience building and operating multi-agent systems at scale including agent decomposition, orchestration patterns (sequential chains, router/dispatcher, parallel fan-out), state management, and production monitoring
You diagnose business problems before writing code. You think in workflows and outcomes, not just functions.
Deep familiarity with Google Cloud Platform, BigQuery, and serverless/containerized services (Cloud Functions, Cloud Run)
Understanding of LLM failure modes and production mitigations including confidence thresholds, fallback logic, human escalation, and cost/latency management
Proven ability to identify high-leverage problems, push back on low-impact requests, and deliver end-to-end with minimal direction
Fluent with AI-assisted development tools (GitHub Copilot, Cursor, Claude Code). You use AI to build AI systems.
Clear technical communicator—you can explain complex systems in simple terms to both engineers and business stakeholders

Bonus Points For:

Experience with frontend frameworks & tooling (React, Slack Block Kit, dashboard components) to build user-facing interfaces for AI tools
Familiarity with GTM platforms like Salesforce, HubSpot, Outreach, Gainsight, or similar CRM/sales engagement tools
Experience with vector databases or retrieval pipelines (Pinecone, Weaviate, ChromaDB, pgvector, or similar)
Prior work automating sales, customer success, or marketing workflows in a B2B SaaS environment
Experience with workflow automation platforms like n8n, Prefect, Clay, PhantomBuster, Apify, Dust, or similar tools
Familiarity with Model Context Protocol (MCP) or similar standards for connecting AI systems to data sources and tools
Exposure to observability tools for AI systems (LangSmith, Weights & Biases, custom logging/evaluation frameworks)
Experience working in Revenue Operations, GTM Analytics, or Sales Operations environments
Previous experience in open source or developer-focused SaaS companies—Grafana is built on OSS and we value engineers who share that DNA

Compensation & Rewards:

In Canada, the base compensation range for this role is CAD 186,368 - CAD 230,000. Actual compensation may vary based on level, experience, and skillset as assessed throughout the interview process. All of our roles include Restricted Stock Units (RSUs), giving every team member ownership in Grafana Labs' success. We believe in shared outcomes—RSUs help us stay aligned and invested as we scale globally.

*Compensation ranges are country specific. If you are applying for this role from a different location than listed above, your recruiter will discuss your specific market’s defined pay range & benefits at the beginning of the process.

Why You’ll Thrive at Grafana Labs:

**100% Remote, Global Culture** - As a remote-only company, we bring together talent from around the world, united by a culture of collaboration and shared purpose.
Scaling Organization – Tackle meaningful work in a high-growth, ever-evolving environment.
Transparent Communication – Expect open decision-making and regular company-wide updates.
Innovation-Driven – Autonomy and support to ship great work and try new things.
Open Source Roots – Built on community-driven values that shape how we work.
Empowered Teams – High trust, low ego culture that values outcomes over optics.
Career Growth Pathways – Defined opportunities to grow and develop your career.
Approachable Leadership – Transparent execs who are involved, visible, and human.
Passionate People – Join a team of smart, supportive folks who care deeply about what they do.
**In-Person onboarding** - We want you to thrive from day 1 with your fellow new ‘Grafanistas’ to learn all about what we do and how we do it.
Balance is Key - We operate a global annual leave policy of 30 days per annum. 3 days of your annual leave entitlement are reserved for Grafana Shutdown Days to allow the team to really disconnect. *We will comply with local legislation where applicable.

Equal Opportunity Employer: We will recruit, train, compensate and promote regardless of race, religion, color, national origin, gender, disability, age, veteran status, and all the other fascinating characteristics that make us different and unique. We believe that equality and diversity build a strong organization and we’re working hard to make sure that’s the foundation of our organization as we grow.

Grafana Labs may utilize AI tools in its recruitment process to assist in matching information provided in CVs to job postings. The recruitment team will continue to review inbound CVs manually to identify alignment with current openings.

_For information about how your personal data is used once you’ve applied to a job, check out our privacy policy._