Staff AI Engineer | Canada | Remote

Grafana Labs · Data AI · Canada, United States · Remote · Sales Operations

This Staff AI Engineer role owns the AI agent infrastructure and automation platform for go-to-market (GTM) teams, building multi-agent architectures, LLM integrations, and backend services. It involves shipping production systems, defining technical direction, and partnering with Data Engineering, GTM Systems, Field Operations, and GTM leadership to create scalable, self-service automation. Responsibilities include end-to-end development of multi-agent systems, building reusable agentic skills, implementing observability, establishing governance, integrating AI models with business systems, architecting RAG data flows, and designing and deploying workflows to CI/CD standards. The role requires 8+ years of software engineering experience, 2+ years applying LLMs to production workflows, proficiency in Python and JavaScript/Node.js, hands-on experience with LLM frameworks and patterns (prompt engineering, RAG, tool use, evaluation), and experience building and operating multi-agent systems at scale. Familiarity with GCP, BigQuery, serverless/containerized services, and LLM failure modes is essential. The role emphasizes diagnosing business problems before writing code, thinking in workflows, identifying high-leverage opportunities, and using AI-assisted development tools.

What you'd actually do

  1. Own end-to-end development of multi-agent AI systems, from architecture and implementation through testing, deployment, and ongoing operation
  2. Build modular, composable agentic systems using orchestration frameworks (LangChain, CrewAI, Anthropic MCP, or similar) that operate 24/7 across teams
  3. Implement observability and feedback loops including logging, performance metrics, prompt iteration, model evaluation, and cost management
  4. Architect data flows for retrieval-augmented generation (RAG), connecting LLMs to internal knowledge bases, customer data, and real-time business context
  5. Partner with RevOps, Demand Generation, Regional Marketing, and SDR teams to scope high-impact automation problems, identify bottlenecks, and build solutions with measurable business outcomes

Skills

Required

  • Python
  • JavaScript/Node.js
  • Git-based workflows
  • code review practices
  • testing discipline
  • LLM frameworks and patterns
  • prompt engineering
  • RAG
  • function calling/tool use
  • structured output parsing
  • evaluation
  • multi-agent systems at scale
  • agent decomposition
  • orchestration patterns
  • state management
  • production monitoring
  • Google Cloud Platform
  • BigQuery
  • serverless/containerized services
  • LLM failure modes
  • production mitigations
  • confidence thresholds
  • fallback logic
  • human escalation
  • cost/latency management
  • AI-assisted development tools

Nice to have

  • LangChain
  • CrewAI
  • Anthropic MCP
  • n8n
  • Workato
  • GCP Cloud Functions
  • Cloud Run
  • Grafana's cloud infrastructure

What the JD emphasized

  • own the AI agent infrastructure and automation platform
  • ship production systems
  • own the technical direction
  • identify the highest-leverage problems
  • deliver end-to-end with minimal direction
  • 2+ years hands-on experience applying LLMs/AI to production workflows, not just prototypes
  • Experience building and operating multi-agent systems at scale

Other signals

  • build multi-agent architectures
  • LLM integrations
  • backend services that connect AI models to internal and third-party data platforms
  • design the solutions and ship them
  • define the technical direction for the automation platform
  • partner with Data Engineering, GTM Systems, Field Operations, and GTM leadership to build scalable, self-service automation that eliminates manual work and drives operational efficiency
  • Own end-to-end development of multi-agent AI systems, from architecture and implementation through testing, deployment, and ongoing operation
  • Build modular, composable agentic systems using orchestration frameworks
  • Develop reusable agentic skills that agents invoke across interfaces
  • Implement observability and feedback loops including logging, performance metrics, prompt iteration, model evaluation, and cost management
  • Establish governance and compliance standards for AI workflows including access controls, audit trails, PII handling, and human-in-the-loop escalation paths
  • Build MCP servers, APIs, CLIs, and microservices connecting AI models to business systems
  • Architect data flows for retrieval-augmented generation (RAG), connecting LLMs to internal knowledge bases, customer data, and real-time business context
  • Build serverless or containerized services that scale with usage and integrate with Grafana's cloud infrastructure
  • Partner with RevOps, Demand Generation, Regional Marketing, and SDR teams to scope high-impact automation problems, identify bottlenecks, and build solutions with measurable business outcomes
  • Design and deploy workflows using orchestration tools with CI/CD, testing, and production reliability standards
  • Build systems designed for self-service with documentation, playbooks, and enablement materials
  • invest heavily in developer productivity
  • access to AI coding assistants
  • encourage pragmatic AI-assisted development paired with strong code review and quality standards
  • 8+ years of software engineering experience with depth in backend development, systems integration, or data/analytics engineering
  • Strong proficiency in Python and JavaScript/Node.js with Git-based workflows, code review practices, and testing discipline
  • Hands-on experience with LLM frameworks and patterns including prompt engineering, RAG, function calling/tool use, structured output parsing, and evaluation
  • Experience building and operating multi-agent systems at scale including agent decomposition, orchestration patterns, state management, and production monitoring
  • You diagnose business problems before writing code
  • You think in workflows and outcomes, not just functions
  • Deep familiarity with Google Cloud Platform, BigQuery, and serverless/containerized services
  • Understanding of LLM failure modes and production mitigations including confidence thresholds, fallback logic, human escalation, and cost/latency management
  • Proven ability to identify high-leverage problems, push back on low-impact requests, and deliver end-to-end with minimal direction
  • Fluent with AI-assisted development tools
  • You use AI to build AI systems
  • Clear technical communicator