Staff AI Engineer | US | Remote

Grafana Labs · Data AI · Canada, United States · Remote · Sales Operations

Staff AI Engineer to own the AI agent infrastructure and automation platform for GTM teams. Build multi-agent architectures, LLM integrations, and backend services that connect AI models to internal and third-party data platforms. Ship production systems, define technical direction, and partner with GTM teams to build scalable, self-service automation.

What you'd actually do

  1. Own end-to-end development of multi-agent AI systems, from architecture and implementation through testing, deployment, and ongoing operation
  2. Build modular, composable agentic systems using orchestration frameworks (LangChain, CrewAI, Anthropic MCP, or similar) that operate 24/7 across teams
  3. Build MCP servers, APIs, CLIs, and microservices connecting AI models to business systems (BigQuery, Slack, CRMs, email, calendars, analytics tools)
  4. Partner with RevOps, Demand Generation, Regional Marketing, and SDR teams to scope high-impact automation problems, identify bottlenecks, and build solutions with measurable business outcomes
  5. Implement observability and feedback loops including logging, performance metrics, prompt iteration, model evaluation, and cost management

Skills

Required

  • Python
  • JavaScript/Node.js
  • Git-based workflows
  • code review practices
  • testing discipline
  • LLM frameworks and patterns
  • prompt engineering
  • RAG
  • function calling/tool use
  • structured output parsing
  • evaluation
  • multi-agent systems
  • agent decomposition
  • orchestration patterns
  • state management
  • production monitoring
  • Google Cloud Platform
  • BigQuery
  • serverless/containerized services
  • LLM failure modes
  • production mitigations
  • confidence thresholds
  • fallback logic
  • human escalation
  • cost/latency management
  • AI-assisted development tools

Nice to have

  • LangChain
  • CrewAI
  • Anthropic MCP
  • n8n
  • Workato
  • Grafana's cloud infrastructure
  • GitHub Copilot
  • Cursor
  • Claude Code
  • OpenAI Codex

What the JD emphasized

  • own the AI agent infrastructure and automation platform
  • ship production systems
  • own the technical direction
  • apply LLMs/AI to production workflows
  • building and operating multi-agent systems at scale
  • identify high-leverage problems

Other signals

  • build multi-agent architectures
  • LLM integrations
  • backend services that connect AI models to internal and third-party data platforms