Senior Software Engineer II - Applied AI and Evaluations (Remote Eligible)

Smartsheet · Seattle · United States · Engineering - Developers

Senior Software Engineer II focused on Applied AI and Evaluations for Smartsheet's AI-powered work management platform (SmartAssist). The role involves owning agent quality end-to-end, including diagnosis, improvement, and validation across orchestrators and subagents. Responsibilities include identifying failure modes, driving quality improvements through prompt/context engineering and RAG tuning, and extending the evaluation framework. This is a deeply technical role at the intersection of LLM evaluation, prompt engineering, and RAG, not a traditional QA role.

What you'd actually do

  1. Own agent quality end-to-end: diagnosis, improvement, and validation across SmartAssist's orchestrator and subagents
  2. Identify failure modes across quality dimensions (factual accuracy, completeness, tone, actionability, and latency) and prioritize what to fix
  3. Drive quality improvements through prompt engineering, context engineering, and RAG retrieval tuning
  4. Extend and mature our evaluation framework: scorers, golden datasets, regression gates, and online evaluation for production traffic
  5. Close the feedback loop: ensure that every change has a measurable, attributable quality signal
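The evaluation-framework pieces named above (scorers, golden datasets, regression gates) can be sketched in a few lines of Python. This is a hypothetical, minimal illustration of the pattern, not Smartsheet's actual framework; all names (`GoldenExample`, `keyword_scorer`, `regression_gate`, the stub agent) are assumptions made for the example.

```python
# Minimal sketch of a scorer + golden dataset + regression gate.
# Illustrative only -- not Smartsheet's SmartAssist evaluation framework.
from dataclasses import dataclass


@dataclass
class GoldenExample:
    question: str
    expected_keywords: list  # facts the answer must mention


def keyword_scorer(answer: str, example: GoldenExample) -> float:
    """Fraction of expected keywords present (a crude completeness scorer)."""
    hits = sum(1 for kw in example.expected_keywords if kw.lower() in answer.lower())
    return hits / len(example.expected_keywords)


def run_eval(agent, golden_set: list) -> float:
    """Mean scorer value across the golden dataset."""
    scores = [keyword_scorer(agent(ex.question), ex) for ex in golden_set]
    return sum(scores) / len(scores)


def regression_gate(new_score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Block a release when quality drops more than `tolerance` below baseline."""
    return new_score >= baseline - tolerance


# Usage with a stub agent standing in for the real orchestrator:
golden = [GoldenExample("Which plan includes Gantt charts?", ["Gantt", "plan"])]
stub_agent = lambda q: "Gantt charts are included in the Business plan."
score = run_eval(stub_agent, golden)
assert regression_gate(score, baseline=0.9)
```

In a real pipeline the gate check would run in CI against production-sampled traffic, and the scorer would likely be an LLM judge or task-specific metric rather than keyword matching; the shape, however, stays the same.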

Skills

Required

  • 8+ years of software engineering experience
  • 2+ years working directly with LLMs in production
  • Deep, hands-on experience with prompt engineering and context engineering
  • Strong working knowledge of RAG architectures
  • Experience building or extending LLM evaluation frameworks
  • Fluency in agent system design
  • Strong Python skills
  • Experience working in data-heavy environments (Databricks, Delta tables, or equivalent)
  • Ability to communicate complex quality findings (written and verbal)
  • Strong cross-functional judgment
  • A bias for clarity in ambiguous situations

Nice to have

  • Experience with MLflow or similar experiment tracking platforms
  • Familiarity with CI-integrated evaluation pipelines
  • Experience with multi-agent orchestration frameworks
  • Prior work in an Applied AI or LLMOps function within a product company

What the JD emphasized

  • quality is the critical frontier
  • deeply technical
  • high-autonomy
  • intersection of LLM evaluation, prompt and context engineering, and retrieval-augmented generation
  • diagnose why our agents fail
  • design the systems that catch regressions
  • drive measurable improvements
  • shipped evaluation infrastructure
  • building toward a mature Agent Development Lifecycle (ADLC)
  • Delivered measurable, validated quality improvement
  • Expanded evaluation coverage
  • Established a repeatable quality improvement methodology

Other signals

  • LLM evaluation
  • agent quality
  • RAG tuning
  • prompt engineering
  • evaluation framework