Data Domain Architect Senior Associate - Agentic AI Evaluation & Annotation

JPMorgan Chase · Banking · Bengaluru, Karnataka, India · Consumer & Community Banking

This role focuses on designing, executing, and scaling evaluation and annotation programs for agentic AI systems. That includes defining metrics, schemas, and quality frameworks for multi-step reasoning, tool use, and safe agent behavior. It also involves building and maintaining datasets, training annotators, and ensuring evaluation outputs are consistent and actionable, with a strong emphasis on measuring and improving agentic AI performance.

What you'd actually do

  1. Define and operationalize evaluation metrics for agentic AI workflows, including task success, step-level correctness, tool-use quality, policy adherence, recovery behavior, escalation decisions, and safe failure outcomes.
  2. Build and maintain agent-specific gold sets, challenge sets, and regression suites to assess planning quality, action sequencing, grounding, compliance boundaries, hallucination risk, loop detection, and release readiness.
  3. Design annotation schemas, rubrics, taxonomies, and labeling guidelines for evaluating agent trajectories across multi-turn, multi-tool, and workflow-based scenarios.
  4. Develop evaluation approaches for tool-using agents, including tool selection, tool-call precision and recall, argument correctness, response interpretation, and unnecessary or missing tool usage.
  5. Train, calibrate, and support annotators on agentic evaluation tasks, ensuring consistent application of schemas, rubrics, edge-case guidance, and quality expectations.
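
As an illustration of the tool-call precision and recall mentioned in item 4, here is a minimal Python sketch, not a description of how the team actually computes these metrics. It assumes each tool call can be reduced to a hashable (tool name, arguments) pair and that order does not matter. The helper names (`call`, `tool_call_precision_recall`) and the tool names (`get_balance`, `list_transactions`, `send_email`) are hypothetical and do not come from the posting.

```python
from collections import Counter


def call(name, **kwargs):
    """Build a hashable (tool_name, sorted_arguments) record for one tool call."""
    return (name, tuple(sorted(kwargs.items())))


def tool_call_precision_recall(predicted, expected):
    """Score an agent's tool calls against a gold trajectory, ignoring order.

    Matching is done on multisets, so a repeated call is credited only as
    many times as it appears in the gold trajectory.
    """
    matched = sum((Counter(predicted) & Counter(expected)).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical gold trajectory: two calls the agent should have made.
gold = [
    call("get_balance", account_id="A1"),
    call("list_transactions", account_id="A1", days=30),
]

# Hypothetical agent trajectory: one correct call, one duplicate, one unnecessary call.
agent = [
    call("get_balance", account_id="A1"),
    call("get_balance", account_id="A1"),
    call("send_email", to="x"),
]

precision, recall = tool_call_precision_recall(agent, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sketch, low precision points to unnecessary tool usage and low recall points to missing tool usage. Argument correctness is folded into the exact match here; a real rubric would likely score tool selection and arguments separately.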

Skills

Required

  • Master’s degree in Computer Science, Data Science, Computational Linguistics, Human-Computer Interaction, Cognitive Science, AI/ML, or a related field.
  • 3+ years of experience supporting AI evaluation, annotation programs, ML-enabled products, LLM applications, conversational AI, workflow automation, or agentic AI systems.
  • Hands-on experience evaluating LLM-based or agentic systems, including multi-step reasoning, planning quality, tool use, task completion, grounding, or workflow execution.
  • Experience designing annotation schemas, evaluation rubrics, taxonomies, labeling guidelines, or grading standards for complex AI behaviors.
  • Demonstrated ability to define measurable evaluation criteria for agentic workflows, including task success, step correctness, tool-call quality, policy adherence, recovery behavior, and escalation decisions.
  • Experience training and calibrating annotators or reviewers on complex evaluation tasks, including rubric interpretation, edge-case resolution, adjudication, and quality feedback.
  • Experience assessing tool-calling or API-using AI systems, including tool selection, argument accuracy, action sequencing, and interpretation of tool outputs.
  • Working knowledge of agentic AI concepts such as planning, orchestration, tool invocation, context management, memory use, multi-turn execution, loop detection, and human handoff.
  • Practical prompt engineering experience for LLM or agent evaluation workflows, including instruction refinement, evaluator prompts, pre-labeling, and synthetic test case generation.
  • Hands-on Python experience for data analysis, cleaning, validation, automation, and evaluation result processing; experience using Git or similar version control tools.
  • Strong analytical, communication, and documentation skills, with the ability to translate complex agent behavior into observable, measurable, and repeatable evaluation decisions.

Nice to have

  • Experience evaluating autonomous, semi-autonomous, or tool-using agents in production or pre-production environments.
  • Experience building agent evaluation benchmarks, trajectory datasets, gold datasets, challenge sets, regression suites, or release-readiness test sets.
  • Experience evaluating enterprise copilots, workf

What the JD emphasized

  • evaluation frameworks for agentic AI systems
  • defining metrics, schemas, rubrics, and quality frameworks
  • multi-step reasoning, tool use, task completion, policy adherence, and safe agent behavior
  • annotation schemas, rubrics for agent trajectories, train annotators, lead calibration exercises
  • gold and challenge datasets
  • auditable and actionable evaluation outputs
  • evaluating LLM-based or agentic systems
  • tool-calling behavior, planning quality, action sequencing, grounding, error recovery, and human-in-the-loop review workflows
  • measurement and improvement of agentic AI performance
