Software Developer (Agentic Evaluation)

Autodesk · Enterprise · Toronto, ON +4

Software Developer role focused on building and evaluating agentic AI systems for developer productivity, including automated testing and workflow optimization. Requires expertise in Python, ML frameworks, LLMs for software understanding, and AI evaluation methodologies.

What you'd actually do

  1. Develop and orchestrate multi-agent AI systems for automated test generation, test execution, and end-to-end development workflow optimization using frameworks like LangGraph, AutoGen, or the Anthropic Agent SDK (Claude Code)
  2. Design and implement agentic workflows that coordinate multiple AI agents to autonomously drive test automation across UI, API, integration, and system levels, from test case synthesis to result evaluation, while integrating with existing developer tools and MCP-compatible services
  3. Build evaluation frameworks and custom benchmarks for agentic systems, including comparisons of AI agents against commercial solvers, using tools like AgentBench and Langfuse
  4. Evaluate MCP server and tool performance across agentic pipelines, measuring latency, accuracy, context fidelity, and end-to-end task completion rates

Skills

Required

  • Python
  • ML frameworks (PyTorch, Transformers, scikit-learn)
  • Large Language Models applied to software understanding or test generation
  • AI evaluation methodologies and metrics for agentic task completion and test quality
  • Statistical analysis
  • Experimental design

Nice to have

  • Software engineering or QA
  • Test automation frameworks (e.g., Playwright, Selenium, pytest, Appium)
  • CI/CD pipelines
  • Benchmarks that compare AI agents against commercial or domain-specific solvers
  • MCP (Model Context Protocol)
  • LangGraph
  • AutoGen
  • Anthropic Agent SDK / Claude Code
  • Vision-language models or multi-modal AI
  • Azure AI Foundry/ML or AWS cloud ML platforms

What the JD emphasized

  • rigorously evaluate intelligent agentic systems
  • benchmarking AI agents against commercial solvers
  • Evaluate MCP server and tool performance across agentic pipelines
  • Knowledge of AI evaluation methodologies and metrics for agentic task completion and test quality
  • Experience designing benchmarks that compare AI agents against commercial or domain-specific solvers

Other signals

  • AI agents
  • evaluation frameworks
  • developer productivity