Machine Learning Engineer III, Core Agents

Box · Enterprise · Redwood City, CA · Core Platform

This role at Box focuses on building and evaluating foundational AI agents for enterprise workflows, including DeepSearch, DeepResearch, Extract, and Compose. Responsibilities include developing techniques for intent detection, RAG, and multi-agent orchestration, and establishing evaluation frameworks for agent quality and trustworthiness. The role works with platform engineers on scalable deployment and with product teams on translating use cases.

What you'd actually do

  1. Build, evaluate, and evolve foundational agents such as DeepSearch, DeepResearch, Extract, and Compose.
  2. Develop techniques for intent detection, query understanding, ranking, and RAG to improve accuracy and relevance.
  3. Define metrics, evaluation pipelines, and benchmarks for agent quality, including precision/recall, factual grounding, and latency trade-offs.
  4. Research and implement best practices in retrieval, orchestration, and evaluation of multi-agent workflows.
  5. Collaborate with platform engineers to design core components that enable secure, reliable, and scalable deployment of agents.

Skills

Required

  • 3+ years of industry experience building or evaluating ML-powered systems
  • MS or PhD degree in Machine Learning, Computer Science, or a related field
  • Strong background in machine learning, information retrieval, or natural language processing
  • Proficiency with at least one programming language such as Python, Java, or Scala
  • Experience designing, training, and evaluating ML models in production
  • Familiarity with retrieval systems, ranking models, RAG pipelines, or intent classification

Nice to have

  • Advanced degree in computer science, machine learning, or related field
  • Hands-on experience with LangChain, LangGraph, or other agent frameworks
  • Familiarity with LLMs, embeddings, semantic search, indexing, and relevance optimization
  • Experience with cloud-based ML platforms such as Vertex AI, AWS Bedrock, or SageMaker
  • Experience with Kubernetes-based systems for deploying and scaling ML workloads
  • Research or applied experience in evaluation of generative AI systems (factuality, safety, grounding)

What the JD emphasized

  • building and evaluating AI agents that solve enterprise problems
  • designed or evaluated ML systems for search, ranking, RAG, or conversational AI
  • Experience designing, training, and evaluating ML models in production
  • evaluation of generative AI systems (factuality, safety, grounding)

Other signals

  • building foundational agents
  • designing, deploying, and operating AI agents
  • multi-step orchestrations
  • intent detection, ranking, evaluation, RAG, and multi-agent orchestration