ML and Agentic Systems Engineer

NVIDIA · Semiconductors · Santa Clara, CA

NVIDIA's Cosmos team is seeking an ML and Agentic Systems Engineer to build AI-native systems and agentic workflows across the ML lifecycle. The role focuses on creating the meta-layer for ML development: the tooling that lets AI agents interact with code, data, experiments, and evaluations so that ML development moves faster. Responsibilities include designing agentic workflows, building AI-native systems, creating self-improving loops, owning large-scale Python/PyTorch codebases, and scaling evaluation platforms.

What you'd actually do

  1. Design and implement agentic workflows across the ML lifecycle, including data generation and curation, evaluation, debugging, training orchestration, and iteration.
  2. Build AI-native systems in which models and agents can interact with codebases, tools, experiments, and environments to improve developer and researcher productivity.
  3. Create self-improving loops where agents help generate data, surface failures, evaluate outputs, and drive better decisions across the system.
  4. Own and evolve large-scale Python and PyTorch codebases, turning fast-moving ideas into robust, modular, reusable software.
  5. Design and scale evaluation platforms that combine automated metrics, human feedback, and agent-driven analysis.

Skills

Required

  • Python
  • PyTorch
  • Machine Learning Systems
  • Software Platforms
  • Pipelines
  • Evaluation Systems
  • Developer Tooling
  • Workflow Automation
  • System Design
  • Testing
  • Packaging
  • Debugging
  • Collaborative Codebase Evolution
  • Agentic LLM Systems
  • Tool Use
  • Planning
  • Multi-step Workflows
  • Code Agents
  • Automation over Data and Experiments

Nice to have

  • Agent-based systems (coding, evaluation, data generation, triage, experimentation, orchestration)
  • Open-source ML contributions
  • Open-source Python contributions
  • Open-source developer tooling contributions
  • Context compression
  • Agent memory techniques
  • Agent safety
  • Agent identity (AuthN, AuthZ, IAM)

What the JD emphasized

  • Significant experience building machine learning systems and software platforms, not only models.
  • Expert-level Python skills.
  • Deep familiarity with PyTorch.
  • Experience building pipelines, evaluation systems, developer tooling, or workflow automation for ML at meaningful scale.
  • Strong experience with agentic LLM-based systems, such as tool use, planning, multi-step workflows, code agents, or automation over data and experiments.
  • You have built agent-based systems that do real work: coding, evaluation, data generation, triage, experimentation, or orchestration.

Other signals

  • Building agentic systems
  • AI-native software engineering
  • Meta-layer of modern ML