Senior AI Platform Engineer

NVIDIA · Semiconductors · Yokneam, Israel

NVIDIA is seeking a Senior AI Platform Engineer to design and build a core platform for autonomous AI agents, including their reasoning engine, tool orchestration, and runtime infrastructure. The role involves developing microservices, owning features end-to-end, and ensuring platform reliability through observability and quality tracking. The platform aims to accelerate engineering development processes using AI agentic workflows.

What you'd actually do

  1. Design and build the core platform that powers an autonomous AI agent — including its reasoning engine, tool orchestration, and the runtime infrastructure it operates on
  2. Develop and evolve the microservices ecosystem that gives the agent its capabilities, from knowledge retrieval and log analysis to code execution and workflow automation
  3. Own features end-to-end: from requirements analysis and architecture, through implementation, to production deployment and iteration based on real usage
  4. Instrument, evaluate, and improve the platform's reliability — build observability, track quality, and feed signals back into the system to make the agent more effective over time
  5. Collaborate with engineering teams across the organization to identify high-impact workflows and translate them into AI-assisted automation that boosts developer productivity

Skills

Required

  • B.Sc. in Computer Science, Computer Engineering, or a related field
  • 5+ years of relevant experience
  • Solid system-level understanding with experience designing and delivering production services
  • Ability to architect solutions, guide AI tools effectively, and reason about system behavior end-to-end
  • Familiarity with containerization and orchestration (Docker, Kubernetes)
  • Understanding of REST APIs, microservice architectures, and distributed systems
  • Ability to learn complex concepts in a fast-paced environment

Nice to have

  • Familiarity with Kubernetes operators, Helm charts, and cluster management
  • Experience with LLM application development — prompt engineering, agentic frameworks (ReAct, tool-use), or RAG pipelines
  • Hands-on experience with FastAPI, async Python, or similar modern Python web frameworks
  • Experience with vector databases, semantic search, or embedding models
  • Knowledge of the OAS (OpenAPI Specification), MCP (Model Context Protocol), and A2A (Agent-to-Agent) protocol ecosystems

What the JD emphasized

  • core platform
  • autonomous AI agent
  • reasoning engine
  • tool orchestration
  • runtime infrastructure
  • microservices ecosystem
  • knowledge retrieval
  • log analysis
  • code execution
  • workflow automation
  • production deployment
  • real usage
  • observability
  • track quality
  • agent more effective
  • high-impact workflows
  • AI-assisted automation
  • developer productivity
  • Python services
  • Kubernetes infrastructure
  • data stores
  • CI/CD pipelines
  • developer-facing tools
  • production services
  • guide AI tools effectively
  • reason about system behavior end-to-end
  • containerization and orchestration
  • Docker
  • Kubernetes
  • REST APIs
  • microservice architectures
  • distributed systems
  • complex concepts
  • fast-paced environment
  • Kubernetes operators
  • Helm charts
  • cluster management
  • LLM application development
  • prompt engineering
  • agentic frameworks
  • ReAct
  • tool-use
  • RAG pipelines
  • FastAPI
  • async Python
  • modern Python web frameworks
  • vector databases
  • semantic search
  • embedding models
  • OAS
  • OpenAPI Specification
  • MCP
  • Model Context Protocol
  • A2A
  • Agent-to-Agent protocol ecosystem

Other signals

  • AI engineering platform
  • AI agentic workflows
  • autonomous, long-lived agentic workflows