Senior Software Engineer, Applied AI

NVIDIA · Semiconductors · Munich, Germany

A Senior Software Engineer, Applied AI Systems role focused on building production AI/ML and agentic solutions. Responsibilities include developing agents, workflow services, APIs, data pipelines, tool integrations, evaluation harnesses, and operational tooling. The role requires strong Python skills and experience with LLMs, RAG, agentic AI, distributed systems, and system design. It emphasizes turning ambiguous problems into durable software systems and shaping how production applied AI systems are built and measured.

What you'd actually do

  1. Build and own production-grade applied AI systems for NVIDIA’s technical and solution development use cases, including agentic solutions where they materially improve the systems and software.
  2. Design and build agentic workflows and the software around them: workflow services, APIs, retrieval, MCP/A2A-style tool integrations, agent harnesses, automation, telemetry, operational controls, and human oversight.
  3. Design reliable services, APIs, workflow state, event-driven execution, and observability using systems such as Kafka, ClickHouse, and OTel-style patterns.
  4. Translate complex technical and operational requirements into clear system designs, plans, interfaces, measurable outcomes, and pragmatic technical decisions through design reviews, code reviews, and clear communication.
  5. Develop production software in Python and other relevant languages, with strong testing, observability, CI/CD, documentation, and operational practices.

Skills

Required

  • BS, MS, or PhD in Computer Science, Engineering, AI/ML, or equivalent experience
  • 5+ years of professional software engineering experience owning production systems or meaningful platform components
  • Hands-on experience with LLM, generative AI, RAG, agentic AI, MCP or intelligent AI technologies beyond simple prompting or notebooks, including tool use, retrieval, evaluation, guardrails, orchestration, or human-in-the-loop control
  • Strong Python engineering skills, plus practical experience with at least one additional production programming language such as C++, Go, Rust, or TypeScript
  • Demonstrated ability to develop and build distributed systems, backend services, data pipelines, workflow orchestration, APIs, or developer platforms in production environments built on technologies such as Kafka, ClickHouse, PostgreSQL, Redis, object storage, Kubernetes, or similar
  • Strong system design and operational judgment, including reliability, latency, cost, security, privacy, scalability, debuggability, maintainability, performance analysis, benchmarking, profiling, or capacity evaluation
  • Excellent debugging and problem-solving skills across software, infrastructure, AI systems, and performance bottlenecks
  • Proven ownership of ambiguous, cross-team engineering work, with ability to collaborate with distributed teams spanning US Pacific, EMEA, and APAC timezones
  • Strong written and verbal communication skills in English

Nice to have

  • Experience building real-world AI implementations, agent tools, MCP-compatible modules, A2A-style bridges, agent frameworks, evaluation frameworks, or RAG systems used by real users
  • Familiarity with NVIDIA GPU and AI software technologies such as NVIDIA NIM, NeMo Agent Toolkit, and CUDA, and with agentic AI development frameworks
  • Open-source contributions, technical papers, patents, conference talks, engineering blogs, or major internal engineering artifacts

What the JD emphasized

  • production AI / ML and agentic solutions
  • hands-on senior engineer
  • turn ambiguous technical problems into durable software systems
  • build AI systems as real software systems
  • shape how production applied AI systems are built, measured, and reused
  • focus on reusable software capability rather than one-off delivery
  • drives execution across teams
  • production-grade applied AI systems
  • agentic solutions
  • agentic workflows
  • tool integrations
  • human oversight
  • reliable services
  • event-driven execution
  • observability
  • production software
  • strong testing
  • observability
  • CI/CD
  • documentation
  • operational practices
  • performance and benchmarking workflows
  • validation harnesses
  • regression tests
  • tracing
  • metrics
  • failure analysis
  • latency
  • throughput
  • reliability
  • resource usage
  • AI/inference behavior
  • standard solution patterns
  • codify repeated patterns
  • product gaps
  • field lessons
  • APIs
  • services
  • reference architectures
  • playbooks
  • test harnesses
  • shared engineering building blocks
  • debug and support production solutions
  • software
  • infrastructure
  • AI models
  • data pipelines
  • inference services
  • GPU-accelerated environments
  • recurring support patterns
  • product or platform improvements
  • 5+ years of professional software engineering experience owning production systems or meaningful platform components
  • Hands-on experience with LLM, generative AI, RAG, agentic AI, MCP or intelligent AI technologies beyond simple prompting or notebooks
  • tool use
  • retrieval
  • evaluation
  • guardrails
  • orchestration
  • human-in-the-loop control
  • Strong Python engineering skills
  • practical experience with at least one additional production programming language
  • Demonstrated ability to develop and build distributed systems
  • backend services
  • data pipelines
  • workflow orchestration
  • APIs
  • developer platforms
  • production environments
  • Strong system design and operational judgment
  • reliability
  • latency
  • cost
  • security
  • privacy
  • scalability
  • debuggability
  • maintainability
  • performance analysis
  • benchmarking
  • profiling
  • capacity evaluation
  • Excellent debugging and problem-solving skills
  • software
  • infrastructure
  • AI systems
  • performance bottlenecks
  • Proven ownership of ambiguous, cross-team engineering work
  • collaborate with distributed teams
  • Strong written and verbal communication skills in English
  • Experience building real-world AI implementations
  • agent tools
  • MCP-compatible modules
  • A2A-style bridges
  • agent frameworks
  • evaluation frameworks
  • RAG systems used by real users

Other signals

  • building production AI systems
  • agentic workflows
  • software engineering
  • distributed systems
  • performance engineering