Senior Software Architect, AI Networking

NVIDIA · Semiconductors · Tel Aviv, Israel +1

NVIDIA is looking for a Senior Software Architect to design and optimize inference infrastructure for large language models running on GPU clusters. The role involves working across software and hardware domains to define deployment and scaling strategies, optimize latency and throughput, and collaborate with various teams to ensure high-performance solutions.

What you'd actually do

  1. Design and evolve scalable architectures for multi-node LLM inference across GPU clusters.
  2. Develop infrastructure to optimize latency, throughput, and cost-efficiency of serving large models in production.
  3. Collaborate with model, systems, compiler, and networking teams to ensure holistic, high-performance solutions.
  4. Prototype novel approaches to KV cache handling, tensor/pipeline parallel execution, and dynamic batching.
  5. Evaluate and integrate new software and hardware technologies relevant to core Spectrum-X technologies, such as load balancing, telemetry, congestion control, and vertical application integration.

Skills

Required

  • C++
  • Python
  • CUDA
  • distributed systems
  • performance optimization
  • deep learning systems
  • GPU acceleration
  • AI model execution flows
  • high performance networking
  • system-level thinking
  • memory management
  • networking
  • scheduling
  • compute orchestration

Nice to have

  • LLM training pipelines
  • LLM inference pipelines
  • transformer model optimization
  • model-parallel deployments
  • profiling
  • AI Accelerators
  • distributed communication patterns
  • congestion control
  • load balancing

What the JD emphasized

  • 8+ years of experience building large-scale distributed systems or performance-critical software.
  • Deep understanding of deep learning systems, GPU acceleration, and AI model execution flows, and/or high-performance networking.
  • Solid software engineering skills in C++ and/or Python, preferably with strong familiarity with CUDA or similar platforms.
  • Strong system-level thinking across memory, networking, scheduling, and compute orchestration.
  • Experience working on LLM training or inference pipelines, transformer model optimization, or model-parallel deployments.
  • Demonstrated success in profiling and optimizing performance bottlenecks across the LLM training or inference stack.
  • Background in AI accelerators and distributed communication patterns, congestion control, and/or load balancing.
  • Proven process for optimizing complex systems deployed at scale, with demonstrated impact.

Other signals

  • LLM inference at scale
  • GPU clusters
  • system-level optimizations