Senior Software Research Architect, AI Networking

NVIDIA · Semiconductors · Tel Aviv, Israel

NVIDIA is seeking a Senior Software Research Architect to improve the framework for large-scale LLM training and inference. This role focuses on designing and optimizing systems for generative AI workloads on advanced GPU clusters, specifically leveraging the NVIDIA Spectrum-X Networking Platform to define deployment and scaling strategies. The architect will work on inter-node communication, compute scheduling, and system-level optimization, collaborating with engineers and researchers to enable generative AI technologies in real-world applications.

What you'd actually do

  1. Lead research and development of end-to-end networking solutions for distributed AI training and inference at scale, with a focus on job completion time, failure resiliency, telemetry, scheduling, and placement.
  2. Analyze current deployments, develop prototypes, and recommend architectural improvements.
  3. Stay abreast of the latest research; become the team’s authority in emerging networking techniques and technologies.
  4. Design, simulate, and validate new systems using NSX, a novel, scalable network simulator.
  5. Develop and test prototypes on large-scale GPU clusters (e.g., Israel-1).

Skills

Required

  • M.Sc. or PhD (preferred) in Computer Science, Electrical/Computer Engineering, or a related field; alternatively, a B.Sc. with research experience and publications.
  • 5+ years of relevant experience.
  • Deep expertise in networking and communication internals (NCCL, RDMA, congestion control, routing).
  • Strong software engineering skills in C++ and/or Python.
  • Excellent system-level design and problem-solving abilities.
  • Outstanding communication and collaboration skills across technical domains.

Nice to have

  • Proven passion for solving sophisticated technical problems and delivering impactful solutions.
  • Record of publications in top-tier conferences.
  • Experience in designing and building large-scale AI training clusters.
  • Post-PhD research experience.
  • Practical understanding of deep learning systems, GPU acceleration, and AI model execution flows.

What the JD emphasized

  • Deep expertise in networking and communication internals (NCCL, RDMA, congestion control, routing).
  • Strong software engineering skills in C++ and/or Python.
  • Excellent system-level design and problem-solving abilities.
  • Record of publications in top-tier conferences.

Other signals

  • distributed AI training and inference at scale
  • GPU clusters
  • generative AI workloads
  • NVIDIA Spectrum-X Networking Platform