Senior System Software Architect, HPC and AI Networking

NVIDIA · Semiconductors · Beijing, China

NVIDIA is seeking a Senior System Software Architect to design and prototype scalable software systems for distributed AI training and inference, focusing on optimizing throughput, latency, and memory efficiency. The role involves developing and evaluating communication libraries, collaborating with AI framework teams, co-designing hardware features for AI acceleration, and contributing to runtime systems and protocol layers.

What you'd actually do

Design and prototype scalable software systems that optimize distributed AI training and inference—focusing on throughput, latency, and memory efficiency.
Develop and evaluate enhancements to communication libraries such as NCCL, UCX, and UCC, tailored to the unique demands of deep learning workloads.
Collaborate with AI framework teams (e.g., TensorFlow, PyTorch, JAX) to improve integration, performance, and reliability of communication backends.
Co-design hardware features (e.g., in GPUs, DPUs, or interconnects) that accelerate data movement and enable new capabilities for inference and model serving.
Contribute to the evolution of runtime systems, communication libraries, and AI-specific protocol layers.

Skills

Required

Ph.D., Master's, or Bachelor's degree in computer science, computer engineering, electrical engineering, or a closely related field.
5+ years of experience in DNNs, scaling of DNNs, parallelism in DNN frameworks, or deep learning training workloads.
Deep understanding of inference and training workloads and optimizations, such as prefill/decode, data parallelism, tensor parallelism, FSDP, etc.
Experience with AI network parallelism using collective libraries and RDMA/RoCE.
Background in algorithm design, system programming, and computer architecture.
Strong programming and software development skills.
Ability and flexibility to work and communicate effectively in a multi-national, multi-time-zone corporate environment.

Nice to have

Deep understanding of technology and passion for what you do.
Strong collaborative and interpersonal skills, specifically a proven ability to effectively guide and influence within a dynamic matrix environment.
Background in designing communication middleware for high-performance computing systems, including RoCE and DPUs.
Background in CUDA programming, NVIDIA GPUs, and programming models for emerging architectures.

What the JD emphasized

HPC and AI Inference Software Architect
distributed AI training
real-time inference
communication optimization
scalable AI infrastructure
Inference and Training workloads and optimizations
AI network parallelism

Other signals

distributed AI training
real-time inference
communication optimization
scalable AI infrastructure

Read full job description

NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, we lead in artificial intelligence, driving advances in natural language processing, computer vision, autonomous systems, and scientific research. We are looking for a forward-thinking HPC and AI Inference Software Architect to help shape the future of scalable AI infrastructure—focusing on distributed training, real-time inference, and communication optimization across large-scale systems.

Join our world-class team of researchers and engineers building next-generation software and hardware systems that power the most demanding AI workloads on the planet.

What you will be doing:

Design and prototype scalable software systems that optimize distributed AI training and inference—focusing on throughput, latency, and memory efficiency.
Develop and evaluate enhancements to communication libraries such as NCCL, UCX, and UCC, tailored to the unique demands of deep learning workloads.
Collaborate with AI framework teams (e.g., TensorFlow, PyTorch, JAX) to improve integration, performance, and reliability of communication backends.
Co-design hardware features (e.g., in GPUs, DPUs, or interconnects) that accelerate data movement and enable new capabilities for inference and model serving.
Contribute to the evolution of runtime systems, communication libraries, and AI-specific protocol layers.
Collaborate with customers to understand their needs and provide innovative solutions for them.

What we need to see:

Ph.D., Master's, or Bachelor's degree in computer science, computer engineering, electrical engineering, or a closely related field.
5+ years of experience in DNNs, scaling of DNNs, parallelism in DNN frameworks, or deep learning training workloads.
Deep understanding of inference and training workloads and optimizations, such as prefill/decode, data parallelism, tensor parallelism, FSDP, etc.
Experience with AI network parallelism using collective libraries and RDMA/RoCE.
Background in algorithm design, system programming, and computer architecture.
Strong programming and software development skills.
Ability and flexibility to work and communicate effectively in a multi-national, multi-time-zone corporate environment.

Ways to stand out from the crowd:

Deep understanding of technology and passion for what you do.
Strong collaborative and interpersonal skills, specifically a proven ability to effectively guide and influence within a dynamic matrix environment.
Background in designing communication middleware for high-performance computing systems, including RoCE and DPUs.
Background in CUDA programming, NVIDIA GPUs, and programming models for emerging architectures.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

#LI-Hybrid
