What you'd actually do

Design, build, and operate scalable infrastructure for training Cosmos and supporting large-scale data pipelines

Develop high-throughput systems for data processing, retrieval, and workflow orchestration

Collaborate across research, optimization, and platform teams to accelerate experiments and deployments

Improve system reliability, performance, and observability across distributed compute environments

Contribute to long-term infrastructure strategy for training, data management, and large-scale compute efficiency

Skills

Required

Master's degree in Computer Science, Computer Engineering, or a related STEM field, or equivalent experience
6 years of relevant work experience
Python
At least one systems language (e.g., C++, Go, or Rust)
distributed systems
ML infrastructure
large-scale compute/data platforms
orchestration systems
scheduling
scalable storage
data pipelines
bridging research workflows and production-grade systems

Nice to have

Experience building or optimizing infrastructure for large-scale model training
Hands-on work with distributed compute environments or high-performance systems
Familiarity with synthetic data, simulation pipelines, or large multimodal datasets
Contributions to open-source infrastructure or large-scale internal tooling

We are now looking for a Senior Software Engineer for Generative AI Research! At NVIDIA, we believe the next generation of AI will be physical AI – systems that perceive, reason, and act in the real world. Building these models requires robust systems that span large-scale compute, multimodal datasets, simulation-driven synthetic data, and real-time reasoning for robots and autonomous systems.

Our Cosmos infrastructure team sits at the heart of this mission. We build the systems that make it possible to train Cosmos, NVIDIA’s world foundation model for physical AI. Cosmos enables large-scale AI models for robots, autonomous agents, and AI systems to understand, plan, and act in complex environments. Our team develops the Cosmos platform infrastructure that powers model training, data pipelines, simulation, and deployment at scale, enabling research and production to move faster and more efficiently than ever before. This role is a unique opportunity to work on infrastructure that directly enables physical AI at scale – from optimizing massive data pipelines to designing training workflows that support foundation models, and from scaling distributed compute systems to building the backbone for simulation-driven experimentation.

What You’ll Be Doing:

Design, build, and operate scalable infrastructure for training Cosmos and supporting large-scale data pipelines
Develop high-throughput systems for data processing, retrieval, and workflow orchestration
Collaborate across research, optimization, and platform teams to accelerate experiments and deployments
Improve system reliability, performance, and observability across distributed compute environments
Contribute to long-term infrastructure strategy for training, data management, and large-scale compute efficiency

What We Need to See:

A Master's degree in Computer Science, Computer Engineering, or a related STEM field, or equivalent experience
Strong engineering background in distributed systems, ML infrastructure, or large-scale compute/data platforms with 6 years of relevant work experience
Proficiency in Python and at least one systems language (e.g., C++/Go/Rust)
Experience with orchestration systems, scheduling, and scalable storage or data pipelines
Ability to work across teams, drive technical clarity, and deliver robust solutions in complex environments
Comfort bridging research workflows and production-grade systems

Ways to Stand Out from the Crowd:

Experience building or optimizing infrastructure for large-scale model training
Hands-on work with distributed compute environments or high-performance systems
Familiarity with synthetic data, simulation pipelines, or large multimodal datasets
Contributions to open-source infrastructure or large-scale internal tooling

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 7, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.