Senior Infrastructure Software Systems Engineer

NVIDIA · Semiconductors · Bangalore, India

Senior Software Engineer role focused on building and extending scalable, high-performance core infrastructure systems and distributed workflow platforms for NVIDIA's chip-design ecosystem. The role involves designing and optimizing distributed systems for orchestrating workloads, defining system architecture, and owning systems end-to-end. Requires strong experience in distributed systems, algorithms, concurrency, and programming languages like Python, C++, or Go.

What you'd actually do

Build and extend scalable, high-performance core infrastructure systems and workflow platforms that improve reliability and developer productivity across NVIDIA’s chip-design ecosystem.
Design and optimize distributed systems that orchestrate millions of regression and validation workloads across heterogeneous compute environments.
Design systems that coordinate dependency-aware execution across large-scale compute clusters.
Define system architecture including APIs, data models, execution models, and scaling strategies.
Own systems end-to-end, from gathering requirements and proposing technical designs to implementation, performance analysis, testing, and deployment.

Skills

Required

BS or MS in Computer Science or a related field (or equivalent experience)
9+ years of professional software development experience
Strong foundation in data structures, algorithms, concurrency, and distributed system design
Demonstrated experience designing and building distributed systems from first principles — including defining APIs, data models, execution flows, and scaling approaches
Experience owning systems through design, implementation, and evolution — including handling trade-offs, failure modes, and system limitations
Experience working on systems involving scheduling, dependency resolution, or large-scale job orchestration
Proficiency in modern programming languages (Python, C++, Go, or similar) on Linux systems, with experience building large-scale systems or infrastructure software
Ability to clearly articulate architecture decisions, trade-offs, and how systems evolved over time

Nice to have

Experience improving developer productivity through infrastructure or platform design
Hands-on familiarity with profiling, tracing, or performance-optimization techniques
Understanding of chip-design, verification, or modern ML workflows

What the JD emphasized

building new infrastructure systems
distributed workflow platforms
scale the infrastructure
core infrastructure systems
workflow platforms
distributed systems
dependency-aware execution
large-scale compute clusters
system architecture
scaling strategies
systems end-to-end
system performance
large-scale systems
infrastructure software
developer productivity through infrastructure or platform design

As a Software Engineer in NVIDIA’s Internal Infrastructure Group, you’ll design and build distributed systems that power the workflows behind our next generation of GPUs and AI chips. The software you create will help thousands of engineers develop world-changing technology faster, more efficiently, and at scale. You’ll help scale the infrastructure that validates the world’s most advanced GPUs.

This role focuses on building new infrastructure systems and distributed workflow platforms.

What you'll be doing:

Build and extend scalable, high-performance core infrastructure systems and workflow platforms that improve reliability and developer productivity across NVIDIA’s chip-design ecosystem.
Design and optimize distributed systems that orchestrate millions of regression and validation workloads across heterogeneous compute environments.
Design systems that coordinate dependency-aware execution across large-scale compute clusters.
Define system architecture including APIs, data models, execution models, and scaling strategies.
Own systems end-to-end, from gathering requirements and proposing technical designs to implementation, performance analysis, testing, and deployment.
Collaborate with internal teams to understand workflows, identify bottlenecks, and deliver automation that accelerates engineering workflows.
Analyze and tune system performance across distributed services using profiling, tracing, and telemetry.

What we need to see:

BS or MS in Computer Science or a related field (or equivalent experience).
9+ years of professional software development experience.
Strong foundation in data structures, algorithms, concurrency, and distributed system design.
Demonstrated experience designing and building distributed systems from first principles — including defining APIs, data models, execution flows, and scaling approaches.
Experience owning systems through design, implementation, and evolution — including handling trade-offs, failure modes, and system limitations.
Experience working on systems involving scheduling, dependency resolution, or large-scale job orchestration.
Proficiency in modern programming languages (Python, C++, Go, or similar) on Linux systems, with experience building large-scale systems or infrastructure software.
Ability to clearly articulate architecture decisions, trade-offs, and how systems evolved over time.

Ways to stand out from the crowd:

Experience improving developer productivity through infrastructure or platform design.
Hands-on familiarity with profiling, tracing, or performance-optimization techniques.
Understanding of chip-design, verification, or modern ML workflows.

NVIDIA is widely recognized as one of the most desirable employers in technology. We attract some of the world’s most forward-thinking and dedicated engineers. If you’re driven by curiosity, care deeply about performance and reliability, and love building systems that empower other developers, we want to meet you.

#LI-Hybrid
