Senior System Software Engineer

NVIDIA · Semiconductors · Pune, India +1

Senior System Software Engineer at NVIDIA, focusing on building and scaling cloud-native infrastructure, microservices, and distributed systems. The role involves architecting reliable, performant, and scalable cloud solutions, optimizing performance and cost, and leading the design and development of next-generation systems. Responsibilities include job orchestration, resource optimization, self-healing infrastructure, building observability solutions, and leveraging data analytics for system improvement. Requires deep expertise in Kubernetes, public cloud platforms (AWS, Azure, GCP), microservices architecture, and databases, with a strong track record in software engineering and delivering enterprise-grade cloud solutions.

What you'd actually do

Spearhead innovation to architect and deliver highly reliable, performant, and scalable cloud-native systems.
Lead the design and development of next-generation microservices and distributed systems with a strong emphasis on performance optimization and cost efficiency.
Define and evolve system architecture strategies, ensuring alignment with long-term business and technical goals.
Tackle complex challenges in job orchestration, resource optimization, and self-healing infrastructure with a focus on automation and resilience.
Build and scale end-to-end observability solutions including metrics pipelines, alerting frameworks, and telemetry storage.

Skills

Required

building and scaling large-scale cloud infrastructure platforms
software engineering
delivering enterprise-grade cloud solutions
microservices architecture
designing and developing scalable, distributed systems
public cloud platforms (AWS, Azure, GCP)
scaling infrastructure
Kubernetes expertise
container orchestration
cloud-native tooling
SQL (e.g., MySQL)
NoSQL (e.g., Elasticsearch)
scalable storage systems
Web Services (SOAP/REST)
messaging systems like Kafka
CI/CD tools (Jenkins, Git, Perforce)
debugging
problem-solving
communication skills
technical leadership
collaboration

Nice to have

architect and deliver highly reliable, performant, and scalable cloud-native systems
performance optimization
cost efficiency
system architecture strategies
job orchestration
resource optimization
self-healing infrastructure
automation and resilience
end-to-end observability solutions
metrics pipelines
alerting frameworks
telemetry storage
data analytics
predictive modeling
mentorship
product, infrastructure, and operations groups
engineering excellence
continuous improvement
massively scalable systems
thousands to millions of jobs and servers
deconstruct complex systems into modular, scalable components with measurable outcomes
scale systems to handle millions of concurrent jobs and global workloads
optimizing cloud infrastructure for performance, reliability, and cost
guide and influence within a dynamic environment
push the boundaries of system performance and reliability

What the JD emphasized

highly reliable, performant, and scalable cloud-native systems
next-generation microservices and distributed systems
performance optimization
cost efficiency
system architecture strategies
job orchestration
resource optimization
self-healing infrastructure
automation and resilience
end-to-end observability solutions
metrics pipelines
alerting frameworks
telemetry storage
data analytics
predictive modeling
technical leadership
mentorship
product, infrastructure, and operations groups
engineering excellence
continuous improvement
massively scalable systems
thousands to millions of jobs and servers
Kubernetes
public cloud platforms (AWS, Azure, GCP)
building and scaling large-scale cloud infrastructure platforms
10+ years of proven experience in software engineering
delivering enterprise-grade cloud solutions
Deep expertise in microservices architecture
hands-on experience designing and developing scalable, distributed systems
Extensive experience with public cloud platforms (AWS, Azure, GCP)
scaling infrastructure to support thousands to millions of jobs and servers
Strong Kubernetes expertise
container orchestration
cloud-native tooling for deployment, monitoring, and management
Proficiency in both SQL (e.g., MySQL) and NoSQL (e.g., Elasticsearch) databases
scalable storage systems
Web Services (SOAP/REST)
messaging systems like Kafka
CI/CD tools such as Jenkins, Git, and Perforce
Excellent debugging, problem-solving, and communication skills
lead and collaborate effectively
globally distributed, multi-time-zone environment
deconstruct complex systems into modular, scalable components with measurable outcomes
scale systems to handle millions of concurrent jobs and global workloads
optimizing cloud infrastructure for performance, reliability, and cost
Solid collaborative and interpersonal skills
effectively guide and influence within a dynamic environment
Relentless drive to push the boundaries of system performance and reliability

Full job description

We are now looking for a Senior System Software Engineer. NVIDIA is the leading artificial intelligence computing company and is paving the way with innovations in self-driving cars, machine learning, supercomputing, gaming, and visualization. NVIDIA gives automakers, research institutions, cloud providers, large companies, and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems. We are an enthusiastic and dedicated team at the forefront of the latest science and technology trends. Working together, we provide a private on-site cloud solution that enables the rest of the organization to quickly release high-quality software. Are you passionate about infrastructure and looking for complex and challenging issues? Are you ready to build the next generation of cloud services and design innovative solutions that address the needs of a whole organization? Then we are excited to have a motivated person like you!

What you'll be doing:

Spearhead innovation to architect and deliver highly reliable, performant, and scalable cloud-native systems.
Lead the design and development of next-generation microservices and distributed systems with a strong emphasis on performance optimization and cost efficiency.
Define and evolve system architecture strategies, ensuring alignment with long-term business and technical goals.
Tackle complex challenges in job orchestration, resource optimization, and self-healing infrastructure with a focus on automation and resilience.
Build and scale end-to-end observability solutions including metrics pipelines, alerting frameworks, and telemetry storage.
Leverage data analytics and predictive modeling to proactively improve system behavior and reliability.
Provide technical leadership and mentorship across teams while collaborating cross-functionally with product, infrastructure, and operations groups to drive strategic initiatives and foster a culture of engineering excellence and continuous improvement.
Design and operate massively scalable systems—handling thousands to millions of jobs and servers—using deep expertise in Kubernetes and public cloud platforms (AWS, Azure, GCP).

What we need to see:

Demonstrated experience in building and scaling large-scale cloud infrastructure platforms.
10+ years of proven experience in software engineering with a strong track record of delivering enterprise-grade cloud solutions; BS/MS/Ph.D. in Computer Science, Computer Engineering, or equivalent experience.
Deep expertise in microservices architecture, with hands-on experience designing and developing scalable, distributed systems.
Extensive experience with public cloud platforms (AWS, Azure, GCP), including scaling infrastructure to support thousands to millions of jobs and servers.
Strong Kubernetes expertise, including container orchestration and cloud-native tooling for deployment, monitoring, and management.
Proficiency in both SQL (e.g., MySQL) and NoSQL (e.g., Elasticsearch) databases, with a solid understanding of scalable storage systems.
Hands-on experience with Web Services (SOAP/REST), messaging systems like Kafka, and CI/CD tools such as Jenkins, Git, and Perforce.
Excellent debugging, problem-solving, and communication skills, with the ability to lead and collaborate effectively in a globally distributed, multi-time-zone environment.

Ways to stand out from the crowd:

Proven ability to deconstruct complex systems into modular, scalable components with measurable outcomes, and to scale systems to handle millions of concurrent jobs and global workloads.
Expertise in optimizing cloud infrastructure for performance, reliability, and cost.
Solid collaborative and interpersonal skills, specifically a proven ability to effectively guide and influence within a dynamic environment.
Relentless drive to push the boundaries of system performance and reliability.

We are an equal opportunity employer and value diversity at our company. We do not discriminate based on race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform crucial job functions, and to receive other benefits and privileges of employment.
