Senior System Software Engineer

NVIDIA · Semiconductors · Pune, India +1

NVIDIA is hiring a Senior System Software Engineer to build and scale cloud-native infrastructure, microservices, and distributed systems. The role involves architecting reliable, performant, and scalable cloud solutions, optimizing performance and cost, and leading the design and development of next-generation systems. Responsibilities include job orchestration, resource optimization, self-healing infrastructure, building observability solutions, and using data analytics to improve systems. The role requires deep expertise in Kubernetes, public cloud platforms (AWS, Azure, GCP), microservices architecture, and databases, along with 10+ years of proven software engineering experience and a track record of delivering enterprise-grade cloud solutions.

What you'd actually do

  1. Spearhead innovation to architect and deliver highly reliable, performant, and scalable cloud-native systems.
  2. Lead the design and development of next-generation microservices and distributed systems with a strong emphasis on performance optimization and cost efficiency.
  3. Define and evolve system architecture strategies, ensuring alignment with long-term business and technical goals.
  4. Tackle complex challenges in job orchestration, resource optimization, and self-healing infrastructure with a focus on automation and resilience.
  5. Build and scale end-to-end observability solutions including metrics pipelines, alerting frameworks, and telemetry storage.

Skills

Required

  • building and scaling large-scale cloud infrastructure platforms
  • software engineering
  • delivering enterprise-grade cloud solutions
  • microservices architecture
  • designing and developing scalable, distributed systems
  • public cloud platforms (AWS, Azure, GCP)
  • scaling infrastructure
  • Kubernetes expertise
  • container orchestration
  • cloud-native tooling
  • SQL (e.g., MySQL)
  • NoSQL (e.g., Elasticsearch)
  • scalable storage systems
  • Web Services (SOAP/REST)
  • messaging systems like Kafka
  • CI/CD tools (Jenkins, Git, Perforce)
  • debugging
  • problem-solving
  • communication skills
  • technical leadership
  • collaboration

Nice to have

  • architecting and delivering highly reliable, performant, and scalable cloud-native systems
  • performance optimization
  • cost efficiency
  • system architecture strategies
  • job orchestration
  • resource optimization
  • self-healing infrastructure
  • automation and resilience
  • end-to-end observability solutions
  • metrics pipelines
  • alerting frameworks
  • telemetry storage
  • data analytics
  • predictive modeling
  • mentorship
  • product, infrastructure, and operations groups
  • engineering excellence
  • continuous improvement
  • massively scalable systems
  • thousands to millions of jobs and servers
  • deconstructing complex systems into modular, scalable components with measurable outcomes
  • scaling systems to handle millions of concurrent jobs and global workloads
  • optimizing cloud infrastructure for performance, reliability, and cost
  • guiding and influencing within a dynamic environment
  • pushing the boundaries of system performance and reliability

What the JD emphasized

  • highly reliable, performant, and scalable cloud-native systems
  • next-generation microservices and distributed systems
  • performance optimization
  • cost efficiency
  • system architecture strategies
  • job orchestration
  • resource optimization
  • self-healing infrastructure
  • automation and resilience
  • end-to-end observability solutions
  • metrics pipelines
  • alerting frameworks
  • telemetry storage
  • data analytics
  • predictive modeling
  • technical leadership
  • mentorship
  • product, infrastructure, and operations groups
  • engineering excellence
  • continuous improvement
  • massively scalable systems
  • thousands to millions of jobs and servers
  • Kubernetes
  • public cloud platforms (AWS, Azure, GCP)
  • building and scaling large-scale cloud infrastructure platforms
  • 10+ years of proven experience in software engineering
  • delivering enterprise-grade cloud solutions
  • Deep expertise in microservices architecture
  • hands-on experience designing and developing scalable, distributed systems
  • Extensive experience with public cloud platforms (AWS, Azure, GCP)
  • scaling infrastructure to support thousands to millions of jobs and servers
  • Strong Kubernetes expertise
  • container orchestration
  • cloud-native tooling for deployment, monitoring, and management
  • Proficiency in both SQL (e.g., MySQL) and NoSQL (e.g., Elasticsearch) databases
  • scalable storage systems
  • Web Services (SOAP/REST)
  • messaging systems like Kafka
  • CI/CD tools such as Jenkins, Git, and Perforce
  • Excellent debugging, problem-solving, and communication skills
  • lead and collaborate effectively
  • globally distributed, multi-time-zone environment
  • deconstruct complex systems into modular, scalable components with measurable outcomes
  • scale systems to handle millions of concurrent jobs and global workloads
  • optimizing cloud infrastructure for performance, reliability, and cost
  • Solid collaborative and interpersonal skills
  • effectively guide and influence within a dynamic environment
  • Relentless drive to push the boundaries of system performance and reliability