Senior System Software Engineer - GitHub

NVIDIA · Semiconductors · Pune, India

NVIDIA is seeking a senior software engineer to build and maintain a large-scale private cloud system on GitHub and Kubernetes, supporting CI services for various NVIDIA teams. The role involves scaling cloud services to thousands of servers, handling millions of jobs daily, and improving the efficiency of NVIDIA's software engineers. Responsibilities include building scalable cloud solutions, tackling infrastructure challenges like job scheduling and resource management, developing metrics, alerting, and storage services, and applying machine learning and deep learning to improve system performance. The ideal candidate has 5+ years of experience, a strong OOP background (preferably Java), and experience with large-scale cloud infrastructure, Kubernetes, message brokers, and databases.

What you'd actually do

  1. Build creative, scalable cloud solutions to handle millions of jobs and thousands of systems
  2. Tackle challenging problems in infrastructure such as job scheduling, resource management, and automated recovery
  3. Develop complete solutions including Metrics, Alert, and Storage Services
  4. Dig into data, analyze it extensively, and apply machine learning and deep learning algorithms to improve system performance and predictability
  5. Contribute to our GitHub-based CI workflow to streamline and optimize processes

Skills

Required

  • Strong object-oriented programming background, with a preference for Java
  • Proven experience in developing large-scale cloud infrastructure applications
  • Knowledge of various technologies including Kubernetes and message brokers
  • Experience with relational databases like MySQL, and NoSQL databases such as Elasticsearch
  • Ability to work effectively with various teams across different time zones
  • BS/MS in Computer Science, Computer Engineering, or equivalent experience
  • 5+ years of proven experience

Nice to have

  • Real-world experience with distributed systems, containers, and the Kubernetes API
  • Proficiency in computer algorithms with the capability to select the most suitable algorithms for complex problems
  • Skill in breaking down complex problems into manageable sub-problems and reusing solutions effectively
  • Experience in crafting, implementing, and deploying major infrastructure features across multiple servers with incremental rollout
  • Proficiency in machine learning and data analytics and their application to infrastructure, along with the ability to build simple systems that operate efficiently with minimal support

What the JD emphasized

  • 5+ years of proven experience