Senior HPC Software Engineer

NVIDIA · Semiconductors · Yokneam, Israel

Senior HPC Software Engineer at NVIDIA focused on building and supporting critical services, automation, performance issue detection, capacity planning, and SRE solutions in a multi-cloud hybrid environment. Requires strong coding, Kubernetes, CI/CD, IaC, and SRE capabilities, with full-stack AI experience mentioned as a requirement.

What you'd actually do

  1. Own the solutions you build, collaborating with cross-functional teams to successfully implement them.
  2. Collaborate with various teams in a fast-paced environment to ensure seamless project completion.
  3. Continuously improve solution provisioning and management through automation.
  4. Detect performance issues and recommend solutions to maintain world-class service quality.
  5. Conduct capacity management and planning to meet ongoing operational needs.

Skills

Required

  • B.S. degree in Computer Science or a related technical field (or equivalent experience).
  • 8+ years of experience building and supporting critical services.
  • 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.
  • Proficiency in Kubernetes administration, modern CI/CD techniques and Infrastructure as Code (IaC).
  • Full-stack AI experience with deep expertise in MCP ecosystems, Carpenter, n8n orchestration, and AI-assisted development via Cursor.
  • Expertise with at least one major cloud service provider (AWS, GCP, or Azure).
  • Demonstrated proficiency with end-to-end SRE capabilities and observability.
  • Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.
  • Creative problem solver with excellent debugging skills and great communication and documentation abilities.

Nice to have

  • Linux certification from a well-known vendor (Red Hat, Oracle, etc.).
  • Prior experience managing large-scale Kubernetes deployments in production.
  • Strong skills in modern container networking and storage architecture.
  • Hands-on experience with FlexLM and license management systems.
  • Hands-on experience working with Slurm/LSF environments.

What the JD emphasized

  • 8+ years of experience building and supporting critical services.
  • 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.
  • Proficiency in Kubernetes administration, modern CI/CD techniques and Infrastructure as Code (IaC).
  • Full-stack AI experience with deep expertise in MCP ecosystems, Carpenter, n8n orchestration, and AI-assisted development via Cursor.
  • Demonstrated proficiency with end-to-end SRE capabilities and observability.