Senior HPC DevOps Engineer

NVIDIA · Semiconductors · Yokneam, Israel

NVIDIA is seeking a Senior HPC DevOps and Network Engineer to build and maintain large-scale HPC/AI clusters. The role involves designing, implementing, and automating infrastructure, CI/CD pipelines, and networking for AI and GPU computing environments. The engineer will troubleshoot complex issues, support R&D, and work with researchers and developers to optimize workflows and develop solutions.

What you'd actually do

  1. Design, implement, and maintain large-scale HPC/AI clusters with state-of-the-art monitoring, logging, and alerting systems.
  2. Utilize and develop tools to manage infrastructure as code, ensuring scalable and repeatable deployments.
  3. Develop and maintain continuous integration and continuous delivery (CI/CD) pipelines to automate and streamline deployment processes.
  4. Develop scripts and tools to automate deployment, configuration management, and operational monitoring.
  5. Develop complex networking automation.

Skills

Required

  • HPC and AI solution technologies
  • CPUs, GPUs, high-speed interconnects, and supporting software
  • Programming and scripting languages
  • Object-oriented programming principles
  • Jenkins, Ansible, Puppet/Chef
  • Windows and Linux (Red Hat/CentOS and Ubuntu)
  • Networking and OS-level security
  • Networking protocols such as InfiniBand and Ethernet
  • Job scheduling and workload orchestration tools such as Slurm and Kubernetes
  • Storage solutions like Lustre, GPFS, ZFS, and XFS
  • Virtual systems (VMware, Hyper-V, KVM, Citrix)
  • Cloud platforms (AWS, Azure, Google Cloud)

Nice to have

  • Networking experience, or strong networking knowledge gained through professional training
  • CPU and/or GPU architecture
  • Kubernetes and container-related microservice technologies
  • GPU-focused hardware/software (DGX, CUDA)
  • RDMA (InfiniBand or RoCE) fabrics

What the JD emphasized

  • HPC and AI solution technologies
  • Deep understanding of networking protocols such as InfiniBand and Ethernet
  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes
  • Experience with multiple storage solutions like Lustre, GPFS, ZFS, and XFS
  • Expertise with virtual systems (VMware, Hyper-V, KVM, Citrix)