Senior HPC DevOps Engineer

NVIDIA · Semiconductors · Germany +4 · Remote

NVIDIA seeks a Senior HPC DevOps and Network Engineer to build future supercomputers and HPC clusters, focusing on AI and GPU computing advancements. The role involves designing, implementing, and maintaining large-scale clusters with state-of-the-art monitoring, IaC, CI/CD pipelines, and automation. Responsibilities include troubleshooting complex issues, leading technical resources, and supporting R&D for future improvements. Requires 5+ years of experience with HPC/AI technologies, programming/scripting, CI/CD tools, Linux/Windows, networking (InfiniBand, Ethernet), job schedulers (Slurm, Kubernetes), storage solutions, and virtualization. Cloud platform familiarity is a plus.

What you'd actually do

  1. Design, implement, and maintain large-scale HPC/AI clusters with state-of-the-art monitoring, logging, and alerting systems.
  2. Utilize and develop tools to manage infrastructure as code, ensuring scalable and repeatable deployments.
  3. Develop and maintain continuous integration and continuous delivery (CI/CD) pipelines to automate and streamline deployment processes.
  4. Develop automation scripts and tools to automate deployment, configuration management, and operational monitoring.
  5. Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.

Skills

Required

  • B.Sc. in Computer Science, Engineering, or a related field, plus 5+ years of experience.
  • Deep knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
  • Advanced proficiency in programming and scripting languages, with a solid understanding of object-oriented programming principles.
  • Familiarity with Jenkins, Ansible, Puppet/Chef.
  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu), networking, and OS-level security.
  • Deep understanding of networking protocols such as InfiniBand and Ethernet.
  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes.
  • Background with multiple storage solutions like Lustre, GPFS, ZFS, and XFS.
  • Expertise with virtual systems (VMware, Hyper-V, KVM, Citrix).
  • Familiarity with cloud platforms (AWS, Azure, Google Cloud).

Nice to have

  • Proven networking experience or strong knowledge through professional networking training.
  • Knowledge of CPU and/or GPU architecture.
  • Understanding of Kubernetes and container-related microservice technologies.
  • Experience with GPU-focused hardware/software (DGX, CUDA).
  • Background with RDMA (InfiniBand or RoCE) fabrics.

What the JD emphasized

  • HPC and AI solution technologies
  • programming and scripting languages
  • Jenkins, Ansible, Puppet/Chef
  • Windows and Linux (Red Hat/CentOS and Ubuntu), networking, and OS-level security
  • networking protocols such as InfiniBand and Ethernet
  • job scheduling and workload orchestration tools such as Slurm and Kubernetes
  • storage solutions like Lustre, GPFS, ZFS, and XFS
  • virtual systems (VMware, Hyper-V, KVM, Citrix)
  • cloud platforms (AWS, Azure, Google Cloud)
  • GPU-focused hardware/software (DGX, CUDA)
  • RDMA (InfiniBand or RoCE) fabrics