Senior Hpc Cluster Administrator - Deep Learning Frameworks Infrastructure

NVIDIA NVIDIA · Semiconductors · Poland +1 · Remote

NVIDIA is seeking a Senior HPC Cluster Administrator to manage large-scale GPU compute clusters for deep learning training, inference, and HPC workloads. The role involves full lifecycle management, automation, performance optimization, and collaboration with ML engineers and software teams.

What you'd actually do

  1. Own the full lifecycle of GPU compute clusters — procurement, provisioning, configuration management, monitoring, and deprecation — across heterogeneous Linux environments (DGX, HGX, embedded systems)
  2. Design and scale storage solutions (NFS, Lustre, WekaFS, or equivalent) with a clear roadmap for capacity and performance growth
  3. Lead automation of infrastructure using modern IaC tools (Ansible, Terraform) and CI/CD pipelines (GitLab)
  4. Manage and optimize job scheduling via Slurm, including fair-share policies, reservation management, and MIG/GPU partitioning strategies
  5. Maintain and improve observability stacks (Prometheus, Grafana, DCGM) and drive proactive resolution of hardware and software incidents

Skills

Required

  • BS/MS in CS, EE, CE, or equivalent hands-on experience
  • 5+ years of experience deploying and administering large-scale HPC or ML training clusters
  • Deep expertise in Linux systems administration at scale
  • Strong scripting and automation skills in Python and/or bash
  • Hands-on experience with Slurm (scheduling, accounting, cgroup configuration)
  • Proficiency with configuration management and IaC (Ansible required; Terraform a plus)
  • Experience with container technologies (Docker, Apptainer/Singularity, Kubernetes)
  • Solid understanding of high-speed networking (InfiniBand, RoCE, RDMA, EFA)
  • Experience with distributed/parallel filesystems and storage architecture
  • Ability to own problems end-to-end and communicate clearly with engineering and management stakeholders

Nice to have

  • Experience with NVIDIA GPU infrastructure tools (DCGM, nvidia-smi, MIG, NVSwitch diagnostics)
  • Familiarity with cluster management platforms (Colossus, Bright Cluster Manager, xCAT, or similar)
  • Experience supporting large-scale distributed deep learning workloads (PyTorch, JAX, Megatron)
  • Knowledge of BMC/IPMI/Redfish for out-of-band management and hardware lifecycle
  • Background in MLOps tooling or ML platform engineering

What the JD emphasized

  • large-scale HPC or ML training clusters
  • Deep expertise in Linux systems administration at scale
  • Hands-on experience with Slurm
  • Proficiency with configuration management and IaC (Ansible required; Terraform a plus)
  • Solid understanding of high-speed networking (InfiniBand, RoCE, RDMA, EFA)
  • Experience with distributed/parallel filesystems and storage architecture
  • Ability to own problems end-to-end

Other signals

  • large-scale GPU compute clusters
  • deep learning training, inference
  • HPC workloads
  • infrastructure ahead of the workloads