HPC Operations Engineer

NVIDIA · Semiconductors · Santa Clara, CA

NVIDIA is seeking an HPC Operations Engineer to design, implement, and operate large-scale compute clusters that power silicon development. The role involves troubleshooting, automation, configuration management, and ensuring system reliability and efficiency within a High-Performance Computing environment.

What you'd actually do

  1. Troubleshoot incoming support requests in a large-scale HPC environment.
  2. Contribute enhancements to existing deployment automation, configuration management, observability, and operational monitoring, and streamline day-to-day operations through automation.
  3. Ensure compute servers are running the correct operating system and configuration.
  4. Troubleshoot complex issues from bare metal to application level to ensure system reliability and efficiency.
  5. Collaborate with specialist teams to drive issues to closure.

Skills

Required

  • CentOS/RHEL Linux administration
  • Docker
  • Python
  • Bash scripting
  • problem-solving
  • communication
  • teamwork
  • Ansible

Nice to have

  • NFS
  • automounter
  • LDAP
  • DNS
  • TCP/IP networking
  • job scheduler administration (LSF or Slurm)
  • FlexLM license management
  • Perl
  • High-Speed Networking (InfiniBand, RDMA, RoCE)
  • distributed storage systems (Lustre, GPFS)

What the JD emphasized

  • large-scale HPC environment
  • automation
  • system reliability and efficiency