Senior Site Reliability Engineer, HPC and LSF

NVIDIA · Semiconductors · Bangalore, India

NVIDIA is seeking a Senior Site Reliability Engineer to manage and operate large-scale compute clusters that power silicon development. The role involves automating deployments, managing workload schedulers, troubleshooting complex issues, and optimizing system performance and reliability. The engineer will collaborate with domain experts to improve infrastructure utilization and contribute to faster time-to-market for new chips.

What you'd actually do

  1. Manage and support workload and resource schedulers in a large-scale HPC environment.
  2. Automate Everything: Develop scripts for deployment, configuration management, and operational monitoring.
  3. Develop solutions for complex computing resource management requirements.
  4. Extract and leverage grid performance metrics for troubleshooting and performance optimization.
  5. Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.

Skills

Required

  • job scheduler administration (e.g., IBM Spectrum LSF or Slurm)
  • CentOS/RHEL Linux distribution administration
  • container technologies like Docker
  • UNIX scripting languages
  • Python
  • problem-solving skills
  • communication and teamwork skills
  • large, distributed Linux environment experience

Nice to have

  • analyzing and tuning performance for HPC or EDA workloads
  • Ansible
  • Perl
  • distributed system principles