Hpc Operations Engineer

NVIDIA NVIDIA · Semiconductors · Bangalore, India

NVIDIA is seeking an HPC Operations Engineer to provide first-line support for HPC users, troubleshoot job failures, monitor system health, and contribute to process improvements. The role requires experience supporting Linux-based production environments and strong troubleshooting skills.

What you'd actually do

  1. Provide first-line support for HPC users across scheduling, compute, storage, and access-related issues
  2. Troubleshoot job failures, scheduler errors, resource constraints, and performance concerns, driving issues to resolution or appropriate customer concern
  3. Perform triage of infrastructure incidents, gathering diagnostics and advancing to subject matter experts (SMEs) when issues extend beyond defined ownership
  4. Monitor system health, queues, node status, and service availability to ensure stable daily operations
  5. Complete established operational procedures for maintenance, patching, and configuration updates

Skills

Required

  • Linux systems administration fundamentals
  • Troubleshoot technical issues methodically
  • Experience interacting directly with users in a technical support or operations role
  • Clear documentation and procedural guides
  • Attention to detail

Nice to have

  • Scripting or automation experience (e.g., Bash or Python)
  • Workload schedulers such as LSF, Slurm, or similar systems
  • Network computing supporting infrastructure (NFS, automounter, LDAP)
  • HPC or large-scale compute environments
  • EDA workloads