Senior Platform and EngOps Engineer - Cluster Operations

NVIDIA · Semiconductors · Bangalore, India

NVIDIA is seeking a Senior Platform and EngOps Engineer to manage and maintain large GPU clusters, focusing on automation, deployment, provisioning, and troubleshooting to ensure seamless operations for AI and High-Performance Computing initiatives.

What you'd actually do

  1. Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
  2. Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability, ensuring seamless operations.
  3. Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal cluster availability and performance.
  4. Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions.
  5. Collaborate effectively with dynamic Engineering and Product Teams across multiple time zones to align cluster operations with evolving project requirements.

Skills

Required

  • Deploying and administering clusters, servers, switches, and related infrastructure
  • Automation
  • Ansible
  • Python
  • Shell Scripting
  • Operating systems
  • Computer networks
  • High-performance applications
  • Linux fundamentals

Nice to have

  • Resource scheduling managers (Slurm)
  • Industry-standard alerting tools
  • Emergency response practices
  • GPU-focused hardware and software (DGX systems, Compute Clusters)
  • Metrics collection and alerting infrastructure
  • Large-scale networking technologies

What the JD emphasized

  • 5+ years of hands-on experience in deploying and administering clusters, servers, switches, and related infrastructure
  • Automation expert with hands-on skills in Ansible, Python, and Shell Scripting
  • Deep understanding of operating systems, computer networks, and high-performance applications
  • Proficient with Linux fundamentals