Senior Systems Software Engineer, Data Center Infrastructure Management - EngOps

NVIDIA · Semiconductors · TX +3 · Remote

NVIDIA is seeking an EngOps Engineer with 5+ years of experience to join its advanced infrastructure software team. The role involves maintaining high-performance, rack-scale management solutions for datacenter environments and supporting the deployment and debugging of hardware and Infrastructure Manager. Responsibilities include troubleshooting cluster failures, managing software and firmware updates, and working with developers and test engineers.

What you'd actually do

  1. Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal cluster availability and performance.
  2. Manage updates to the site controller management nodes.
  3. Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions.

Skills

Required

  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience
  • 5+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure
  • Experience with deployment and configuration of operating systems, computer networks, and high-performance applications
  • Proven ability to work effectively with developers and test engineers across different teams and time zones
  • Experience deploying services in Kubernetes
  • Datacenter or computer architecture experience
  • Background with hardware management protocols and controllers (Redfish, IPMI, BMC) and firmware update automation
  • Experience configuring and debugging complex data center networks
  • Experience developing scripts to automate recovery actions for management controllers and datacenter systems

Nice to have

  • Direct experience with industry standard alerting tools and emergency response practices
  • Experience with observability tools such as Grafana
  • Hands-on experience with GPU-focused hardware and software, such as DGX systems and compute clusters
  • Proficiency in designing large-scale networking technologies and handling the associated challenges
  • Experience with OpenStack and Foreman

What the JD emphasized

  • Datacenter or computer architecture experience is required
  • Hardware/firmware/software interactions