Director, Site Reliability and Software Engineering - DGX Cloud

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

This role is for a Director of Site Reliability and Software Engineering for NVIDIA's DGX Cloud, focusing on managing the software, automation, and operations of distributed GPU clusters. The role involves leading a team, defining strategy and roadmap, driving technical projects, and ensuring operational excellence for a large-scale distributed system that supports AI development and deployment.

What you'd actually do

  1. Manage a team of Software and Site Reliability engineers, including program development, task planning and code reviews.
  2. Define team strategy and roadmap, and drive adoption of scalable SDLC practices, test infrastructure, and modern practices in NVIDIA’s DGX Cloud computing environment.
  3. Drive technical projects and provide leadership in an innovative and fast-paced environment.
  4. Be responsible for the overall planning, tracking and success of technical projects.
  5. Work closely with project and product management teams to ensure best-in-class product development.

Skills

Required

  • 12+ years of overall experience in engineering management
  • 5+ years of leadership
  • Bachelor's or Master's degree in Computer Science, or equivalent experience
  • Experience in designing and implementing large-scale distributed systems
  • Experience in container, virtualization, and cluster environments
  • Experience in managing technical support and DevOps teams
  • Strong knowledge in Unix/Linux
  • Experience implementing tools, processes, internal instrumentation, and methodologies, and resolving blockers
  • Demonstrated people management and leadership skills, with a proven track record of mentoring and coaching team members
  • Ability to quickly learn and evaluate new technologies
  • Ability to influence and establish relationships with other software and IT functional groups such as development, server, storage and security teams

What the JD emphasized

  • Track record of earning the respect of past teams and cross-functional partners as both a technical leader and a manager
  • Ability to work across multiple levels of technical and organizational leadership is critical
  • Set appropriate bars for technical excellence and deliver projects under tight deadlines.