Capacity Operations Manager

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

NVIDIA is seeking a Capacity Operations Manager to coordinate the development and improvement of High Performance Computing (HPC) clusters, focusing on GPU capacity and compute resources across cloud platforms. The role involves designing and managing data models, reporting platforms, and automation solutions to support governance and strategic capacity decisions. Key responsibilities include assessing requirements, identifying performance bottlenecks, driving infrastructure resource efficiency, and developing tools for cloud infrastructure and analytics, potentially using AI techniques. The role requires collaboration with engineering, finance, and product teams to align capacity management with company goals and ensure customer satisfaction.

What you'd actually do

  1. Coordinate the development of High Performance Computing (HPC) clusters, collaborating closely with internal and external engineering teams.
  2. Direct and improve GPU capacity and other compute resources across diverse cloud service platforms to meet growing demand and ensure efficient deployment.
  3. Design, improve, and manage data models, reporting platforms, data automation solutions, dashboards, and performance measures that support NVIDIA Infrastructure governance programs and strategic capacity decisions.
  4. Assess the technical and business requirements for GPU capacity and other compute resources from different internal and external groups.
  5. Identify performance bottlenecks in day-to-day usage of compute resources and collaborate with relevant infrastructure teams to resolve them.

Skills

Required

  • Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field, or equivalent experience.
  • 8+ years of overall experience in cloud computing, specifically in managing or using GPU capacity for high performance computing.
  • Strong technical proficiency in cloud architecture, development and deployment, and managing large data sets.
  • Experience with command line interfaces and shell scripting languages.
  • Comprehensive knowledge of cloud service models (IaaS, PaaS, SaaS) and cloud infrastructure technologies.
  • Practical experience with Cloud Service Providers including AWS, Azure, GCP, and OCI is essential.
  • Demonstrated experience applying AI tools and techniques to extract useful signals and insights from data, specifically to improve resource usage and automation.
  • Deep knowledge and active use of statistical modeling and machine learning approaches for boosting operational efficiency and supporting strategic capacity decisions.
  • Understanding of analytics, statistical modeling, and machine learning methodologies.
  • Strong communication and relationship-building skills, with the ability to work well across different departments and contribute to strategic decisions.
  • Self-starter who is motivated, focused, and self-sufficient, with a willingness to take on new challenges and adapt quickly in a dynamic environment.
  • Ability to operate effectively amidst uncertainty and rapidly changing business conditions, with an agile approach and a commitment to ongoing improvement.

Nice to have

  • A proven record of large-scale computing operations and planning is a plus.

What the JD emphasized

  • A proven record of large-scale computing operations and planning is a plus.
  • Demonstrated experience applying AI tools and techniques to extract useful signals and insights from data, specifically to improve resource usage and automation.
  • Deep knowledge and active use of statistical modeling and machine learning approaches for boosting operational efficiency and supporting strategic capacity decisions.