Capacity Operations and Analytics Manager

NVIDIA · Semiconductors · Sao Paulo, Brazil +1 · Remote

NVIDIA is seeking a Capacity Operations and Analytics Manager to optimize GPU and compute resource utilization across cloud providers. The role involves building data models, reporting systems, and dashboards to support capacity decisions, analyzing technical and business needs, identifying performance bottlenecks, and driving infrastructure resource efficiency. The manager will also develop tooling for cloud infrastructure and analytics, potentially leveraging AI techniques for insights, and partner with various teams to align capacity management with company goals. A strong background in cloud computing, GPU capacity management, data analytics, and statistical modeling is required, with experience in leveraging AI tools for resource optimization being a plus.

What you'd actually do

  1. Manage and optimize GPU capacity and other compute resources across various cloud service providers to meet growing demands and ensure efficient utilization.
  2. Build, develop, and maintain data models, reporting systems, data automation systems, dashboards, and performance metrics that support NVIDIA Infrastructure governance programs and strategic capacity decisions.
  3. Analyze the technical and business needs for GPU capacity and other compute resources from various internal and external teams.
  4. Identify performance bottlenecks in day-to-day usage of compute resources and collaborate with relevant infrastructure teams to resolve them.
  5. Drive infrastructure resource efficiency initiatives in partnership with engineering, finance, and product teams.

Skills

Required

  • Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field, or equivalent experience.
  • 10+ years of overall experience in cloud computing, specifically in managing or sourcing GPU capacity with cloud service providers.
  • Strong technical proficiency in cloud architecture, development and deployment, and managing large data sets.
  • Deep understanding of cloud service models (IaaS, PaaS, SaaS) and cloud infrastructure technologies.
  • Experience with Cloud Service Providers such as AWS, Azure, GCP, and OCI is required.
  • Demonstrated experience in leveraging AI tools and techniques to extract useful signals and insights from data, specifically to improve resource usage and automation.
  • Strong understanding and practical application of statistical modeling and machine learning methodologies for improving operational efficiency and informing strategic capacity decisions.
  • Proficiency with data analytics, visualization, and monitoring tools such as Kibana, Grafana, Splunk, Prometheus, Tableau, and Plotly.
  • Knowledge of analytics, statistical modeling, and machine learning methodologies.
  • Excellent communication and interpersonal skills, with the ability to collaborate effectively with various departments and influence strategic decisions.
  • Ability to operate effectively amidst uncertainty and rapidly changing business conditions, with an agile mindset and a commitment to ongoing improvement.

Nice to have

  • A proven track record in large-scale computing operations and planning.

What the JD emphasized

  • Experience with Cloud Service Providers such as AWS, Azure, GCP, and OCI is required.
  • Demonstrated experience in leveraging AI tools and techniques to extract useful signals and insights from data, specifically to improve resource usage and automation.
  • Strong understanding and practical application of statistical modeling and machine learning methodologies for improving operational efficiency and informing strategic capacity decisions.