Senior Manager, Artificial Intelligence - Machine Learning Platform

NVIDIA · Semiconductors · Santa Clara, CA +2 · Remote

Senior Manager for AI/ML Platform at NVIDIA, leading the development and management of tools and services for the entire AI/ML project lifecycle, focusing on large-scale model training and deployment efficiency. Requires extensive experience in AI/ML infrastructure, team leadership, and strategic vision for AI platforms.

What you'd actually do

  1. Lead the strategic direction, development, and continuous improvement of the AI/ML platform, ensuring it meets the needs of internal researchers for large-scale model training and deployment.
  2. Optimize the efficiency and resilience of each stage of the ML workflow, including data ingestion, preprocessing, checkpointing, model training, deployment, and monitoring.
  3. Lead and mentor a team of highly skilled engineers, fostering a collaborative and high-performance culture.
  4. Work closely with various internal teams, including data scientists, researchers, and IT, to understand their needs and ensure the platform provides the necessary tools and resources.
  5. Drive innovation and efficiency within the platform, exploring and implementing new technologies and methodologies to enhance the capabilities and performance of the AI/ML ecosystem.

Skills

Required

  • AI/ML infrastructure
  • managing large-scale AI/ML projects and platforms
  • leading a highly technical and collaborative team
  • AI/ML lifecycle management
  • high-performance computing
  • niche hardware
  • storage and networking
  • leadership and team management
  • leading multi-functional teams
  • leading sophisticated projects
  • developing and implementing a strategic vision for the AI/ML platform
  • communication and interpersonal skills
  • analytical and problem-solving abilities

Nice to have

  • developing and leading AI/ML platforms in a research or academic environment
  • cloud-based AI/ML platforms and infrastructure
  • driving efficiency and innovation within technical teams and projects

What the JD emphasized

  • leading development of AI/ML platform
  • supporting entire lifecycle of AI/ML projects
  • large-scale model training
  • deployment
  • efficiency and resilience
  • data ingestion
  • preprocessing
  • checkpointing
  • model training
  • monitoring
  • leading a highly technical and collaborative team
  • AI/ML infrastructure
  • large-scale AI/ML projects and platforms
  • AI/ML lifecycle management
  • high-performance computing
  • niche hardware
  • storage and networking
  • leading multi-functional teams
  • leading sophisticated projects
  • strategic vision for the AI/ML platform

Other signals

  • enabling internal researchers to bring to bear very large-scale systems for training foundational models