Software Engineer - AI Research Clusters

NVIDIA · Semiconductors · Santa Clara, CA +5 · Remote

Software Engineer to build and maintain GPU clusters for internal AI researchers, focusing on reliability, performance, and self-service. The role involves applying AIOps and Agentic AI to reduce operational toil and support the training, fine-tuning, and deployment of advanced ML models.

What you'd actually do

  1. propose and implement engineering solutions to ensure delivery of functional, reliable, secure, and performance-optimal GPU clusters to internal researchers
  2. reduce operational disruption and overhead
  3. empower researchers to self-serve, with continuous improvement in reliability, operational excellence, and performance
  4. design, develop, and maintain engineering solutions that address researchers' pain points systematically
  5. research traditional AIOps and emerging Agentic AI, and apply both to further reduce operational toil

Skills

Required

  • BS/MS in Computer Science, Engineering, or equivalent experience
  • 2+ years in software/platform engineering
  • 1 year in ML infrastructure or distributed systems
  • Experience with the software development lifecycle on Linux-based platforms
  • Strong coding skills in languages such as Python, C++ or Rust
  • Experience with Docker, Kubernetes, GitLab CI, automated deployments
  • Experience with AIOps or Agentic AI

Nice to have

  • Proficiency with full-stack development: relational data modeling, DB optimization, REST API semantics, JavaScript, CSS, and providing APIs as a service
  • Passion for building developer-centric platforms with great UX and strong operational reliability
  • Experience running Slurm or custom scheduling frameworks in production ML environments
  • Familiarity with GPU computing, Linux systems internals, and performance tuning at scale

What the JD emphasized

  • apply it successfully in a production environment

Other signals

  • GPU clusters
  • ML infrastructure
  • AIOps
  • Agentic AI