Senior Solution Architect, AI Compute Engineer - Nvis

NVIDIA · Semiconductors · Australia · Remote

Senior Solution Architect, AI Compute Engineer role at NVIDIA, focused on deploying, managing, and maintaining AI/HPC infrastructure in Linux environments for customers. The role involves customer interaction, system design, automation, and feedback to internal teams. It requires strong Linux system administration, scripting, and cluster management skills; experience in distributed computing, high-speed networking, automation tools, and Kubernetes for AI/ML workloads is preferred.

What you'd actually do

  1. Deploy, manage, and maintain AI/HPC infrastructure in Linux-based environments for new and existing customers.
  2. Act as the domain expert for customers from planning calls through implementation.
  3. Hand over related documentation and perform the knowledge transfers required to support customers as they begin rolling out some of the most sophisticated systems in the world!
  4. Provide feedback to internal teams, such as opening bugs, documenting workarounds, and suggesting improvements.

Skills

Required

  • Linux System Administration
  • process management
  • package management
  • task scheduling
  • kernel management
  • boot procedures/troubleshooting
  • performance reporting/optimization/logging
  • network routing/advanced networking (tuning and monitoring)
  • Cluster management technologies
  • Scripting proficiency
  • interpersonal skills
  • verbal and written English skills
  • organizational skills
  • ability to prioritize and multitask
  • Linux certifications
  • Schedulers such as SLURM, LSF, UGE

Nice to have

  • MPI (e.g., OpenMPI, MPICH)
  • distributed communication programming
  • cluster debugging
  • NCCL principles and applications
  • collective communication optimization for NVIDIA GPU clusters
  • deploying and optimizing high-speed networks (InfiniBand/Ethernet)
  • understanding of how network architecture impacts GPU cluster performance
  • automation tools (Ansible, Salt, Puppet, etc.)
  • batch configuration and operational automation for GPU clusters and LLM deployment environments
  • Kubernetes
  • container orchestration for AI/ML workloads
  • resource scheduling
  • scaling
  • integration with HPC environments

What the JD emphasized

  • customer-blocking issues

Other signals

  • deploying, managing and maintaining AI/HPC infrastructure
  • customer-focused team
  • large-scale AI/HPC projects