Senior HPC Storage Engineer

NVIDIA · Semiconductors · Santa Clara, CA +1

NVIDIA is seeking a Senior HPC Storage Engineer to lead research, design, and implementation of fast storage solutions for high-performance computing and computationally intensive workloads. The role involves identifying architectural changes for file, block, and object storage, developing automation tooling, and collaborating with teams to understand developer workflows and infrastructure requirements. The position also supports researchers by optimizing deep learning workflows and performing root cause analysis for storage issues.

What you'd actually do

  1. Research and analyze existing internal distributed storage services.
  2. Research, design, and implement scalable, next-gen distributed storage services for HPC workloads, optimizing both performance and cost-effectiveness to meet NVIDIA’s growing infrastructure needs.
  3. Develop tooling to automate management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
  4. Document general procedures and practices, and perform technology evaluations, for distributed file systems.
  5. Collaborate across teams to better understand developers' workflows and capture their infrastructure requirements.

Skills

Required

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
  • 8+ years of experience designing and/or operating large-scale storage infrastructure.
  • Experience analyzing and tuning storage performance for a variety of workloads.
  • Proficient in CentOS/RHEL and/or Ubuntu Linux distributions, as well as Python programming and Bash scripting.
  • In-depth understanding of container technologies such as Docker and Enroot.

Nice to have

  • Distributed Storage Expertise: Extensive experience with parallel and distributed filesystems (Ceph, Weka.io, Vast, Lustre, GPFS) and Linux storage kernel development.
  • GPU & AI Infrastructure: Proficient with NVIDIA GPUs, CUDA programming, and NCCL, including performance benchmarking via MLPerf.
  • Hardware & Storage Engineering: Deep familiarity with storage hardware (HDDs, SSDs, NVMe), enclosures, and specialized appliances such as those from Network Appliance (NetApp).
  • Advanced Networking: Strong background in Software Defined Networking (SDN) and high-performance networking for AI/HPC clusters.
  • Deep Learning Frameworks: Practical experience applying industry-standard frameworks, specifically PyTorch and TensorFlow.

What the JD emphasized

  • 8+ years of experience designing and/or operating large-scale storage infrastructure.
  • Proficient in CentOS/RHEL and/or Ubuntu Linux distributions, as well as Python programming and Bash scripting.
  • Extensive experience with parallel and distributed filesystems (Ceph, Weka.io, Vast, Lustre, GPFS) and Linux storage kernel development.
  • Proficient with NVIDIA GPUs, CUDA programming, and NCCL, including performance benchmarking via MLPerf.
  • Deep familiarity with storage hardware (HDDs, SSDs, NVMe), enclosures, and specialized appliances such as those from Network Appliance (NetApp).
  • Strong background in Software Defined Networking (SDN) and high-performance networking for AI/HPC clusters.
  • Practical experience applying industry-standard frameworks, specifically PyTorch and TensorFlow.