Senior Infrastructure Software Engineer, Deep Learning Libraries

NVIDIA · Semiconductors · Santa Clara, CA

Senior Software Engineer to build and maintain scalable infrastructure for NVIDIA's deep learning libraries (cuDNN, TensorRT, CUDA), focusing on build, test, integration, and release automation across diverse platforms.

What you'd actually do

  1. Designing and developing software for testing and analysis of our codebases
  2. Building scalable automation for build, test, integration, and release processes for publicly distributed deep learning libraries
  3. Developing throughout the software stack, from the user experience and user interfaces down to the cluster and database layers
  4. Configuring, maintaining, and building upon deployments of industry-standard tools (e.g. Kubernetes, Jenkins, Docker, CMake, GitLab, Jira)
  5. Developing front-end solutions using HTML, CSS, JavaScript, and related web technologies

Skills

Required

  • Master's degree in Computer Science or Computer Engineering, or equivalent experience
  • 3+ years of relevant experience
  • Strong programming skills in Python (or similar)
  • Familiarity with C/C++ development
  • Experience setting up, maintaining, and automating continuous integration systems (e.g. Jenkins, GitHub Actions, GitLab pipelines, Azure DevOps)
  • Experience with HTML5, CSS, Node.js, or React
  • Fluency in SCM (e.g. Git, Perforce)
  • Fluency in build systems (e.g. Make, CMake, Bazel)
  • Background with distributed systems
  • Background with cluster/cloud computing
  • Background with Kubernetes

Nice to have

  • Prior experience designing and developing automation in Jenkins with Groovy (or similar)
  • Track record of identifying useful new technologies and incorporating them into software development flows
  • A strong understanding of unit and integration test frameworks and experience with crafting them
  • Experience with mobile/embedded platforms and multiple operating systems (Ubuntu, RedHat, Windows, QNX, or similar)

What the JD emphasized

  • deep learning libraries
  • scalable automation
  • build, test, integration, and release processes
  • distributed systems
  • cluster/cloud computing