Senior Software Engineer, Distributed Systems - NIM Factory

NVIDIA · Semiconductors · Santa Clara, CA +3 · Remote

NVIDIA is hiring a Senior Software Engineer to design and build factory infrastructure and automation for NVIDIA Inference Microservices (NIMs). The role focuses on optimizing and serving performant inference for AI models in heterogeneous environments, and on building an efficient, scalable, and reliable automation factory that produces validated NIMs.

What you'd actually do

  1. Develop a factory pipeline that takes an AI model as input and produces a deployable service validated across Cloud, On-prem and Kubernetes environments.
  2. Work with technical leaders to design and develop scalable and reliable factory components.
  3. Define metrics and drive improvements based on user feedback.

Skills

Required

  • Advanced programming skills for building distributed and compute systems, backend services, microservices, and cloud technologies
  • Experience working with multi-functional teams, principals, and architects across organizational boundaries
  • Mentoring and growing teams and team members
  • Deep technical expertise in distributed containerized applications using technologies such as Docker, K8s, Cloud Endpoints, Helm, and Prometheus
  • Passion for building rich microservice applications and build-and-test automation pipelines
  • Excellent interpersonal skills and the ability to lead multi-functional efforts
  • Proven experience debugging and analyzing the performance of distributed microservices or cloud systems
  • BS or MS in Computer Science, Computer Engineering or related field (or equivalent experience)
  • 8+ years of demonstrated experience developing performant microservices, cloud software, and/or tooling

Nice to have

  • Experience delivering event-driven applications using various services such as Temporal, Kafka, Redis or others and a demonstrable ability to discuss the pros and cons of these choices.
  • A history of building and deploying containers for Microservices, Cloud and On-prem deployments, and their associated CI/CD pipelines
  • Prior experience with large-scale full-stack development

What the JD emphasized

  • design and build factory infrastructure and automation
  • optimize and serve performant inferencing for every AI model
  • design an efficient, scalable and reliable automation factory
  • build a highly efficient factory to power how NVIDIA builds and validates NIMs for inferencing
  • build the infrastructure that strives to accelerate the delivery of every AI model on NVIDIA's GPUs anywhere
  • design and build our factory capabilities, including the underlying infrastructure, pipelines, backends, Docker build, test harness, metrics, performance engineering, log ingestion, and more
  • factory pipeline
  • deployable service
  • validated
  • Cloud, On-prem and Kubernetes environments
  • technical strategies and roadmaps
  • designing interfaces, data modeling and schema design
  • expanding observability over the factory pipeline and its compute infrastructure
  • scalable and reliable factory components
  • build an efficient infrastructure
  • improves every team's productivity
  • Define metrics and drive improvements based on user feedback
  • advanced programming skills to build distributed and compute systems, backend services, microservices and cloud technologies
  • Deep technical expertise in distributed containerized applications using technologies such as Docker, K8s, Cloud Endpoints, Helm, and Prometheus
  • Passion for building rich microservice applications and build-and-test automation pipelines
  • Proven experience debugging and analyzing the performance of distributed microservices or cloud systems
  • 8+ years of demonstrated experience developing performant microservices, cloud software, and/or tooling

Other signals

  • NVIDIA Inference Microservices (NIMs)
  • optimize and serve performant inferencing for every AI model
  • factory infrastructure and automation
  • deployable service that is validated across Cloud, On-prem and Kubernetes environments