Senior Research-Ops & DevOps Engineer

NVIDIA · Semiconductors · Yokneam, Israel +1

NVIDIA is seeking a Senior Software Engineer for its Video/Multimedia A&A team to lead infrastructure and operations. The role covers setting up compute resources (on-prem and cloud), developing distributed pipelines for large-scale regressions and experiments across hardware simulations and ML workloads, and maintaining CI/CD and development environments. The goal is to transform research workflows into reliable, automated systems.

What you'd actually do

  1. Work closely with our Architects and Algorithms Engineers to understand their needs and transform one-off research workflows into dependable, consistent, automated systems
  2. Stand up and operate the compute the group runs on — on-prem GPU clusters, cloud bursts, queues, schedulers (Slurm / Kubernetes), container images, environments
  3. Build distributed workflows for large-scale regression testing and experiments across HW simulations and ML workloads, plus the dashboards that make sense of the results
  4. Lead the team’s CI/CD plus the dev environments, container images and tooling everyone in the group lives in every day

Skills

Required

  • B.Sc. in Computer Science or Electrical/Computer Engineering
  • 5+ years in a DevOps, SRE, MLOps, Research-Ops or platform-engineering role
  • Strong Linux fundamentals
  • Strong hands-on experience in Python
  • Hands-on experience with at least one major Cloud ecosystem (OCI, AWS, Azure, GCP)
  • Infrastructure as Code (Terraform, Pulumi or similar)
  • Containers and orchestration: Docker plus Kubernetes, and/or HPC schedulers like Slurm
  • Experience designing and bringing up CI/CD flows at scale (GitLab CI, GitHub Actions, Jenkins or similar)

Nice to have

  • Familiarity with video compression / codecs (NVENC, NVDEC, FFmpeg, GStreamer)
  • GPU-aware infrastructure experience: CUDA toolkit installs, driver versioning, MIG, NCCL
  • Reading-level comfort with C++
  • Observability experience — Prometheus, Grafana, OpenTelemetry, structured logging

What the JD emphasized

  • ML workloads
  • research workflows
  • CI/CD
  • DevOps
  • SRE
  • MLOps
  • Research-Ops
