Senior Site Reliability Engineer, GeForce NOW

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

Senior Site Reliability Engineer for NVIDIA's GeForce NOW cloud gaming service, focusing on the reliability, uptime, and performance of large-scale distributed microservices. Responsibilities include building SRE observability tools, Kubernetes migration, incident management, automation, and supporting services through design, capacity planning, and launch reviews. Requires strong Kubernetes and SRE experience, coding proficiency in Go, Python, and Bash, and production on-call experience.

What you'd actually do

  1. Build tools to improve SRE observability.
  2. Contribute to the Kubernetes migration, including VMI setup and problem solving.
  3. Rapidly debug and triage incidents and user-reported issues.
  4. Take ownership of automation, scripting, and tooling, for both new and existing scripts, to help the team achieve 100% automation of daily tasks.
  5. Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews.

Skills

Required

  • Site reliability engineering
  • Large-scale distributed microservices
  • Automation
  • Tooling
  • Kubernetes
  • VMI setup
  • Incident management
  • Change management
  • Post-mortem reviews
  • Workflow processes
  • Software automation
  • Problem-solving
  • Root cause analysis
  • Optimization
  • Efficiency
  • Datadog
  • Prometheus
  • Alertmanager
  • Multi-region cloud deployments
  • AWS
  • GCP
  • Azure
  • Deployment pipelines
  • GitHub Actions
  • GitLab CI
  • ArgoCD
  • Go
  • Python
  • Bash scripting
  • Production on-call experience

Nice to have

  • Automated anomaly detection
  • Log clustering tools
  • LLM-assisted debugging platforms
  • Day-to-day use of AI as an SRE
  • Prior experience as an SRE or Service Engineer

What the JD emphasized

  • 8+ years of site reliability engineering experience
  • Very strong Kubernetes background
  • Production on-call experience is a must