Distinguished Site Reliability Engineer - Cloud

NVIDIA · Semiconductors · WA +5 · Remote

Distinguished Site Reliability Engineer (SRE) at NVIDIA responsible for designing, building, and maintaining large-scale production systems, with a focus on the efficiency, availability, and reliability of GPU cloud services. The role involves leading the operational and reliability aspects of Kubernetes clusters, engaging in the full service lifecycle, supporting services before and after launch, scaling systems through automation, and practicing sustainable incident response. Requires extensive experience in infrastructure automation, distributed systems, Linux, networking, and containers, with proficiency in languages like Python or Go.

What you'd actually do

  1. Lead, design, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
  2. Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement
  3. Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
  4. Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  5. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity

Skills

Required

  • Infrastructure automation
  • Distributed systems design
  • Designing and developing tools for running large-scale private or public cloud systems in production
  • Python
  • Go
  • Perl
  • Ruby
  • Linux
  • Networking
  • Containers

Nice to have

  • Crafting, analyzing, and fixing large-scale distributed systems
  • Systematic problem-solving
  • Strong communication skills
  • Sense of ownership and drive
  • Debugging and optimizing code
  • Automating routine tasks
  • Using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker

What the JD emphasized

  • large scale production systems
  • high efficiency and availability
  • Kubernetes
  • OpenStack
  • GPU cloud services
  • automation
  • performance tuning
  • capacity management
  • latency
  • performance
  • incident response
  • blameless postmortems