Senior Server RAS Engineer

NVIDIA · Semiconductors · Bangalore, India

NVIDIA is seeking a Senior RAS Engineer to improve the reliability of its GPU and Grace systems by designing, architecting, and implementing robust RAS features. The role involves defining requirements, developing fault detection and recovery mechanisms, evaluating technologies, and collaborating with internal teams and external partners. The engineer will work on all phases of product development, from definition to customer support, in a Linux environment.

What you'd actually do

  1. Design, architect, and deliver server-level RAS for NVIDIA’s data center products.
  2. Define RAS requirements that ensure compliance with industry standards and customer expectations for scale-out environments.
  3. Develop fault detection, isolation, and recovery mechanisms to ensure system resilience and minimize downtime.
  4. Evaluate and select appropriate technologies and components to optimize reliability, availability, and serviceability, considering factors such as mean time between failures (MTBF), mean time to repair (MTTR), and total cost of ownership (TCO).
  5. Collaborate with customers, vendors and suppliers to assess and integrate their RAS-related solutions into the overall system architecture.

Skills

Required

  • BS, MS, or PhD (or equivalent experience) in EE/CS or a related field, with 10+ years of demonstrated experience
  • Strong Python programming skills in a Linux operating environment
  • Strong understanding of Linux kernel internals
  • Strong code review skills
  • Extensive knowledge in system-level architecture invention, reliability engineering, and fault tolerance mechanisms, optimizing RAS architectures for complex computing systems, data centers, or critical applications
  • Proficient in scale-out architectures
  • Proficiency in system-level simulation tools and methodologies (e.g., fault injection, reliability block diagrams, failure rate analysis)
  • Excellent problem-solving skills
  • Attention to detail
  • Ability to analyze complex system-level issues
  • Excellent written and oral communication skills
  • Excellent work ethic
  • Deep sense of collaboration
  • Love of producing quality work
  • Commitment to finishing your tasks every single day
  • Self-starter who loves to find creative solutions to complicated problems

Nice to have

  • Consistent track record of doing RAS at platform level
  • In-depth understanding of the interaction of machine check architecture and error flows with system firmware/software
  • Hands-on experience with x86 or ARM system architecture

What the JD emphasized

  • Extensive knowledge in system-level architecture invention, reliability engineering, and fault tolerance mechanisms, optimizing RAS architectures for complex computing systems, data centers, or critical applications
  • Proficiency in system-level simulation tools and methodologies (e.g., fault injection, reliability block diagrams, failure rate analysis)
  • Consistent track record of doing RAS at platform level