Director, Global Network Reliability Engineering

NVIDIA · Semiconductors · Santa Clara, CA

NVIDIA is seeking a Director of Network Reliability Engineering to lead its global network operations, ensuring reliability, scalability, and efficiency. The role involves leading a team, implementing a data-driven approach focused on observability and continuous improvement, and driving the design and automation of operations. The successful candidate will influence network architecture and partner with engineering and operations teams.

What you'd actually do

  1. Your main focus will be maturing the current support model and processes into a more data-driven, automated SRE model.
  2. Build an in-house team of reliability experts for networking support and operations from the existing outsourced SMEs, providing leadership, direction, and strategy for a growing team.
  3. Set the technical vision, strategy, and roadmap for network operations in partnership with the key infrastructure and partner teams.
  4. Work across Network Architecture and Network Engineering, partnering closely with both, to establish runbooks and regular training sessions and to ensure we build the network to be self-healing.
  5. Develop a deep understanding of RCAs from events and incidents, and work with our AI operations to enrich our observability tooling for a better full-stack view from the network to applications.

Skills

Required

  • Bachelor’s degree in Computer Science, related technical field, or equivalent experience
  • Experience building and growing geographically distributed teams that appreciate local operations and bring a global perspective while following standards.
  • Ability to do technical deep-dives into code, networking, operating systems, and storage, as well as being verbally and cognitively agile enough to hold your own in strategy discussions with NVIDIA’s executive team and peer SMEs
  • Ability to identify trends and promote solutions that solve challenges efficiently across multiple product areas
  • Excellent innovative thinking, collaboration, and problem-solving skills.
  • 12+ years of overall experience with system design, network architecture, network engineering, and network operations, and 7+ years of leadership experience

Nice to have

  • Experience transforming network operations using software-driven methods
  • Experience in a Hyperscale Cloud Service Provider (public facing or not)
  • Knowledge of SRE principles (observability, SLOs, SLIs, logging, etc.)
  • Knowledge of software interface design & documentation for less technical end-users

What the JD emphasized

  • AI operations