Senior Solutions Architect, InfiniBand and Networking Ethernet

NVIDIA · Semiconductors · Taipei, Taiwan

This role focuses on building and supporting the networking side of AI/HPC infrastructure (InfiniBand and Ethernet). Responsibilities include designing, deploying, and maintaining large-scale AI clusters, with attention to performance, monitoring, and reliability. The work involves customer interaction, system design, automation, and feedback to internal teams. It requires deep expertise in networking fundamentals, data center architecture, the EVPN, BGP, OSPF, and VXLAN protocols, network switch/router platforms, and automation tools (Ansible, Salt, Python) for network provisioning and CI/CD.

What you'd actually do

  1. Build AI/HPC infrastructure for new and existing customers.
  2. Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, real-time monitoring, logging, and alerting.
  3. Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.
  4. Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  5. Provide feedback to internal teams by opening bugs, documenting workarounds, and suggesting improvements.

Skills

Required

  • Networking fundamentals
  • TCP/IP stack
  • Data center architecture
  • InfiniBand networks
  • EVPN
  • BGP
  • OSPF
  • VXLAN protocols
  • Network switch/router platforms (Cumulus Linux, SONiC, IOS, Junos OS, EOS)
  • Automated network provisioning
  • Ansible
  • Salt
  • Python
  • CI/CD pipelines for network operations
  • Customer needs and satisfaction focus
  • Self-motivated
  • Leadership skills
  • Collaborative work
  • English communication skills

Nice to have

  • Familiarity with cloud networks (AWS, GCP, Azure)
  • Linux or Networking Certifications
  • High-performance computing architectures
  • Job schedulers (Slurm, PBS)
  • Cluster management technologies (BCM)
  • GPU focused hardware/software

What the JD emphasized

  • At least 8 years of professional experience in networking fundamentals, TCP/IP stack, and data center architecture
  • Proficiency in configuring, testing, validating, and resolving issues in LAN and InfiniBand networks, especially in medium to large-scale HPC/AI environments.
  • Advanced knowledge of EVPN, BGP, OSPF, VXLAN protocols.
  • Hands-on experience with network switch/router platforms like Cumulus Linux, SONiC, IOS, Junos OS, and EOS.
  • Extensive experience delivering automated network provisioning solutions using tools like Ansible, Salt, and Python.
  • Ability to develop CI/CD pipelines for network operations.

Other signals

  • AI/HPC infrastructure
  • large-scale AI clusters
  • performance at scale
  • real-time monitoring
  • logging
  • alerting
  • network switch/router platforms
  • automated network provisioning
  • CI/CD pipelines for network operations