Solutions Architect, Networking Ethernet

NVIDIA · Semiconductors · Australia · Remote

This role focuses on building and operating networking infrastructure for AI/HPC systems, supporting the operation and reliability of large-scale AI clusters, and engaging in the full lifecycle of services. It requires strong networking fundamentals, automation skills, and experience with large-scale deployments.

What you'd actually do

  1. Build and operate AI/HPC infrastructure for new and existing customers.
  2. Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, real-time monitoring, logging, and alerting.
  3. Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
  4. Maintain live services by measuring and monitoring availability, latency, and overall system health, with an emphasis on optimising performance and availability to meet requirements and SLAs.
  5. Provide feedback to internal teams, for example by opening bugs, documenting workarounds, and suggesting improvements.

Skills

Required

  • networking fundamentals
  • TCP/IP stack
  • data center architecture
  • configuring, testing, validating, and resolving issues in LAN networks
  • EVPN, BGP, OSPF, VXLAN protocols
  • network switch/router platforms (Cumulus Linux, SONiC, IOS, Junos OS, and EOS)
  • automated network provisioning solutions (Ansible, Salt, Python)
  • CI/CD pipelines for network operations
  • customer needs and satisfaction
  • collaboration

Nice to have

  • NVIDIA Spectrum networking
  • cloud networks (AWS, GCP, Azure)
  • RDMA-based fabrics (InfiniBand or RoCE)
  • high-performance computing architectures
  • Kubernetes
  • container-based microservices
  • job schedulers (Slurm, PBS)
  • cluster management technologies (BCM, Run:ai)
  • GPU-focused hardware and software (NVIDIA DGX, CUDA, Network/GPU Operator)

What the JD emphasized

  • At least 5 years of professional experience in networking fundamentals.
  • Proficiency in configuring, testing, validating, and resolving issues in LAN networks, especially in medium to large-scale HPC/AI environments.
  • Advanced knowledge of EVPN, BGP, OSPF, VXLAN protocols.
  • Hands-on experience with network switch/router platforms like Cumulus Linux, SONiC, IOS, Junos OS, and EOS.
  • Extensive experience delivering automated network provisioning solutions using tools like Ansible, Salt, and Python.
  • Ability to develop CI/CD pipelines for network operations.