Principal Network Engineer - AI Infrastructure

CVS Health · Healthcare · Albany, NY +52 · Innovation and Technology

The Principal Network Engineer - AI Infrastructure is responsible for designing, implementing, and optimizing high-performance network infrastructure for AI and GPU-driven workloads, including data center networks, leaf-spine fabrics, and EVPN/VXLAN. This role involves collaborating with compute, storage, and security teams to support large-scale training and inference platforms, ensuring network performance, scalability, and security. The position also involves strategic planning, mentorship, and evaluating emerging AI infrastructure technologies.

What you'd actually do

  1. Design and implement high-performance data center networks optimized for AI/GPU workloads, including leaf-spine and EVPN/VXLAN fabrics.
  2. Integrate networking with GPU clusters and high-performance storage systems supporting training and inference workloads.
  3. Optimize network performance (latency, throughput, congestion) for large-scale distributed environments.
  4. Define and drive the data center network strategy supporting AI/ML platforms and business initiatives.
  5. Partner with compute, storage, platform, and security teams to design integrated AI infrastructure solutions.

Skills

Required

  • 10+ years of experience in network engineering, with at least 5 years in a leadership, architectural, or lead engineering role delivering enterprise or cloud network initiatives end-to-end.
  • 5+ years of experience designing and operating large-scale data center networks, including Layer 2/3 architectures (leaf-spine/Clos), EVPN/VXLAN overlays, and high-speed networking (100/200/400Gb+).
  • 5+ years of experience with enterprise routing, switching, and network platforms, including Cisco-centric data center fabrics, protocols (BGP, OSPF, MPLS, STP), and hybrid connectivity (SD-WAN, VPN, remote access).
  • 5+ years of experience implementing network security technologies, including Palo Alto Networks firewalls (required), NGFW, IDS/IPS, ZTNA, DLP, and micro-segmentation, with understanding of application-aware and zero trust architectures.
  • 3+ years of experience supporting AI/ML or GPU-based environments, including NVIDIA reference architectures and performance-optimized networking for distributed training workloads (e.g., traffic flow optimization, congestion management).
  • 3+ years of experience with application delivery and observability technologies, including F5 load balancing, network performance monitoring tools (e.g., NetFlow, Wireshark, SolarWinds), and traffic analysis for performance tuning.

Nice to have

  • Experience designing and supporting AI factory / GPU cluster environments at scale (training and inference platforms).
  • Familiarity with high-performance compute networking enhancements, such as RDMA over Converged Ethernet (RoCE), Priority Flow Control (PFC), and Explicit Congestion Notification (ECN).
  • Experience with Cisco Nexus, ACI, or equivalent data center switching platforms supporting AI workloads.
  • Strong technical expertise with Networking and Software-Defined Networking (SDN) principles.
  • Strong technical expertise with developing and interpreting Network, Sequence, and Dataflow diagrams.
  • Understanding of at least one compliance framework (HIPAA, HITRUST, PCI, NIST, CSA).
  • Strong technical expertise in defining and implementing cyber resilience standards, policies, and programs for distributed cloud and network infrastructure, ensuring robust redundancy and system reliability.
  • Experience influencing industry standards and contributing to open-source projects or security communities, demonstrating impact beyond the immediate organization.

What the JD emphasized

  • 3+ years of experience supporting AI/ML or GPU-based environments, including NVIDIA reference architectures and performance-optimized networking for distributed training workloads (e.g., traffic flow optimization, congestion management).
  • 5+ years of experience implementing network security technologies, including Palo Alto Networks firewalls (required), NGFW, IDS/IPS, ZTNA, DLP, and micro-segmentation, with understanding of application-aware and zero trust architectures.

Other signals

  • designing and delivering scalable data center solutions that support large-scale training and inference platforms
  • high-performance network infrastructure that powers the organization’s AI and GPU-driven workloads
  • integrating networking with GPU clusters and high-performance storage systems supporting training and inference workloads