Network Engineer, Capacity and Efficiency

Anthropic · AI Frontier · San Francisco, CA · Compute

This role focuses on optimizing the network infrastructure that supports AI model training and inference. The engineer will build and manage the network observability stack, analyze traffic patterns for efficiency, own QoS and traffic engineering, and drive cost attribution for network spend. The goal is to ensure efficient data movement for AI workloads.

What you'd actually do

  1. Design and deploy telemetry pipelines — sFlow/IPFIX, gNMI streaming, eBPF host probes — that turn packet counters into per-flow, per-tenant, per-workload cost and utilization data. Own the SLIs for backbone and DCN fabric health.
  2. Analyze inter-region traffic patterns, identify hot links and stranded capacity, and quantify the dollar impact. Build the models that tell us whether we should buy more capacity or move the workload.
  3. Design and operate traffic classification, marking, and shaping across the backbone. Make sure bulk checkpoint transfers don’t starve latency-sensitive inference, and that we’re not paying premium cross-region rates for traffic that could take the cheap path.
  4. Tie network spend — egress, interconnect ports, transit, optical leases — back to the teams and workloads that generate it. Make network cost a first-class input to capacity planning and workload placement decisions.
  5. Convince other teams to act on what your data shows: make the case to research that a traffic pattern needs to change, to finance that an interconnect tranche is worth buying, and to Systems Networking that a QoS policy needs rewriting. You'll partner closely with Systems Networking on fabric architecture and with Observability on telemetry platform integration, but the cost and efficiency wins will come from moving teams that don't report to you.
  6. Extend our intent-based network configuration systems and write the tooling that turns your efficiency findings into safe, reviewable, and impactful changes.

Skills

Required

  • 5+ years operating large-scale production networks (data center fabrics, backbone/WAN, or hyperscaler-adjacent)
  • BGP (policy and communities)
  • ECMP
  • VXLAN/EVPN or equivalent overlays
  • QoS (DSCP, queuing, shaping)
  • L1/optical basics (DWDM, coherent, LAGs)
  • Deep knowledge of at least one major CSP’s networking model (AWS or GCP)
  • Experience with network telemetry at scale (streaming telemetry, flow export, or eBPF-based host-side instrumentation)
  • Proficiency in Python or Go for building tooling and automation
  • Quantitative analysis and cost modeling
  • Clear communication skills

Nice to have

  • SRE experience for large-scale network infrastructure
  • Background on a cloud provider's networking team or product team

What the JD emphasized

  • network telemetry
  • observability
  • cost modeling
  • traffic engineering
  • Python
  • Go
  • 5+ years operating large-scale production networks
  • BGP
  • VXLAN/EVPN
  • QoS
  • eBPF