Network Engineer, Engineering R&D Environments

Meta · Big Tech · Menlo Park, CA

Meta is seeking a Network Engineer to build and scale network infrastructure for AI and compute lab clusters. This role involves end-to-end network design, deployment, and operations for backend fabrics supporting AI workloads, including high-throughput, low-latency cluster networking. Responsibilities include troubleshooting, supporting hardware/software bring-ups, automation, and collaborating with cross-functional teams. The role requires experience with AI/ML cluster networking and a demonstrated ability to integrate AI tools into workflows.

What you'd actually do

  1. Own end-to-end frontend and backend network design, deployment, and operations for AI and compute lab clusters
  2. Serve as a primary networking point of contact for backend fabrics, including Arista- and internally developed network OS-based scale-out networks supporting AI workloads
  3. Design, deploy, and support high-throughput, low-latency cluster networking, including congestion management (PFC/ECN), RDMA validation, and lossless transport
  4. Perform hands-on troubleshooting and root-cause analysis across L1–L4 using packet captures, telemetry, and vendor tools to resolve complex lab issues
  5. Support silicon, hardware, and software bring-ups, ensuring reliable connectivity and on-time validation

Skills

Required

  • 6+ years of experience designing, deploying, and operating network infrastructure in production or lab environments
  • Experience working in multi-vendor environments, including Arista, FBOSS-based platforms, and lab networking hardware
  • Experience with configuration management, code repositories, and zero-touch provisioning (ZTP) for network infrastructure
  • Experience with IPv4/IPv6, L2/L3 protocols, including STP, OSPF, BGP, TCP/IP, DHCP, DNS, VLANs, VRRP, LACP, MC-LAG, ACLs, MACsec, and EVPN/VXLAN
  • Working knowledge of scripting or programming languages (e.g., Python, shell) for automation and tooling
  • Demonstrated ability to operate consistently under your own initiative, seeking feedback and input where appropriate, in a global, time-critical environment while managing multiple priorities and mission-critical timelines
  • Understanding of physical infrastructure design, including structured cabling, space, power, and cooling systems
  • Networking L1 expertise in validating multi-vendor optics, with proficiency using the BCM shell and I2C utilities to troubleshoot hardware-level issues
  • Experience with network automation, CI/CD pipelines, audit frameworks, and validation tooling
  • Hands-on experience with backend cluster networking, including scale-out fabrics, RDMA networks, and congestion management
  • Experience supporting AI/ML or high-performance compute clusters in lab or pre-production environments
  • Hands-on experience with lab test equipment, optics qualification (e.g., 400G/800G), optical switches, and physical infrastructure
  • Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
  • Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
  • Hands-on experience with disaggregated networking products and software, such as Meta's open network OS (FBOSS), SONiC, Cumulus Linux, or equivalent open networking platforms

Nice to have

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Networking certifications such as CCIE, JNCIE, or equivalent
  • Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies

What the JD emphasized

  • AI workloads
  • AI and compute lab clusters
  • high-throughput, low-latency cluster networking
  • AI/ML or high-performance compute clusters
  • responsible, ethical AI practices
  • Demonstrated ongoing AI skill development
  • integrate AI tools to optimize/redesign workflows
