LLM AIOps Development Engineer - Data Center Networking

ByteDance · Big Tech · Seattle, WA · R&D

Develops an AIOps platform for data center networking: builds an intelligent diagnostics system, explores LLM/Agent applications for operations, and establishes capacity and risk prediction. Integrates streaming telemetry and applies ML/DL to anomaly detection and root cause analysis.

What you'd actually do

  1. Build a Panoramic Network Observability Platform: Develop a streaming telemetry data pipeline for both physical and virtual networks, integrating multi-source data from gNMI, NETCONF, IPFIX/NetFlow, and SNMP to provide a high-quality, real-time data foundation for AIOps.
  2. Develop an Intelligent Diagnostics and Root Cause Analysis System: Apply machine learning and deep learning algorithms to perform anomaly detection, correlation analysis, and intelligent noise reduction on massive volumes of network metrics, logs, and events. Swiftly pinpoint root causes of failures across the entire stack, from optical transceivers and switch hardware to protocol adjacencies and application traffic.
  3. Explore Innovative Applications of LLMs and Agents:
     • Intelligent Operations Assistant: Build a conversational chatbot powered by Retrieval-Augmented Generation (RAG) that understands natural language queries, automatically queries knowledge bases and monitoring data, and provides precise troubleshooting guidance and network status reports.
     • Automated Remediation and Smart Runbooks: Train operational Agents to safely and controllably invoke network change tools and APIs. Empower them to autonomously generate, recommend, or even execute remediation plans and emergency runbooks based on their understanding of failure scenarios.
  4. Establish Capacity and Risk Prediction Capabilities: Forecast network capacity bottlenecks, high-risk links, and "sub-healthy" devices based on historical data and business growth models, enabling proactive scaling and preventative maintenance.

Skills

Required

  • Computer Science Fundamentals
  • Networking Fundamentals
  • Data Center Network Architectures
  • EVPN/VXLAN
  • BGP/OSPF
  • Linux Network Stack
  • Golang
  • Python
  • System Design
  • Microservices
  • Containerization (Docker/Kubernetes)
  • CI/CD
  • Big Data Processing (Kafka, Flink, ClickHouse/TSDB)
  • Real-time Data Pipelines
  • Analytics Systems
  • Observability Technologies (Prometheus/OpenTelemetry)
  • Graph Databases (Neo4j)
  • Alert and Event Platforms
  • Large Models
  • Agent Technologies
  • RAG
  • Tool Use
  • Safety Evaluation

Nice to have

  • Hyperscale Data Center Networks
  • LLM/Agent-based intelligent operations project
  • Open-source contributions (SONiC, P4/PINS, eBPF, Prometheus, OpenTelemetry)
  • High-performance networking (RDMA/RoCE)
  • SmartNICs (NIC Offload)
  • DPDK/eBPF
  • Network configuration and control systems (SONiC, gNMI, NETCONF)

What the JD emphasized

  • deep understanding of data center network architectures
  • proficiency in key protocols such as EVPN/VXLAN and BGP/OSPF
  • In-depth knowledge of the Linux network stack is essential
  • Mastery of Golang or Python with outstanding coding and system design abilities
  • practical experience in one or more of the following areas is highly desirable: Big Data Processing, Observability Technologies
  • A keen interest in the latest advancements in Large Models and Agent technologies
  • Proven experience leading or making significant contributions to an LLM/Agent-based intelligent operations project with measurable business impact

Other signals

  • LLM
  • Agents
  • RAG
  • Observability
  • Networking