LLM AIOps Development Engineer - Data Center Networking

ByteDance · Big Tech · San Jose, CA · R&D

Develops and implements an AIOps platform for data center networking, leveraging LLMs and agents for intelligent diagnostics, automated remediation, and predictive capabilities. Focuses on building a panoramic network observability platform and applying ML/DL for anomaly detection and root cause analysis.

What you'd actually do

  1. Develop a streaming telemetry data pipeline for both physical and virtual networks, integrating multi-source data from gNMI, NETCONF, IPFIX/NetFlow, and SNMP to provide a high-quality, real-time data foundation for AIOps.
  2. Apply machine learning and deep learning algorithms to perform anomaly detection, correlation analysis, and intelligent noise reduction on massive volumes of network metrics, logs, and events.
  3. Build a conversational chatbot powered by Retrieval-Augmented Generation (RAG) that understands natural language queries, automatically queries knowledge bases and monitoring data, and provides precise troubleshooting guidance and network status reports.
  4. Train operational agents to invoke network change tools and APIs in a safe, controlled manner.
  5. Forecast network capacity bottlenecks, high-risk links, and "sub-healthy" devices based on historical data and business growth models, enabling proactive scaling and preventative maintenance.

Skills

Required

  • Computer Science Fundamentals
  • Networking Fundamentals
  • Data Center Network Architectures
  • EVPN/VXLAN
  • BGP/OSPF
  • Linux network stack
  • Golang or Python
  • System Design
  • Microservices
  • Containerization (Docker/Kubernetes)
  • CI/CD
  • Kafka
  • Flink
  • ClickHouse/TSDB
  • Real-time data pipelines
  • Prometheus/OpenTelemetry
  • Graph databases (e.g., Neo4j)
  • Alert and event platforms
  • Large Models
  • Agent technologies
  • RAG
  • Tool use
  • Safety evaluation

Nice to have

  • Hyperscale data center networks
  • LLM/Agent-based intelligent operations project
  • SONiC
  • P4/PINS
  • eBPF
  • Prometheus
  • OpenTelemetry
  • RDMA/RoCE
  • SmartNICs (NIC Offload)
  • DPDK
  • Network configuration and control systems

What the JD emphasized

  • deep networking expertise
  • innovative AIOps capabilities
  • autonomous data center networks
  • intelligent ecosystem with predictive and self-healing capabilities
  • real-time data foundation for AIOps
  • massive volumes of network metrics, logs, and events
  • Intelligent Operations Assistant
  • Automated Remediation and Smart Runbooks
  • highly available and scalable AIOps platform
  • online inference and automated closed-loop actions
  • Solid Fundamentals in Computer Science and Networking
  • Excellent Software Engineering Skills
  • Rich Platform Development Experience
  • A Passion for AIOps/ML/LLM Practices
  • keen interest in the latest advancements in Large Models and Agent technologies
  • hands-on experience in their application to operations
  • Proven experience leading or making significant contributions to an LLM/Agent-based intelligent operations project with measurable business impact

Other signals

  • LLM
  • Agents
  • AIOps
  • Observability
  • RAG
  • Automated Remediation