Senior SDN Development Engineer (Management Plane)

Crusoe · Data AI · San Francisco, CA - US · Cloud Engineering

This role is for a Software Engineer focused on building the management plane for a global network fabric, involving telemetry pipelines, traffic engineering logic, data modeling, API development, and performance optimization. The role requires strong systems programming skills in Go or C++, network fundamentals, and experience with modern observability and cloud-native tools. Although the company builds AI infrastructure, this role focuses on the underlying network systems, not on AI/ML model development or deployment.

What you'd actually do

  1. Build Telemetry Pipelines: Design and implement high-throughput data ingestion services using gNMI, gRPC, and P4Runtime.
  2. Develop Traffic Engineering (TE) Logic: Write the algorithms and controllers that interface with Segment Routing (SRv6/SR-MPLS) to dynamically reroute traffic based on real-time link health.
  3. Data Modeling: Architect and maintain structured data models using YANG, ensuring a "Single Source of Truth" for network state.
  4. API Ecosystem: Develop internal RESTful and gRPC APIs that allow other services to query network topology and performance metrics programmatically.
  5. Performance at Scale: Optimize Go/Python services to handle high-cardinality time-series data without introducing latency into the management loop.

Skills

Required

  • Go (Golang) or C++
  • TCP/IP stack
  • TSDBs (Prometheus, VictoriaMetrics, or InfluxDB)
  • OpenTelemetry
  • Kubernetes
  • Kafka or RabbitMQ

Nice to have

  • eBPF
  • Batfish
  • Forward Networks

What the JD emphasized

  • Build the software layer responsible for the "Closed-Loop Automation" of our global fabric
  • building the systems that ingest billions of telemetry events and programmatically steer traffic in real-time to optimize performance and availability
  • High-throughput data ingestion services
  • dynamically reroute traffic based on real-time link health
  • handle high-cardinality time-series data without introducing latency into the management loop
  • memory safety and concurrency
  • deep understanding of the TCP/IP stack
  • building or extending TSDBs
  • using OpenTelemetry