Principal SRE (Networking) - Platform Control Plane

Elastic · Enterprise · United States · Platform - Cross Team

Principal SRE role focused on networking infrastructure for Elastic's multi-cloud platform, supporting their Search AI services. The role involves designing, building, and automating network infrastructure, focusing on reliability, security, and scaling.

What you'd actually do

  1. Taking an engineering approach to leading technical initiatives that design, build, and automate network infrastructure and services, to guarantee the reliability of the global Elastic network infrastructure. The focus is on Layers 2, 3, and 4 of the TCP/IP stack (Ethernet and/or IP encapsulation, routing, firewalling, load balancing).
  2. Growing the global Platform network infrastructure to meet increasing scaling demands by developing and maintaining software, codebases, tooling, and automation in line with the team's Network Infrastructure as Code principle.
  3. Collaborating in an inclusive environment and focusing on operational excellence that uplifts others.
  4. Preventing repeated customer impact through major incident response and prioritised problem management. The on-call rotation is well distributed, and the team also addresses complex customer concerns.

Skills

Required

  • networking skills
  • IP/IPv6
  • TCP/UDP
  • BGP
  • DNS
  • building and automating networks
  • Terraform
  • Ansible
  • public CSP network components
  • load balancers
  • VPC peering/transit gateways
  • VPN connectivity
  • Direct Connects
  • Site Reliability Engineering experience
  • Linux
  • distributed systems

Nice to have

  • SaaS product operation
  • Infrastructure-as-Code
  • Crossplane
  • dynamic routing
  • BGP
  • software routers
  • IP address management (IPAM)
  • overlay networks
  • encapsulation protocols
  • IPSec
  • GRE
  • VXLAN
  • Kubernetes-at-scale infrastructure
  • Cilium CNI
  • Golang
  • containerized services
  • Docker
  • alerting
  • major incident management
  • Elastic Stack
  • Graphite
  • Prometheus
  • Influx
  • system and network administration
  • self-organizing and sharing
  • globally distributed team environment
  • coaching and mentoring

What the JD emphasized

  • reliability
  • operational excellence
  • major incidents
  • problem management
  • customer-first approach