Site Reliability Engineer – UDF

F5 · Enterprise · Seattle, WA

Site Reliability Engineer (SRE) role focused on designing, deploying, and supporting Kubernetes environments for AI workloads within the Unified Demo Framework (UDF) platform, specifically for F5 Guardrails and Redteam product lines. The role emphasizes operational excellence, automation, observability, and scalability to ensure the reliability and performance of AI features and F5 products.

What you'd actually do

  1. Design, deploy, and manage Kubernetes clusters and ensure efficient container orchestration to support AI workloads.
  2. Design and implement observability pipelines for real-time monitoring of Kubernetes clusters, including metrics collection for scaling, resource utilization, and system health.
  3. Automate infrastructure management tasks to support the efficient deployment and operation of AI functionalities, including upgrades, scaling, and provisioning.
  4. Collaborate with product teams and sales engineering to integrate F5 products into the UDF platform and ensure effective utilization by the sales organization.
  5. Support root cause analysis (RCA) processes for issues affecting the UDF platform, driving long-term corrective actions to improve system reliability.

Skills

Required

  • Kubernetes orchestration
  • containerized architectures
  • AWS usage
  • Kubernetes clusters
  • containerized workloads
  • EKS
  • monitoring and observability tools
  • CloudWatch
  • Grafana
  • Fluentd
  • DataDog
  • Infrastructure-as-Code (IaC) tools
  • Terraform
  • Helm
  • CloudFormation
  • CI/CD frameworks
  • networking
  • storage
  • compute infrastructure
  • Python
  • Go
  • Bash
  • automation
  • system integration
  • security best practices
  • data protection
  • resource access controls
  • GPU-based workloads
  • optimization strategies
  • orchestrating
  • troubleshooting
  • best practices
  • optimizing complex network environments
  • AWS VPCs
  • GCP VPCs

Nice to have

  • Certified Kubernetes Administrator (CKA)
  • Certified Kubernetes Application Developer (CKAD)
  • AWS Certified Solutions Architect
  • GCP Cloud Architect certifications
  • service mesh technologies (Istio, Linkerd)
  • Kubernetes operators for machine learning workflows
  • distributed computing concepts
  • large-scale AI workloads
  • integrating observability and monitoring into pipelines for inference engines and machine learning models
  • hypervisors in GCP VPCs

What the JD emphasized

  • AI workloads
  • Kubernetes environments
  • operational excellence mindset
  • advancing the operational maturity and scalability
  • AI features
  • AI functionalities
  • AI-based workloads

Other signals

  • support AI workloads
  • optimize system performance
  • ensure reliability in production environments
  • advancing the operational maturity and scalability of the UDF platform
  • incorporate new F5 product lines and features