Senior Site Reliability Engineer, AI Inference

F5 · Enterprise · Dublin, Ireland

The AI Inference Engineer optimizes Large Language Models (LLMs) for inference across diverse environments, focusing on maximizing throughput, minimizing latency, and maintaining model accuracy. This role involves building and maintaining inference engines, optimizing models for specialized hardware, designing auto-scaling architectures, and establishing robust observability frameworks for enterprise-grade reliability.

What you'd actually do

  1. Build and maintain robust inference engines using tools like vLLM, TGI (Text Generation Inference), and NVIDIA Triton, ensuring high performance at scale.
  2. Optimize deployments to deliver low-latency AI serving for multiple business applications.
  3. Profile and optimize models for specialized hardware backends, including NVIDIA GPUs (CUDA/TensorRT), Apple Silicon (CoreML), and AI accelerators like TPUs and LPUs.
  4. Design and implement auto-scaling architectures for online (real-time) and batch inference pipelines, leveraging Kubernetes for inference routing and orchestration.
  5. Establish robust observability frameworks to monitor Time to First Token (TTFT), tokens per second, and memory bandwidth utilization against service-level agreements (SLAs).
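
To make the metrics in item 5 concrete, here is a minimal sketch of how TTFT and tokens per second could be derived from per-request timestamps and compared with an SLA. The names (`RequestTrace`, `meets_sla`) and the thresholds in the usage note are hypothetical illustrations, not taken from the posting or from any particular inference engine.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical shape)."""
    sent_at: float                # when the request was issued
    token_times: Sequence[float]  # arrival time of each generated token


def ttft(trace: RequestTrace) -> float:
    """Time to First Token: delay between sending and the first token."""
    return trace.token_times[0] - trace.sent_at


def tokens_per_second(trace: RequestTrace) -> float:
    """Decode throughput: tokens after the first, per second of decode time."""
    decode_time = trace.token_times[-1] - trace.token_times[0]
    if decode_time <= 0:
        return 0.0
    return (len(trace.token_times) - 1) / decode_time


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_sla(traces: Sequence[RequestTrace],
              ttft_p95_limit: float,
              min_tokens_per_second: float) -> bool:
    """Check p95 TTFT and the slowest request's throughput against limits."""
    p95_ttft = percentile([ttft(t) for t in traces], 95)
    slowest = min(tokens_per_second(t) for t in traces)
    return p95_ttft <= ttft_p95_limit and slowest >= min_tokens_per_second
```

For example, `meets_sla(traces, ttft_p95_limit=0.5, min_tokens_per_second=5.0)` would flag a batch of requests whose 95th-percentile TTFT exceeds half a second. In practice these values would more likely come from the serving engine's exported metrics than from hand-recorded timestamps.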

Skills

Required

  • Python
  • C++
  • Rust
  • Golang
  • vLLM
  • TensorRT
  • llama.cpp
  • Ollama
  • Docker
  • Kubernetes
  • AWS
  • GCP
  • Azure
  • NVIDIA GPUs
  • TPUs

Nice to have

  • Speculative Decoding
  • PagedAttention
  • open-source inference libraries
  • hardware-level kernel development
  • CUDA
  • Triton kernels
  • MLOps
  • SRE
  • high-throughput inference environments
  • traffic bursts

What the JD emphasized

  • low-latency AI serving solutions
  • maximize utilization and performance
  • auto-scaling architectures
  • peak performance during traffic spikes
  • robust observability frameworks
  • performance and load testing suites
  • high-performance AI workflows
  • inference development and optimization
  • high-performance AI endpoints
  • high-throughput inference environments
  • traffic bursts
  • Latency Reduction
  • Cost Efficiency
  • Scalability
  • traffic spikes
  • Throughput Maximization
  • AI optimization
  • high-performance engineering
  • real-time AI applications
  • real-time AI reliability
  • MLOps solutions
  • enterprise AI systems
  • low-latency, scalable, and high-performing AI prediction systems

Other signals

  • optimize LLMs for inference
  • maximize throughput, minimize latency
  • enterprise-grade reliability
  • hardware acceleration
  • scalable infrastructure
  • monitoring system performance