AI Inference Engineer

F5 · Enterprise · Dublin, Ireland

The AI Inference Engineer optimizes Large Language Models (LLMs) for inference across a range of environments, with a focus on maximizing throughput, minimizing latency, and maintaining accuracy. The role involves building and maintaining inference engines, optimizing for hardware acceleration, designing auto-scaling architectures, and establishing performance monitoring frameworks.

What you'd actually do

  1. Build and maintain robust inference engines using tools like vLLM, TGI (Text Generation Inference), and NVIDIA Triton, ensuring high performance at scale.
  2. Profile and optimize models for specialized hardware backends, including NVIDIA GPUs (CUDA/TensorRT), Apple Silicon (Core ML), and AI accelerators like TPUs and LPUs.
  3. Design and implement auto-scaling architectures for online (real-time) and batch inference pipelines, leveraging Kubernetes for inference routing and orchestration.
  4. Establish robust observability frameworks to monitor Time to First Token (TTFT), tokens per second, and memory bandwidth utilization against service-level agreements (SLAs).
  5. Build and execute performance and load testing suites to identify bottlenecks and ensure consistent reliability at scale.
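
For readers unfamiliar with the metrics named in item 4, here is a minimal sketch of how Time to First Token (TTFT) and decode tokens per second could be derived from per-token arrival timestamps for one request. The names `LatencyStats`, `summarize_request`, and `meets_sla`, and the threshold values in the usage note, are hypothetical and chosen for illustration only; they do not come from the posting or from any particular serving framework.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyStats:
    ttft_s: float               # seconds from request send to first token
    decode_tokens_per_s: float  # tokens per second after the first token


def summarize_request(request_sent_s: float,
                      token_times_s: Sequence[float]) -> LatencyStats:
    """Compute TTFT and decode throughput for one streamed response.

    Hypothetical helper: token_times_s holds the arrival time of each
    generated token, in seconds, on the same clock as request_sent_s.
    """
    if not token_times_s:
        raise ValueError("no tokens were generated")
    ttft = token_times_s[0] - request_sent_s
    # Throughput is measured over the decode phase only, so the first
    # token (dominated by prefill) is excluded from the count.
    decode_window = token_times_s[-1] - token_times_s[0]
    decoded = len(token_times_s) - 1
    rate = decoded / decode_window if decode_window > 0 else 0.0
    return LatencyStats(ttft_s=ttft, decode_tokens_per_s=rate)


def meets_sla(stats: LatencyStats,
              max_ttft_s: float,
              min_tokens_per_s: float) -> bool:
    """Check one request against assumed (illustrative) SLA thresholds."""
    return (stats.ttft_s <= max_ttft_s
            and stats.decode_tokens_per_s >= min_tokens_per_s)
```

As a usage example, `summarize_request(0.0, [0.25, 0.5, 0.75, 1.0, 1.25])` yields a TTFT of 0.25 s and 4 decoded tokens over a 1.0 s window, which `meets_sla(stats, max_ttft_s=0.5, min_tokens_per_s=3.0)` would accept. In practice these figures would be aggregated into percentiles across many requests rather than judged one request at a time.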

Skills

Required

  • Python
  • C++
  • Rust
  • Golang
  • vLLM
  • TensorRT
  • llama.cpp
  • Ollama
  • Docker
  • Kubernetes
  • AWS
  • GCP
  • Azure
  • NVIDIA GPUs
  • TPUs

Nice to have

  • Speculative Decoding
  • PagedAttention
  • open-source inference libraries
  • hardware-level kernel development
  • CUDA
  • Triton kernels
  • MLOps
  • SRE
  • high-performance AI endpoints
  • reliability during demand surges
  • high-throughput inference environments
  • traffic bursts

What the JD emphasized

  • low-latency AI serving solutions
  • high-performance AI workflows
  • high-performance AI endpoints
  • high-throughput inference environments
  • low-latency, scalable, and high-performing AI prediction systems

Other signals

  • Optimizing LLMs for inference
  • Maximizing throughput and minimizing latency
  • Deploying to diverse environments (data center to edge)