Inference Optimization Engineer (Local / Edge Runtime)

Intel · Semiconductors · Santa Clara, California, United States +3

This role focuses on optimizing AI inference engines (such as llama.cpp and vLLM) for constrained local and edge hardware, including GPUs, iGPUs, and Vulkan backends. The goal is to improve latency, throughput, and memory usage for interactive agent workloads by driving quantization strategy and reducing CPU overhead. This work is crucial for making hybrid, low-cost agent products viable.

What you'd actually do

  1. Profile and optimize local inference (llama.cpp-vulkan and vLLM) for latency, throughput, and memory on edge hardware
  2. Tune KV cache, continuous batching, and scheduling for interactive agent workloads
  3. Drive quantization strategy (GGUF / AWQ / GPTQ) and validate quality impact with the Post-Training team
  4. Cut CPU overhead and improve engine startup, model load, and lifecycle (start / stop / health)
  5. Benchmark across hardware tiers and publish honest performance comparisons

Skills

Required

  • BS/MS in CS, EE, Math or related STEM field
  • 5+ years of software development experience
  • Strong in C++ and/or Python
  • Understands how LLM inference works (attention, KV cache, decoding)
  • Has profiled and optimized real performance problems (CPU or GPU) and can prove the speedup
  • Linux, build systems, and low-level debugging expertise

Nice to have

  • Hands-on with llama.cpp, vLLM, ggml, or similar engines
  • Experience with GPU / accelerator programming (Vulkan, CUDA, SYCL, Metal) or SIMD / CPU kernels
  • Familiarity with quantization formats and their quality trade-offs
  • Open-source contributions to inference engines

What the JD emphasized

  • local inference
  • edge hardware
  • interactive agent workloads
  • quantization strategy
  • CPU overhead

Other signals

  • optimize inference engines
  • local and edge environments
  • making low-cost agent products viable