Member of Technical Staff - Inference

xAI · AI Frontier · Palo Alto, CA · Infrastructure

The role focuses on designing and optimizing large-scale model serving systems for high-performance inference, ensuring speed and reliability for billions of users. Responsibilities include architecting distributed infrastructure, optimizing latency and throughput, building high-concurrency systems, and accelerating inference engines.

What you'd actually do

  1. Architect and implement scalable distributed infrastructure for model serving (load balancing, auto-scaling, batch scheduling, global KV cache); a continuous-batching sketch follows this list.
  2. Optimize latency and throughput of model inference under real production workloads.
  3. Build reliable, high-concurrency serving systems that sustain billions of users with 100% uptime, 0% error rate, and excellent tail latency.
  4. Benchmark, fine-tune, and accelerate inference engines (including low-level GPU kernel work and code generation).
  5. Develop custom tools to trace, replay, and fix issues across the full stack — from orchestration down to GPU kernels.
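
As a rough illustration of the batch-scheduling side of item 1, here is a minimal continuous-batching loop in the style engines like vLLM use: new requests are admitted up to a KV-cache token budget, every running request advances one token per forward pass, and finished requests free their slots immediately. The `Request` shape, `engine.decode_step`, and `eos_token_id` are hypothetical placeholders, not xAI's actual stack.

```python
import collections
from dataclasses import dataclass, field

@dataclass
class Request:
    """A hypothetical in-flight generation request."""
    request_id: str
    prompt_tokens: list[int]
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)

class ContinuousBatcher:
    """Minimal continuous-batching loop: admit waiting requests up to a
    token budget, run one decode step for the whole batch, and retire
    finished requests so their slots free up on the very next step."""

    def __init__(self, engine, max_batch_tokens: int = 8192):
        self.engine = engine  # assumed to expose decode_step(batch) and eos_token_id
        self.max_batch_tokens = max_batch_tokens
        self.waiting: collections.deque[Request] = collections.deque()
        self.running: list[Request] = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def _batch_tokens(self) -> int:
        # Tokens currently held in the batch; a stand-in for KV-cache usage.
        return sum(len(r.prompt_tokens) + len(r.generated) for r in self.running)

    def step(self) -> list[Request]:
        # Admit waiting requests while the token budget allows.
        while self.waiting:
            nxt = self.waiting[0]
            if self._batch_tokens() + len(nxt.prompt_tokens) > self.max_batch_tokens:
                break
            self.running.append(self.waiting.popleft())

        if not self.running:
            return []

        # One forward pass yields one new token per running request.
        new_tokens = self.engine.decode_step(self.running)
        finished, still_running = [], []
        for req, tok in zip(self.running, new_tokens):
            req.generated.append(tok)
            if tok == self.engine.eos_token_id or len(req.generated) >= req.max_new_tokens:
                finished.append(req)      # slot is reclaimed this step
            else:
                still_running.append(req)
        self.running = still_running
        return finished
```

The token budget here stands in for KV-cache capacity; a production scheduler would also handle preemption, prefill/decode separation, and priority classes.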

Skills

Required

  • Deep low-level systems programming (C/C++ or Rust).
  • Experience with large-scale, high-concurrency production serving.
  • Experience with GPU inference engines (vLLM, SGLang, Triton, TensorRT-LLM, etc.).
  • Strong background in system optimizations: batching, caching, load balancing, parallelism.
  • Low-level inference optimizations: GPU kernels, code generation.
  • Algorithmic inference optimizations: quantization, speculative decoding, distillation, low-precision numerics (a speculative-decoding sketch follows this list).
  • Experience with testing, benchmarking, and reliability of inference services (a tail-latency benchmark sketch also follows).
  • Experience designing and implementing CI/CD infrastructure for inference.
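
Since the list above names speculative decoding, here is a minimal sketch of the greedy-verification variant: a cheap draft model proposes k tokens, and the target model verifies the whole span in one forward pass, keeping the longest agreeing prefix plus one corrected token. Both `draft_logits_fn` and `target_logits_fn` are hypothetical callables (token list in, per-position next-token logits out); real implementations verify against sampled distributions rather than argmax.

```python
import numpy as np

def speculative_decode_step(target_logits_fn, draft_logits_fn, tokens, k=4):
    """One round of speculative decoding with greedy verification.

    Hypothetical interfaces: each *_logits_fn maps a token list to
    next-token logits of shape (len(tokens), vocab_size).
    """
    # 1. Draft phase: the cheap model proposes k tokens autoregressively.
    draft = list(tokens)
    proposed = []
    for _ in range(k):
        logits = draft_logits_fn(draft)
        tok = int(np.argmax(logits[-1]))
        proposed.append(tok)
        draft.append(tok)

    # 2. Verify phase: one target forward pass over prompt + proposal.
    logits = target_logits_fn(tokens + proposed)
    start = len(tokens) - 1  # logits[start + i] scores proposed[i]
    accepted = []
    for i, tok in enumerate(proposed):
        target_tok = int(np.argmax(logits[start + i]))
        if target_tok == tok:
            accepted.append(tok)          # draft and target agree: keep it
        else:
            accepted.append(target_tok)   # disagreement: take target's token, stop
            break
    else:
        # All k accepted: the same pass yields one bonus token for free.
        accepted.append(int(np.argmax(logits[-1])))
    return tokens + accepted
```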
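
And because the JD stresses tail latency, a skeleton of the kind of benchmark that surfaces it: drive the endpoint at fixed concurrency and report nearest-rank percentiles rather than the mean. `send_request` is a hypothetical async callable standing in for one end-to-end inference call.

```python
import asyncio
import math
import time

async def measure_tail_latency(send_request, num_requests=1000, concurrency=64):
    """Issue num_requests calls at bounded concurrency; report p50/p99/max."""
    sem = asyncio.Semaphore(concurrency)
    latencies = []

    async def one_call():
        async with sem:
            t0 = time.perf_counter()
            await send_request()
            latencies.append(time.perf_counter() - t0)

    await asyncio.gather(*(one_call() for _ in range(num_requests)))

    latencies.sort()
    def pct(p):  # nearest-rank percentile
        return latencies[min(len(latencies) - 1,
                             math.ceil(p / 100 * len(latencies)) - 1)]

    return {"p50": pct(50), "p99": pct(99), "max": latencies[-1]}
```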

What the JD emphasized

  • lightning speed
  • perfect reliability
  • 100% uptime
  • 0% error rate
  • excellent tail latency

Other signals

  • high-performance inference platform
  • large-scale model serving systems
  • massive scale