AI Software Engineer Intern

Intel · Semiconductors · Shanghai, China +2

This role focuses on building and optimizing a next-generation LLM inference system, spanning model optimization, the inference runtime, and system-level design. It combines research and engineering: you implement and optimize core techniques across the stack (model, kernels, runtime, distributed systems), with a key focus on GPU kernel and runtime optimization for an end-to-end AI rack software system for LLM inference.

What you'd actually do

  1. Study cutting-edge work (LLM inference, MoE, system optimization)
  2. Implement and optimize core techniques
  3. Work across the stack: model → kernels → runtime → distributed system
  4. Develop and optimize GPU kernels using modern approaches
  5. Build and optimize a full inference stack

Skills

Required

  • Master’s or PhD student
  • Python
  • PyTorch
  • Transformer models
  • Algorithms
  • Systems

Nice to have

  • GPU programming (CUDA, Triton, or similar)
  • LLM inference frameworks (vLLM, TensorRT-LLM, FasterTransformer)
  • Distributed systems or parallel computing
  • GPU architecture and performance profiling
  • Quantization or model optimization
  • MoE or large-scale model systems

What the JD emphasized

  • next-generation LLM inference system
  • model optimization
  • inference runtime
  • system-level design
  • GPU kernel and runtime optimization
  • end-to-end AI rack software system
  • working, optimized implementations
  • latency, throughput, and GPU utilization
  • efficient inference for sparse models
  • low-level frameworks
  • tensor workloads
  • full inference stack
  • Multi-GPU / multi-node scaling
  • paper → implementation → optimization
  • performance and system-level problems
  • deep technical challenges
  • how LLM systems actually run at scale
