AI Software Engineer Intern

Intel · Semiconductors · Shanghai, China +2

Intern role focused on optimizing CPU kernels for AI workloads, including LLMs and multimodal models, using Intel architecture features and performance profiling tools, and on integrating custom operators into production frameworks.

What you'd actually do

  1. Design and hand-tune CPU kernels for Transformer operators (Attention, GEMM, LayerNorm, RMSNorm, RoPE, MoE, Softmax) and classical operators (Conv2D / Conv3D, Depthwise Conv, Winograd, im2col, Pooling, BatchNorm, RNN / LSTM / GRU).
  2. Develop SIMD-optimized implementations using Intel® AVX2 / AVX-512 / AMX / VNNI intrinsics, with ARM Neon / SVE as a secondary target where applicable.
  3. Apply parallelization strategies (OpenMP, TBB, thread-pool design) and exploit CPU micro-architectural features: cache blocking and tiling, NUMA affinity, prefetching, memory alignment, and false-sharing mitigation.
  4. Implement and optimize low-bit quantized kernels (INT8 / INT4 / W4A16 / W8A8) for LLM / VLM inference, leveraging Intel® AMX and VNNI for maximum throughput per watt.
  5. Integrate custom operators into production frameworks and runtimes, including Intel® oneDNN, PyTorch CPU backend, ONNX Runtime, llama.cpp, MLC-LLM, and XNNPACK.

Skills

Required

  • C / C++
  • computer architecture
  • CPU pipelines
  • cache hierarchies
  • memory models
  • SIMD execution
  • x86 SIMD intrinsics (AVX2 / AVX-512 / AMX)
  • ARM Neon / SVE intrinsics
  • OpenMP / TBB-based multi-threaded optimization
  • high-performance CPU GEMM or convolution implementation
  • performance profiling tools (Intel® VTune™ Profiler, perf)

Nice to have

  • open-source contributions to oneDNN, OpenVINO™ toolkit, llama.cpp, ggml, XNNPACK, OpenBLAS, PyTorch, or ONNX Runtime
  • CNN inference optimizations
  • LLM inference optimization techniques
  • compiler infrastructure (LLVM, MLIR, TVM)
  • auto-tuning frameworks (AutoTVM, Ansor)
  • edge or on-device deployment experience

Other signals

  • CPU kernel optimization
  • LLM/VLM inference
  • performance engineering