AI Software Engineer Intern

Intel · Semiconductors · Shanghai, China +2

Internship role focused on applied research and productization of Vision-Language Models (VLM) and Vision-Language-Action (VLA) models, including pre-training, fine-tuning, alignment, data pipelines, fusion strategies, action components, and model optimization for efficient deployment on Intel hardware. The role involves evaluating models and potentially publishing results.

What you'd actually do

  1. Conduct applied research on VLM / VLA architectures, pre-training, fine-tuning, and alignment techniques (SFT, RLHF, DPO, GRPO); reproduce and extend recent work such as OpenVLA, RT-2, π0.5, PaliGemma, Qwen-VL, and InternVL.
  2. Design and implement multi-modal data pipelines for cleaning, synthesis, and augmentation of image-text-action datasets.
  3. Investigate efficient fusion strategies between vision encoders (ViT, SigLIP, DINOv2) and language backbones (LLaMA, Qwen, Mistral), including connector design and visual token compression.
  4. Explore VLA-specific components such as action heads (discrete tokenization, diffusion policy, flow matching), long-horizon planning, and closed-loop control.
  5. Apply model optimization techniques — quantization (INT8 / FP8 / INT4, AWQ, GPTQ, SmoothQuant), pruning, distillation, KV-cache optimization, and speculative decoding — to enable efficient deployment on Intel platforms.

Skills

Required

  • MS or PhD in Computer Science, Electrical Engineering, Artificial Intelligence, Mathematics, or a related technical field
  • Python
  • PyTorch
  • Deep learning fundamentals
  • Transformers
  • Diffusion models
  • Reinforcement learning
  • Distributed training (DeepSpeed, FSDP, Megatron-LM)
  • Multi-modal large models
  • Embodied AI / robot learning (LeRobot)
  • Model compression and inference acceleration
  • Vision-language pre-training

Nice to have

  • Model quantization
  • Pruning
  • Distillation
  • Inference frameworks (PyTorch, vLLM, SGLang, TensorRT-LLM, llama.cpp)
  • Robotics simulation environments (Isaac Sim, MuJoCo, ManiSkill, RoboCasa)
  • Real-robot systems
  • Open-source contributions

What the JD emphasized

  • multi-modal intelligence
  • Vision-Language Models (VLM)
  • Vision-Language-Action (VLA) models
  • embodied AI
  • efficient deployment on Intel hardware
  • Python and PyTorch
  • Transformers, diffusion models, and reinforcement learning
