AI Software Engineer Intern

at Intel · Industrial · Shanghai, China (+2 locations: Beijing, Shenzhen)

This role centers on building a next-generation LLM inference system spanning model optimization, the inference runtime, and system-level design. It combines research and engineering: implementing and optimizing core techniques across the stack (model → kernels → runtime → distributed systems), with a particular focus on GPU kernel and runtime optimization for an end-to-end AI rack software system for LLM inference.

What you'd actually do

  1. Study cutting-edge work (LLM inference, MoE, system optimization)
  2. Implement and optimize core techniques
  3. Work across the stack: model → kernels → runtime → distributed system
  4. Develop and optimize GPU kernels using modern approaches
  5. Build and optimize a full inference stack

Skills

Required

  • Master’s or PhD student
  • Python
  • PyTorch
  • Transformer models
  • Algorithms
  • Systems

Nice to have

  • GPU programming (CUDA, Triton, or similar)
  • LLM inference frameworks (vLLM, TensorRT-LLM, FasterTransformer)
  • Distributed systems or parallel computing
  • GPU architecture and performance profiling
  • Quantization or model optimization
  • MoE or large-scale model systems

What the JD emphasized

  • next-generation LLM inference system
  • model optimization
  • inference runtime
  • system-level design
  • GPU kernel and runtime optimization
  • end-to-end AI rack software system
  • working, optimized implementations
  • latency, throughput, and GPU utilization
  • efficient inference for sparse models
  • low-level frameworks
  • tensor workloads
  • full inference stack
  • Multi-GPU / multi-node scaling
  • paper → implementation → optimization
  • performance and system-level problems
  • deep technical challenges
  • how LLM systems actually run at scale

Job Details:

Job Description:

We are building a next-generation LLM inference system, spanning model optimization, inference runtime, and system-level design.

This is a research + engineering role where you will:

  • Study cutting-edge work (LLM inference, MoE, system optimization)
  • Implement and optimize core techniques
  • Work across the stack: model → kernels → runtime → distributed system

A key focus is GPU kernel and runtime optimization, including exploring Triton-like programming models and compiler approaches, as part of building an end-to-end AI rack software system for LLM inference.

Key Responsibilities

1. Research & Prototyping

  • Read and reproduce state-of-the-art work (LLM inference, MoE, systems)
  • Translate ideas into working, optimized implementations
  • Identify bottlenecks and iterate beyond baseline performance (see the profiling sketch below)
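
To make the bottleneck-hunting step concrete, here is a minimal PyTorch profiling sketch. The model and input are placeholders invented for the example, not anything specified by this posting:

```python
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a real model
x = torch.randn(8, 4096, device="cuda")      # stand-in input batch

# Trace CPU + CUDA activity for one forward pass, then rank ops by GPU
# time, which is where kernel-level bottlenecks usually show up first.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```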

2. LLM Inference Optimization

  • Implement and evaluate techniques such as:

    • Continuous / dynamic batching
    • KV cache optimization and memory management (a toy paged-cache sketch follows this list)
    • Speculative decoding
    • Flash / paged attention
    • Quantization (INT8 / FP8 / low-bit)
  • Optimize for latency, throughput, and GPU utilization
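
To give one of these items some texture, below is a toy paged KV-cache allocator in plain PyTorch. It only illustrates the block-table idea behind paged attention; the block size, shapes, and function names are invented for the sketch, and production systems such as vLLM do far more:

```python
import torch

BLOCK_SIZE = 16            # tokens per physical block (assumption)
NUM_BLOCKS = 1024          # physical blocks in the shared pool
NUM_HEADS, HEAD_DIM = 8, 128

# One shared physical pool instead of a contiguous buffer per sequence.
kv_pool = torch.empty(NUM_BLOCKS, 2, NUM_HEADS, BLOCK_SIZE, HEAD_DIM)
free_blocks = list(range(NUM_BLOCKS))
block_tables: dict[int, list[int]] = {}   # seq_id -> physical block ids

def append_token(seq_id: int, k: torch.Tensor, v: torch.Tensor, pos: int) -> None:
    """Write one token's K/V (each (NUM_HEADS, HEAD_DIM)) at position pos."""
    table = block_tables.setdefault(seq_id, [])
    if pos % BLOCK_SIZE == 0:               # crossed a block boundary
        table.append(free_blocks.pop())     # allocate a block on demand
    block, offset = table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
    kv_pool[block, 0, :, offset] = k
    kv_pool[block, 1, :, offset] = v

def free_sequence(seq_id: int) -> None:
    """Return a finished sequence's blocks to the pool; no fragmentation."""
    free_blocks.extend(block_tables.pop(seq_id, []))
```

The design point is that sequences grow in fixed-size blocks drawn from one shared pool, so memory is reclaimed exactly when a request finishes and fragmentation stays bounded.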

3. MoE (Mixture-of-Experts) Systems

  • Explore efficient inference for sparse models:

    • Routing strategies and load balancing (a toy router sketch follows this list)
    • Expert parallelism and sharding
    • Communication vs computation trade-offs
  • Improve scalability and efficiency of MoE inference
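
As a minimal illustration of the routing bullet, here is a toy top-k router in PyTorch. The expert count, shapes, and load statistic are made up for the sketch; real systems pair this with a load-balancing loss and expert-parallel dispatch:

```python
import torch

def route(hidden: torch.Tensor, gate_w: torch.Tensor, k: int = 2):
    """hidden: (tokens, d_model); gate_w: (d_model, num_experts)."""
    probs = torch.softmax(hidden @ gate_w, dim=-1)   # gate distribution
    topk_p, topk_idx = probs.topk(k, dim=-1)         # k experts per token
    # Fraction of token slots landing on each expert: the quantity a
    # load-balancing loss flattens and expert parallelism must shard well.
    num_experts = gate_w.shape[1]
    load = torch.zeros(num_experts).scatter_add_(
        0, topk_idx.flatten(), torch.ones(topk_idx.numel()))
    weights = topk_p / topk_p.sum(dim=-1, keepdim=True)
    return topk_idx, weights, load / load.sum()

idx, wts, load = route(torch.randn(10, 64), torch.randn(64, 8))  # 8 experts
```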

4. Kernel & Runtime Optimization

  • Develop and optimize GPU kernels using modern approaches:

    • Triton-like programming models
    • CUDA or equivalent low-level frameworks
  • Investigate:

    • Memory access patterns and layout optimization
    • Operator fusion and kernel efficiency (a minimal fused-kernel sketch follows this list)
    • Compiler-style optimization for tensor workloads
  • Compare different kernel/runtime strategies and integrate into the system
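
Since the posting calls out Triton-like programming models and operator fusion, below is a minimal Triton kernel that fuses a bias-add and ReLU into a single memory pass. All shapes and names are invented for the example; it assumes a CUDA device and the triton package:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def bias_relu_kernel(x_ptr, b_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    row = tl.program_id(0)                       # one program per row
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * n_cols + offs, mask=mask)
    b = tl.load(b_ptr + offs, mask=mask)
    # Fused bias-add + ReLU: one read and one write per element.
    tl.store(out_ptr + row * n_cols + offs, tl.maximum(x + b, 0.0), mask=mask)

x = torch.randn(64, 512, device="cuda")
b = torch.randn(512, device="cuda")
out = torch.empty_like(x)
bias_relu_kernel[(x.shape[0],)](x, b, out, x.shape[1], BLOCK=512)
assert torch.allclose(out, torch.relu(x + b))
```

Fusing the two elementwise ops means each value is read and written once instead of twice, which is the same memory-traffic argument that motivates larger fusions like flash attention.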

5. End-to-End Inference System Development

  • Build and optimize a full inference stack:

    • Model execution layer (vLLM, TensorRT-LLM, or similar)
    • Runtime scheduling and batching
    • Distributed inference across GPUs/nodes
  • Work on:

    • Multi-GPU / multi-node scaling
    • NCCL / communication optimization (see the all-reduce sketch after this list)
    • System-level performance tuning
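
For the multi-GPU bullets, here is a minimal torch.distributed sketch of the NCCL all-reduce that underlies tensor-parallel inference. It assumes a single node with one process per GPU; the launch command and script name are illustrative:

```python
# Launch (illustrative): torchrun --nproc_per_node=2 allreduce_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # torchrun supplies rank/world size
rank = dist.get_rank()
torch.cuda.set_device(rank)               # one process per local GPU

# Each rank holds a partial result (e.g., one shard's matmul output) ...
partial = torch.full((4,), float(rank), device="cuda")
# ... and all-reduce sums the shards in place across every GPU.
dist.all_reduce(partial, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {partial.tolist()}")

dist.destroy_process_group()
```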

Qualifications:

Basic Requirements

  • Master’s or PhD student required (CS, EE, or related field)
  • Strong programming skills (Python required)
  • Familiar with PyTorch and transformer models
  • Solid fundamentals in algorithms and systems
  • Available for at least 6 months (shorter durations are not considered)

Preferred Experience

  • Experience with one or more:

    • GPU programming (CUDA, Triton, or similar)
    • LLM inference frameworks (vLLM, TensorRT-LLM, FasterTransformer)
    • Distributed systems or parallel computing
  • Knowledge of:

    • GPU architecture and performance profiling
    • Quantization or model optimization
    • MoE or large-scale model systems

What We Look For

  • Ability to go from paper → implementation → optimization
  • Strong interest in performance and system-level problems
  • Fast execution and willingness to work on deep technical challenges
  • Curiosity about how LLM systems actually run at scale

Job Type:

Student / Intern

Shift:

Shift 1 (China)

Primary Location:

PRC, Shanghai

Additional Locations:

PRC, Beijing; PRC, Shenzhen

Business group:

The Sales and Marketing Group (SMG) leverages the product portfolio to drive Intel's revenue growth and market expansion, blending strategic initiatives with dynamic sales efforts to capture and retain customers. SMG is responsible for empowering the sales force with tools and insights needed to close deals and build lasting customer relationships. Sales analytics and market research ensure strategies are both targeted and impactful. In SMG, disciplined execution, creativity, and ambition are celebrated, providing ample opportunities for career advancement and skill development.

Posting Statement:

All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, sex, national origin, ancestry, age, physical or mental disability, medical condition, genetic information, military and veteran status, marital status, pregnancy, gender, gender expression, gender identity, sexual orientation, or any other characteristic protected by local law, regulation, or ordinance.

Position of Trust

N/A

Work Model for this Role

This role will require an on-site presence. *Job posting details (such as work model, location, or time type) are subject to change.

  • ADDITIONAL INFORMATION: Intel is committed to Responsible Business Alliance (RBA) compliance and ethical hiring practices. We do not charge any fees during our hiring process. Candidates should never be required to pay recruitment fees, medical examination fees, or any other charges as a condition of employment. If you are asked to pay any fees during our hiring process, please report this immediately to your recruiter.