Job Details:

Job Description:

Intel's Data Center Network Edge AI team is responsible for delivering best-in-class AI performance on Intel® architecture. From hyperscale data centers powered by Intel® Xeon® processors to network edge nodes, our performance engineers shape the inner loops of frameworks and operator libraries that millions of developers and customers rely on every day.

We are seeking an intern to join our CPU performance engineering team and drive operator-level optimizations for modern AI workloads, including Transformer-based LLMs, VLM / VLA multimodal models, classical CNNs, and MLPs. You will design, implement, and tune high-performance CPU kernels that translate Intel architectural advantages (AVX-512, Intel® AMX, and VNNI) into measurable end-user value.

Responsibilities

Design and hand-tune CPU kernels for Transformer operators (Attention, GEMM, LayerNorm, RMSNorm, RoPE, MoE, Softmax) and classical operators (Conv2D / Conv3D, Depthwise Conv, Winograd, im2col, Pooling, BatchNorm, RNN / LSTM / GRU).
Develop SIMD-optimized implementations using Intel® AVX2 / AVX-512 / AMX / VNNI intrinsics, with ARM Neon / SVE as a secondary target where applicable.
Apply parallelization strategies (OpenMP, TBB, thread-pool design) and exploit CPU micro-architectural features: cache blocking and tiling, NUMA affinity, prefetching, memory alignment, and false-sharing mitigation.
Implement and optimize low-bit quantized kernels (INT8 / INT4 / W4A16 / W8A8) for LLM / VLM inference, leveraging Intel® AMX and VNNI for maximum throughput per watt.
Integrate custom operators into production frameworks and runtimes, including Intel® oneDNN, PyTorch CPU backend, ONNX Runtime, llama.cpp, MLC-LLM, and XNNPACK.
Conduct systematic performance analysis using Intel® VTune™ Profiler, Linux perf, and roofline modeling; identify bottlenecks and quantify optimization gains.
Contribute reusable kernels, optimization templates, and best-practice documentation to Intel's internal performance libraries.

Qualifications:

Minimum Qualifications

The candidate must have the right to work in the country of employment without restriction.

Currently pursuing a BS (senior year), MS, or PhD in Computer Science, Electrical Engineering, Computer Engineering, Parallel Computing, or a related technical field.
Available for a minimum of 3 months of full-time or near full-time engagement.
Strong proficiency in C / C++ and solid understanding of computer architecture, including CPU pipelines, cache hierarchies, memory models, and SIMD execution.
Hands-on experience with at least one of:
- x86 SIMD intrinsics (AVX2 / AVX-512 / AMX)
- ARM Neon / SVE intrinsics
- OpenMP / TBB-based multi-threaded optimization
- High-performance CPU GEMM or convolution implementation (e.g., referencing oneDNN, OpenBLAS, XNNPACK, ggml)
Experience with performance profiling tools (Intel® VTune™ Profiler, perf) and the ability to translate profile data into concrete optimizations.

Preferred Qualifications

Open-source contributions to projects such as oneDNN, OpenVINO™ toolkit, llama.cpp, ggml, XNNPACK, OpenBLAS, PyTorch, or ONNX Runtime.
Familiarity with CNN inference optimizations: Winograd, im2col + GEMM, Direct Conv, NCHW / NHWC layout transforms.
Familiarity with LLM inference optimization techniques: KV-cache management, continuous batching, speculative decoding, and low-bit quantization.
Experience with compiler infrastructure (LLVM, MLIR, TVM) or auto-tuning frameworks (AutoTVM, Ansor).
Edge or on-device deployment experience (ARM servers, AI PCs, embedded SoCs).

Job Type:

Student / Intern

Shift:

Shift 1 (China)

Primary Location:

PRC, Shanghai

Additional Locations:

PRC, Beijing; PRC, Shenzhen

Business group:

The Sales and Marketing Group (SMG) leverages the product portfolio to drive Intel's revenue growth and market expansion, blending strategic initiatives with dynamic sales efforts to capture and retain customers. SMG is responsible for empowering the sales force with tools and insights needed to close deals and build lasting customer relationships. Sales analytics and market research ensure strategies are both targeted and impactful. In SMG, disciplined execution, creativity, and ambition are celebrated, providing ample opportunities for career advancement and skill development.

Posting Statement:

All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, sex, national origin, ancestry, age, physical or mental disability, medical condition, genetic information, military and veteran status, marital status, pregnancy, gender, gender expression, gender identity, sexual orientation, or any other characteristic protected by local law, regulation, or ordinance.

Position of Trust

N/A

Work Model for this Role

This role will require an on-site presence. * Job posting details (such as work model, location or time type) are subject to change.

ADDITIONAL INFORMATION: Intel is committed to Responsible Business Alliance (RBA) compliance and ethical hiring practices. We do not charge any fees during our hiring process. Candidates should never be required to pay recruitment fees, medical examination fees, or any other charges as a condition of employment. If you are asked to pay any fees during our hiring process, please report this immediately to your recruiter.

AI Software Engineer Intern
