Cloud and AI System Intern

Intel · Semiconductors · Shanghai, China

Research intern focusing on system reliability (RAS) and silent data error characterization and mitigation for AI and general-purpose compute platforms, including heterogeneous systems and large-scale server clusters. Responsibilities include designing and running experiments, analyzing logs, and prototyping detection/diagnosis methods to improve data integrity and platform robustness across the HW/FW/OS/runtime stack.

What you'd actually do

  1. Collect, clean, and analyze platform telemetry and error logs from CPU servers and accelerator-enabled nodes (e.g., DDR/HBM memory, storage, interconnect, PCIe/CXL, fabrics) to identify error signatures and failure patterns.
  2. Design and execute fault injection, stress tests, or workload-driven experiments to reproduce silent data corruption scenarios for AI training/inference and general compute workloads, and validate hypotheses.
  3. Research and analyze in-field scan and lockstep mode features (coverage, limitations, trigger conditions, and impact on AI/CPU workloads), and help evaluate how they can be leveraged to improve silent error detection and data integrity in production.
  4. Research and analyze Silicon Lifecycle Management (SLM) solutions, and integrate them with platform telemetry to enable in-field health monitoring, degradation/trend analysis, and proactive reliability improvements for AI/CPU platforms.
  5. Develop scripts/tools (Python preferred) to automate data processing, experiment orchestration, and report generation; build dashboards or repeatable pipelines when needed.
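
As a rough illustration of the fault-injection and tooling work described in items 2 and 5, the sketch below flips a single random bit in a buffer and checks whether a CRC32 comparison against a known-good checksum catches it. It is a toy software-level example under stated assumptions, not Intel's methodology: the helper names (flip_bit, run_trial) and the payload of random float32 values are hypothetical, and real experiments would target hardware, firmware, or runtime layers rather than an in-memory byte string.

```python
import random
import struct
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def run_trial(payload: bytes, rng: random.Random) -> dict:
    """Inject one random bit flip and record whether a CRC32 check detects it."""
    golden_crc = zlib.crc32(payload)  # checksum of the known-good data
    bit_index = rng.randrange(len(payload) * 8)
    corrupted = flip_bit(payload, bit_index)
    return {
        "bit_index": bit_index,
        "detected": zlib.crc32(corrupted) != golden_crc,
    }


# Hypothetical payload: 1024 float32 values standing in for a tensor shard.
rng = random.Random(0)  # fixed seed so the experiment is repeatable
payload = struct.pack("<1024f", *(rng.random() for _ in range(1024)))

trials = [run_trial(payload, rng) for _ in range(100)]
detected = sum(t["detected"] for t in trials)
print(f"{detected}/{len(trials)} injected single-bit flips detected by CRC32")
```

CRC32 detects every single-bit error, so all injected flips are caught here. The interesting cases in practice are the ones where no such end-to-end check exists on the data path, which is what makes a corruption "silent".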

Skills

Required

  • Python
  • data analysis
  • research mindset
  • form hypotheses
  • design experiments
  • write clear technical reports
  • Mandarin and English communication skills

Nice to have

  • Linux
  • basic scripting
  • GitHub Copilot
  • pandas
  • numpy
  • matplotlib
  • SQL
  • log analytics
  • computer architecture
  • systems (memory hierarchy, storage, networking)
  • RAS concepts (ECC, CRC, parity, scrubbing, checkpoints)
  • AI system stack (GPU/accelerators, driver/runtime, distributed training/inference, communication collectives, data pipelines, and performance/reliability trade-offs)

What the JD emphasized

  • AI training/inference
  • silent data error
  • data integrity
  • platform robustness

Other signals

  • AI training/inference workloads
  • platform robustness
  • data integrity
  • heterogeneous systems (CPU + GPU/accelerators)