Senior Software Engineer, CoreAI

Microsoft · Redmond, WA +1 · Software Engineering

The AI Core Infrastructure team is responsible for building and managing the large-scale GPU management infrastructure and inference/training platforms that power Microsoft's AI workloads. This role focuses on the training infrastructure for large-scale model pre-training, post-training, and fine-tuning on advanced GPUs in Azure and partner clouds.

What you'd actually do

  1. Architect, design, and develop core AI infrastructure services in Go, Rust, Python, C++, and C#, deployed on large-scale Kubernetes clusters to support pre-training and post-training of state-of-the-art LLMs, SLMs, multimodal models, and code-specific models.
  2. Collaborate closely with engineers, researchers, and external partners to debug, diagnose, and improve the stability of large-scale training runs.
  3. Enhance systems and applications to deliver high stability, low latency, strong security, and maintainability in large-scale, complex training environments in Azure and partner clouds.
  4. Provide operational support, technical leadership, and vision while contributing to the deployment, monitoring, and continuous improvement of engineering systems and practices.

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field
  • 4+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, and Python
  • Ability to meet Microsoft, customer, and/or government security screening requirements
  • Microsoft Cloud Background Check

Nice to have

  • 3+ years of experience designing, developing, and shipping high-quality software
  • 2+ years of experience with distributed systems and cloud-based infrastructure
  • 1+ year of experience with DevOps practices (CI/CD, automated testing, deployment, etc.)
  • 4+ years of software development experience in C#, C++, Python, or similar languages
  • 2+ years of experience with containerization tools (e.g., Docker, Kubernetes)
  • Knowledge of and hands-on experience with production ML systems, large-scale training infrastructure, NCCL, and CUDA libraries and tools

What the JD emphasized

  • large-scale
  • training infrastructure
  • pre-training
  • post-training
  • fine-tuning
  • large-scale Kubernetes clusters
  • large-scale training runs
  • large-scale complex training environments
  • production ML systems

Other signals

  • large-scale model pre-training
  • large-scale AI Supercomputers