Principal Software Engineer

Microsoft · Big Tech · Bengaluru, KA, IN +1 · Software Engineering

Principal Software Engineer role focused on building and supporting large-scale GPU management infrastructure and inference/training platforms for AI workloads at Microsoft. The role involves architecting, designing, and developing core AI infrastructure services and compute, storage, and networking subsystems for LLM training, customization, and inference.

What you'd actually do

  1. Architect, design, and develop core AI infrastructure services in Go, Rust, Python, C++, and C#, deployed on large-scale Kubernetes clusters to support pre-training and post-training of state-of-the-art LLMs, SLMs, multimodal models, and code-specific models.
  2. Design, build, and manage compute, storage, and networking subsystems on large-scale GPU clusters to support LLM training, customization, and inference workloads.
  3. Enhance systems and applications to deliver high stability, low latency, strong security, and maintainability in large-scale, complex training environments in Azure and in partner clouds.
  4. Provide operational support, technical leadership, and vision while contributing to the deployment, monitoring, and continuous improvement of engineering systems and practices.
  5. Support development and troubleshooting from the frontline, resolving complex issues impacting large-scale services.

Skills

Required

  • 10+ years designing, developing, and shipping high-quality software
  • 4+ years of experience with distributed systems and cloud-based infrastructure
  • 2+ years of experience with DevOps practices (CI/CD, automated testing, deployment, etc.)
  • Production ML systems
  • Large-scale training infrastructure
  • NCCL
  • CUDA libraries and tools

Nice to have

  • 10+ years of software development experience in C#, C++, Python, or similar languages
  • 6+ years of experience with containerization tools (e.g., Docker, Kubernetes)

What the JD emphasized

  • large-scale GPU management infrastructure
  • inference and training platforms
  • large-scale training and inference platform
  • large-scale AI Supercomputers
  • large-scale Kubernetes clusters
  • LLM training, customization, and inference workloads
  • large-scale complex training environments