What you'd actually do

Architect, design, and develop core AI Infrastructure services developed in Go, Rust, Python, C++, and C# deployed on large-scale Kubernetes clusters to support pre-training and post-training of state-of-the-art LLMs, SLMs, multimodal, and code-specific models.

Collaborate closely with engineers, researchers and external partners to debug, diagnose, and improve stability of large-scale training runs.

Enhance systems and applications to deliver high stability, low latency, strong security, and maintainability in large-scale complex training environments in Azure and in partner clouds.

Provide operational support, technical leadership, and vision while contributing to the deployment, monitoring, and continuous improvement of engineering systems and practices.

Skills

Required

Bachelor's Degree in Computer Science or related technical field
2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, or equivalent experience.
Ability to meet Microsoft, customer and/or government security screening requirements

Nice to have

2+ years designing, developing, and shipping high quality software.
2+ years of experience with distributed systems and cloud-based infrastructure.
1+ year of experience with DevOps practices (CI/CD, automated testing, deployment, etc.).
2+ years of software development experience in C#, C++, Python, or similar languages.
2+ years of experience with containerization tools (e.g., Docker, Kubernetes).
Knowledge and hands on experience with production ML systems, large-scale training infrastructure, NCCL, CUDA libraries and tools.

Overview

CoreAI is at the forefront of Microsoft’s mission to redefine how software is built and experienced. We are responsible for building the foundational platforms, services, programming models, and developer experiences that power the next generation of applications using Generative AI. Our work enables developers and enterprises to harness the full potential of AI to create intelligent, adaptive, and transformative software.

The AI Core Infrastructure team, part of AI Platform team in CoreAI Organization is responsible for large-scale, highly reliable and efficient GPU management infrastructure and the inference and training platforms that powers all of Microsoft’s AI workloads, such as M365 CoPilot, Github CoPilot, Microsoft CoPilot, AI Foundry’s Inference and Fine-Tuning offering of OAI and OSS models, and many more.

As a Software Engineer on the training infrastructure team, you will work on cutting edge infrastructure and tools to support large scale model pre-training, post-training and fine-tuning on latest generation of NVIDIA and AMD GPUs in Azure and Microsoft partner clouds on some of the world’s largest AI Supercomputers.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.

Responsibilities

As an engineer on the training infrastructure team, your responsibilities include:

Architect, design, and develop core AI Infrastructure services developed in Go, Rust, Python, C++, and C# deployed on large-scale Kubernetes clusters to support pre-training and post-training of state-of-the-art LLMs, SLMs, multimodal, and code-specific models.
Collaborate closely with engineers, researchers and external partners to debug, diagnose, and improve stability of large-scale training runs.
Enhance systems and applications to deliver high stability, low latency, strong security, and maintainability in large-scale complex training environments in Azure and in partner clouds.
Provide operational support, technical leadership, and vision while contributing to the deployment, monitoring, and continuous improvement of engineering systems and practices.

Qualifications

**Required Qualifications: **

Bachelor's Degree in Computer Science or related technical field and 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, or equivalent experience.

**Other Requirements: **

Ability to meet Microsoft, customer and/or government security screening requirements are required for this role. These requirements include but are not limited to the following specialized security screenings:
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years

Preferred Qualifications:

2+ years designing, developing, and shipping high quality software.
2+ years of experience with distributed systems and cloud-based infrastructure.
1+ year of experience with DevOps practices (CI/CD, automated testing, deployment, etc.).
2+ years of software development experience in C#, C++, Python, or similar languages.
2+ years of experience with containerization tools (e.g., Docker, Kubernetes).
Knowledge and hands on experience with production ML systems, large-scale training infrastructure, NCCL, CUDA libraries and tools.

#CoreAI

Software Engineering IC3 - The typical base pay range for this role across the U.S. is USD $100,600 - $199,000 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $131,400 - $215,400 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

Software Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $158,400 - $258,000 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about **requesting accommodations.**

Overview

In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.

Responsibilities

As an engineer on the training infrastructure team, your responsibilities include:

Architect, design, and develop core AI Infrastructure services developed in Go, Rust, Python, C++, and C# deployed on large-scale Kubernetes clusters to support pre-training and post-training of state-of-the-art LLMs, SLMs, multimodal, and code-specific models.
Collaborate closely with engineers, researchers and external partners to debug, diagnose, and improve stability of large-scale training runs.
Enhance systems and applications to deliver high stability, low latency, strong security, and maintainability in large-scale complex training environments in Azure and in partner clouds.
Provide operational support, technical leadership, and vision while contributing to the deployment, monitoring, and continuous improvement of engineering systems and practices.

Qualifications

**Required Qualifications: **

Bachelor's Degree in Computer Science or related technical field and 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, or equivalent experience.

**Other Requirements: **

Ability to meet Microsoft, customer and/or government security screening requirements are required for this role. These requirements include but are not limited to the following specialized security screenings:
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years

Preferred Qualifications:

2+ years designing, developing, and shipping high quality software.
2+ years of experience with distributed systems and cloud-based infrastructure.
1+ year of experience with DevOps practices (CI/CD, automated testing, deployment, etc.).
2+ years of software development experience in C#, C++, Python, or similar languages.
2+ years of experience with containerization tools (e.g., Docker, Kubernetes).
Knowledge and hands on experience with production ML systems, large-scale training infrastructure, NCCL, CUDA libraries and tools.

#CoreAI

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Software Engineering, Coreai

What you'd actually do

Skills

Required

Nice to have

What the JD emphasized

Other signals