Overview

Joining the CoreAI organization at Microsoft means becoming part of the team that builds the end-to-end AI stack powering Azure’s innovation. As a member of the FIT training team within CoreAI, you will help develop the AI infrastructure that accelerates the creation of agentic AI systems across Microsoft. This role is dedicated to advancing scientific methods and scalable infrastructure for training agentic models to achieve frontier-level performance. You will contribute to LLMs, SLMs, and agentic models using both proprietary and open-source frameworks, all aimed at delivering reliable, enterprise-grade agentic workflows.

We are seeking a curious, independent, adaptable problem-solver who thrives on continuous learning, embraces changing priorities, and is motivated by creating meaningful impact. Candidates must be able to lead and serve as role models for a driven team, write efficient code, debug complex training jobs, document findings, and demonstrate a track record of continuous improvement. In addition, we value an agile, startup-style mindset: someone who can iterate quickly, pivot when needed, and collaborate effectively in fast-paced, dynamic environments.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.

Responsibilities

As a member of our team, you will participate in developing innovative solutions across the AI Platform. Responsibilities include the following:

Collaborate with engineers and researchers to build and optimize training infrastructure and tools for LLMs, SLMs, multimodal models, and code-specific models.
Design, build, and improve highly scalable and reliable services.
Design and implement services that serve production traffic and meet security and privacy requirements.
Participate in efforts to deliver and improve engineering systems and practices to ensure service quality in complex cloud environments.
Contribute to the deployment and monitoring of services in production environments.

Qualifications

Required Qualifications:

Bachelor's Degree in Computer Science or a related technical field and 4+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, or equivalent experience.

Other Requirements:

5+ years of software engineering experience, with significant ownership of production services, cloud platforms, distributed systems, or developer infrastructure.
Strong experience building and operating containerized platforms using Kubernetes or similar orchestration systems.
Strong coding skills in one or more systems or backend languages such as Python, Go, Rust, C++, C#, or Java.
Experience designing reliable production APIs, backend services, or control-plane systems that manage compute, storage, networking, or runtime environments.
Solid understanding of cloud infrastructure fundamentals, including identity, networking, storage, observability, capacity planning, security, and safe deployment practices.
Experience diagnosing production issues using logs, metrics, traces, dashboards, and incident response processes.
Demonstrated ability to lead technical design, drive ambiguous projects to completion, mentor other engineers, and collaborate across teams.

Preferred Qualifications:

Experience with Microsoft Azure, AWS, or Google Cloud, especially managed Kubernetes, container registries, object storage, private networking, identity, secrets, and monitoring services.
Experience building multi-tenant platforms where reliability, fairness, quota management, isolation, and security are important.
Experience with sandboxed execution environments, remote development environments, hosted notebook/tool environments, evaluation infrastructure, or ephemeral compute platforms.
Experience with container image build systems, registry authentication, image caching, package caching, artifact distribution, or startup-latency optimization.
Experience with cloud networking concepts such as ingress, DNS, proxies, egress control, private endpoints, service routing, and traffic management.
Experience with secure runtime design, including authentication, authorization, workload identity, secret handling, network isolation, and protecting shared infrastructure from untrusted workloads.
Experience with AI infrastructure, agent execution, evaluation platforms, GPU workloads, Windows/Linux runtime environments, or VM/container hybrid systems.
Experience improving service operability through structured logging, distributed tracing, dashboards, alerting, automated validation, and incident playbooks.

Software Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800.00 - $234,700.00 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $160,200.00 - $261,000.00 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.