Senior Software Engineer, CoreAI

Microsoft · Big Tech · Redmond, WA +1 · Software Engineering

Senior Software Engineer on the FIT training team within CoreAI at Microsoft, focused on building and optimizing the AI infrastructure used to train agentic AI systems and LLMs/SLMs to frontier-level performance. The role involves developing scalable infrastructure, working with both proprietary and open-source frameworks, and supporting enterprise-grade agentic workflows.

What you'd actually do

  1. Collaborate with engineers and researchers to build and optimize training infrastructure and tools for LLMs, SLMs, multimodal, and code-specific models.
  2. Design, build, and improve highly scalable and reliable services.
  3. Design and implement services that serve production traffic and meet security and privacy requirements.
  4. Participate in efforts to deliver and improve engineering systems and practices to ensure service quality in complex cloud environments.
  5. Contribute to the deployment and monitoring of services in production environments.

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field and 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, or equivalent experience.
  • 5+ years of software engineering experience, with significant ownership of production services, cloud platforms, distributed systems, or developer infrastructure.
  • Strong experience building and operating containerized platforms using Kubernetes or similar orchestration systems.
  • Strong coding skills in one or more systems or backend languages such as Python, Go, Rust, C++, C#, or Java.
  • Experience designing reliable production APIs, backend services, or control-plane systems that manage compute, storage, networking, or runtime environments.
  • Solid understanding of cloud infrastructure fundamentals, including identity, networking, storage, observability, capacity planning, security, and safe deployment practices.
  • Experience diagnosing production issues using logs, metrics, traces, dashboards, and incident response processes.
  • Demonstrated ability to lead technical design, drive ambiguous projects to completion, mentor other engineers, and collaborate across teams.

Nice to have

  • Experience with Microsoft Azure, AWS, or Google Cloud, especially managed Kubernetes, container registries, object storage, private networking, identity, secrets, and monitoring services.
  • Experience building multi-tenant platforms where reliability, fairness, quota management, isolation, and security are important.
  • Experience with sandboxed execution environments, remote development environments, hosted notebook/tool environments, evaluation infrastructure, or ephemeral compute platforms.
  • Experience with container image build systems, registry authentication, image caching, package caching, artifact distribution, or startup-latency optimization.
  • Experience with cloud networking concepts such as ingress, DNS, proxies, egress control, private endpoints, service routing, and traffic management.
  • Experience with secure runtime design, including authentication, authorization, workload identity, secret handling, network isolation, and protecting shared infrastructure from untrusted workloads.
  • Experience with AI infrastructure, agent execution, evaluation platforms, GPU workloads, Windows/Linux runtime environments, or VM/container hybrid systems.
  • Experience improving service operability through structured logging, distributed tracing, dashboards, alerting, automated validation, and incident playbooks.

What the JD emphasized

  • lead and role models
  • efficient code
  • debug complex training jobs
  • document findings
  • track record of continuous improvement
  • agile, startup-style mindset
  • iterate quickly
  • pivot when needed
  • collaborate effectively
  • fast-paced, dynamic environments

Other signals

  • training infrastructure
  • agentic AI systems
  • frontier-level performance
  • LLMs, SLMs, and agentic models