Principal Software Engineer – Core AI

Microsoft · Big Tech · Redmond, WA +1 · Software Engineering

Principal Software Engineer at Microsoft Foundry, Core AI, focused on building and scaling the platform for intelligent agents and generative AI systems. This role involves driving technical direction, architectural decisions, and ensuring quality, reliability, security, and compliance for large-scale AI systems. The engineer will mentor others, influence without authority, and champion automation and responsible AI practices.

What you'd actually do

  1. Drives the improvement of artificial intelligence (AI) tools and practices across the software development lifecycle (SDLC).
  2. Provides technical leadership during code reviews for a solution/product area to assure it meets team standards, contains the correct test coverage, and is appropriate for the product or solution area.
  3. Establishes best practices and mentors others to create a clear test strategy that ensures solution quality and prevents regressions from being introduced into existing code.
  4. Designs and executes plans for redesigning or rearchitecting difficult or untestable sections of code across solutions and/or products.
  5. Leverages artificial intelligence (AI) tools for test automation.

Skills

Required

  • Software development lifecycle (SDLC)
  • Architectural decisions
  • Large-scale systems
  • AI technologies
  • Quality, reliability, security, and compliance
  • Customer and developer needs analysis
  • Technical leadership
  • Mentoring engineers
  • Influencing without authority
  • Design principles
  • Code reviews
  • Continuous learning
  • Collaboration with partner teams
  • Automation
  • Operational excellence
  • Secure-by-design practices
  • Responsible AI practices
  • AI tools and practices
  • Coding standards
  • Automated source code analysis
  • Extensible, maintainable, well-tested, secure, and performant code
  • Code performance optimization
  • Debugging tools
  • Testing frameworks
  • Incident retrospectives
  • Least-access principles
  • Privacy and security
  • Test strategy
  • Security testing
  • Test automation
  • System architecture

Nice to have

  • Generative AI (GenAI)

What the JD emphasized

  • critical role in building and evolving the platform that enables developers and enterprises to design, deploy, and scale intelligent agents and generative AI systems
  • owning architectural decisions for complex, large‑scale systems that integrate cutting‑edge AI technologies
  • highest standards of quality, reliability, security, and compliance
  • technical leadership across teams
  • accelerating value to customers
  • raise the engineering bar through strong design principles, rigorous code reviews, and a culture of continuous learning
  • seamless integration, scalable architectures, and robust deployment and testing frameworks
  • champion automation, operational excellence, and secure‑by‑design practices
  • define how AI systems and agent platforms are built responsibly and at scale
  • shaping how the world interacts with intelligent systems
  • Incorporates Responsible AI practices into the SDLC
  • Experiments with AI tools and practices to improve their own capabilities
  • Leads by example across teams and mentors others to produce extensible, maintainable, well-tested, secure, and performant code used across the company that adheres to design specifications.
  • Leads efforts to continuously improve code performance, testability, maintainability, effectiveness, and cost, while accounting for and incorporating relevant trade-offs.
  • Creates and applies metrics to drive code quality and stability, appropriate coding patterns, and best practices.
  • Leads efforts to identify and anticipate blockers or unknowns during the development process, escalates them, communicates their impact on timelines, and then drives the identification and implementation of strategies and/or opportunities to address them.
  • Acts as an expert on using debugging tools, tests, logs, telemetry, and other methods, and proactively leads verification of assumptions while developing code, before issues occur in production across products and teams.
  • Leverages minimal telemetry data to triangulate issues and resolve them with minimal iterations.
  • Leads incident retrospectives to identify root causes of problems, and owns the implementation of repair actions and the identification of mechanisms to prevent incident recurrence.
  • Drives applying least-access principles, using logging, telemetry, and other appropriate mechanisms to investigate issues while retaining privacy and security, and champions those practices across the team.
  • Establishes best practices and mentors others to create a clear test strategy that ensures solution quality and prevents regressions from being introduced into existing code.
  • Establishes best practices and mentors others on ensuring test plans incorporate security testing to validate security invariants (including negative cases).
  • Provides technical leadership on adding new tests to cover gaps, deleting or fixing broken tests, and improving the speed, reliability, and defect localization of the overall test suite across a solution or product.
  • Mentors others on building testable code and on considering testability during design across solutions and/or products.
  • Acts as a thought leader on the different types of tests that can be run on a particular system (e.g., unit tests), maintains an up-to-date understanding of testing architectures used both across Microsoft and across the industry, and applies them across the architecture as appropriate.
  • Designs and executes plans for redesigning or rearchitecting difficult or untestable sections of code across solutions and/or products.
  • Leverages artificial intelligence (AI) tools for test automation.
  • Oversees, influences, and owns efforts and design discussions for the overall system architecture of entire products.

Other signals

  • building and evolving the platform that enables developers and enterprises to design, deploy, and scale intelligent agents and generative AI systems
  • drive technical direction across the full software development lifecycle, owning architectural decisions for complex, large‑scale systems that integrate cutting‑edge AI technologies
  • champion automation, operational excellence, and secure‑by‑design practices, helping define how AI systems and agent platforms are built responsibly and at scale