Currently tracking 427 active AI roles, up 208% versus the prior 4 weeks. Primary focus: Agent · Engineering. Salary range $65k–$331k (avg $193k).
| Title | Stage | AI score |
|---|---|---|
| MTS - Site Reliability Engineer This role is for a Site Reliability Engineer (SRE) focused on ensuring the reliability, availability, and efficiency of large-scale distributed AI infrastructure. The SRE will work with ML researchers, data engineers, and product developers to operate platforms for training, fine-tuning, and serving generative AI models. Key responsibilities include maintaining uptime, designing observability systems, optimizing performance, building automation for deployments and incident response, and ensuring security and compliance in hybrid cloud/on-prem CPU+GPU environments. The role requires strong experience in SRE/DevOps, Kubernetes, CI/CD, public cloud platforms, monitoring tools, and programming languages like Python or Go, with a preference for experience with large-scale GPU clusters and HPC. | Serve | 5 |
| Senior Software Engineer Senior Software Engineer on the Ads Data Platform Team, part of Microsoft AI. The role involves designing and operating high-scale data platforms that process billions of events daily, supporting business analytics, machine learning models, and real-time reporting for Microsoft Ads. Focus on distributed systems, machine learning, and big data technologies. | Serve | 5 |
| Research Intern - Azure Storage Research intern role focused on optimizing storage systems for AI workloads, including training, checkpointing, and inferencing. The role involves working with leading-edge AI customers to gain insights into their needs. | Serve | 5 |
| Software Engineer II Software Engineer II role focused on building an AI-powered engineering system for Windows OS updates, aiming for efficiency, scale, and rebootless deployment. The role involves leveraging AI to solve engineering and scale problems, working with cloud-native applications on Azure, and utilizing OpenAI APIs or similar LLM platforms. Requires strong understanding of Windows OS internals and experience with databases. | Serve | 5 |
| Principal Consultant A2 - Infra This role is for a Principal Consultant focused on AI-first infrastructure delivery. The consultant will lead the technical execution of complex client projects, embedding AI-native thinking into delivery models. Responsibilities include designing, building, and optimizing cloud and on-premises infrastructure solutions, leveraging automation and intelligent orchestration, and ensuring secure, scalable, and high-performing environments. The role requires proficiency in Python for AI integration and scripting, and experience with Azure AI Services. | Serve | 5 |
| Software Engineer II Software Engineer II role focused on designing and building next-generation networking infrastructure for large-scale AI training and inference in Azure Cloud. The role involves developing high-performance, low-latency, and reliable networking capabilities to support distributed AI workloads, working at the intersection of AI and high-performance computing. | Serve | 5 |
| Member of Technical Staff, Infrastructure Engineer This role focuses on building and scaling the backend platform for Microsoft's consumer AI products, specifically powering Copilot. It involves designing, developing, and maintaining AI Platform services, collaborating with AI researchers and engineers, and ensuring the reliability and scalability of platform components. The role requires strong experience in backend technologies, public cloud infrastructure, and production software development. | Serve | 5 |
| Member of Technical Staff, Hardware Health - MAI Superintelligence Team This role focuses on ensuring the reliability, performance, and availability of large-scale AI training infrastructures, specifically GPU clusters. It involves designing and developing hardware health monitoring and diagnostic frameworks, building predictive analytics pipelines using telemetry data, and leading incident triage for hardware anomalies. The goal is to drive automation in health management and partner with cross-functional teams to improve hardware design for reliability. | Serve | 5 |