Principal Software Engineer - Azure AI Translation & Language Team

Microsoft · Big Tech · Redmond, WA +1 · Software Engineering

The Principal Software Engineer will design and implement large-scale distributed systems for Azure AI Translation and Language services, focusing on model inference infrastructure, service reliability, and platform architecture. The role involves defining and evolving the platform architecture for high availability, scalability, and performance; driving improvements in reliability and operational excellence; and building core infrastructure components. It also calls for close collaboration with applied science and product teams and for mentoring engineers.

What you'd actually do

  1. Lead the design and implementation of large-scale distributed systems that power Azure AI Translation and Language services.
  2. Define and evolve platform architecture for high availability, scalability, and performance across global deployments.
  3. Drive improvements in reliability, fault tolerance, and operational excellence for mission-critical services.
  4. Build and enhance core infrastructure components such as service orchestration, workload management, and data pipelines.
  5. Establish best practices for service observability, monitoring, alerting, and incident response.

Skills

Required

  • Bachelor’s Degree in Computer Science, Engineering, or related field AND 6+ years of software development experience with coding in languages including, but not limited to, C, C++, C#, Java, or Rust
  • Ability to meet Microsoft, customer and/or government security screening requirements

Nice to have

  • 5+ years of experience designing and building large-scale distributed systems or cloud infrastructure.
  • Experience building and operating highly available services with strict requirements for latency, scalability, and reliability.
  • Experience with cloud platforms (e.g., Azure) and service-oriented or microservices architectures.
  • Experience building infrastructure for AI/ML services, such as model serving platforms, data processing systems, or training pipelines.
  • Experience with system performance optimization, capacity planning, and cost-efficiency at scale.
  • Experience designing globally distributed systems and handling multi-region deployments.
  • Experience with reliability engineering practices, including incident management and postmortem analysis.
  • Programming experience in Python.
  • Master’s or PhD in Computer Science, Engineering, or related field.

What the JD emphasized

  • large-scale distributed systems
  • high availability, scalability, and performance
  • reliability, fault tolerance, and operational excellence
  • service observability, monitoring, alerting, and incident response
  • AI/ML services

Other signals

  • large-scale distributed systems
  • high-performance translation and language services
  • model inference workloads
  • platform architecture
  • service reliability