Principal Software Engineer

Microsoft · Big Tech · United States · Software Engineering

Principal Software Engineer role focused on designing, developing, and optimizing networking infrastructure for large-scale AI training and inference in Azure Cloud. The role emphasizes high performance, low latency, and reliability for distributed AI workloads, working with AI accelerators and advanced networking technologies.

What you'd actually do

  1. Design, develop, and optimize networking solutions tailored for large-scale AI training infrastructure. Architect and implement high-performance, low-latency, and low-jitter communication frameworks for distributed systems.
  2. Benchmark, analyze, and enhance the scalability and reliability of networking systems to handle petabyte-scale data transfer.
  3. Debug and resolve complex networking issues in large-scale, high-performance environments.
  4. Act as a Designated Responsible Individual (DRI) and guide other engineers by developing and following the playbook; work on call to monitor the system, product, or service for degradation, downtime, or interruptions, alert stakeholders about status, and initiate actions to restore service for both simple and complex problems as appropriate.
  5. Proactively seek new knowledge and adapt to new AI trends, technical solutions, and patterns that improve the availability, reliability, efficiency, observability, and performance of products, while driving consistency in monitoring and operations at scale.
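The benchmarking work in item 2 is, at its smallest scale, round-trip latency measurement. Purely as an illustration (this sketch is not from the posting, and all names in it are invented), here is a minimal loopback TCP echo microbenchmark of the kind used when tuning a network path, reporting p50 and p99 round-trip times:

```python
# Illustrative sketch only: loopback TCP echo round-trip latency benchmark.
# Real AI-fabric benchmarking targets RDMA/InfiniBand paths, not TCP loopback.
import socket
import statistics
import threading
import time

def run_echo_server(sock: socket.socket) -> None:
    """Accept one client and echo its bytes back until it disconnects."""
    conn, _ = sock.accept()
    with conn:
        while data := conn.recv(64):
            conn.sendall(data)

def measure_rtts(port: int, samples: int = 200) -> list[float]:
    """Return per-message round-trip times in microseconds."""
    rtts = []
    with socket.create_connection(("127.0.0.1", port)) as c:
        # Disable Nagle's algorithm so small pings are sent immediately.
        c.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(samples):
            start = time.perf_counter()
            c.sendall(b"ping")
            c.recv(64)
            rtts.append((time.perf_counter() - start) * 1e6)
    return rtts

server = socket.create_server(("127.0.0.1", 0))  # port 0: OS picks a free port
port = server.getsockname()[1]
threading.Thread(target=run_echo_server, args=(server,), daemon=True).start()

rtts = measure_rtts(port)
print(f"p50={statistics.median(rtts):.1f}us "
      f"p99={statistics.quantiles(rtts, n=100)[98]:.1f}us")
```

At petabyte scale the same idea extends to jitter (latency variance) and tail percentiles, which the role calls out explicitly via its low-jitter requirement.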

Skills

Required

  • C
  • C++
  • C#
  • Java
  • JavaScript
  • Python
  • networking technologies in AI-specific hardware
  • networking protocols
  • distributed systems
  • network performance tuning
  • AI accelerators
  • telemetry and observability tools
  • scalable and fault-tolerant systems

Nice to have

  • InfiniBand
  • RoCE
  • NVLink
  • Ethernet
  • TCP/IP
  • RDMA
  • gRPC
  • network virtualization
  • software-defined networking (SDN)
  • NVIDIA GPUs
  • AMD GPUs
  • TPUs

What the JD emphasized

  • high-performance
  • low-latency
  • scalability
  • reliability
  • observability
  • large-scale AI training

Other signals

  • large-scale AI training infrastructure
  • high-performance networking
  • low-latency systems
  • distributed AI workloads
  • AI accelerators