Senior Software Engineer

Microsoft · Big Tech · United States · Software Engineering

This role focuses on designing and developing next-generation networking infrastructure for large-scale AI training and inference in Azure Cloud. The engineer will work on high-performance, low-latency, and low-jitter communication frameworks, optimizing scalability and reliability for distributed AI workloads.

What you'd actually do

  1. Design, develop, and optimize networking solutions tailored for large-scale AI training infrastructure. Architect and implement high-performance, low-latency, and low-jitter communication frameworks for distributed systems.
  2. Benchmark, analyze, and enhance the scalability and reliability of networking systems to handle petabyte-scale data transfer.
  3. Debug and resolve complex networking issues in large-scale, high-performance environments.
  4. Drive the identification of dependencies and the development of design documents for products, applications, services, or platforms.
  5. Create, implement, optimize, debug, refactor, and reuse code to improve performance, maintainability, effectiveness, and return on investment (ROI).
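The benchmarking work in item 2 is, at its core, about measuring latency and jitter rigorously. As a purely illustrative sketch (not Microsoft tooling; all names here are hypothetical), the shape of such a measurement can be shown with a loopback TCP echo round-trip, reporting p50, p99, and jitter:

```python
# Illustrative toy benchmark: round-trip latency and jitter over a
# loopback TCP echo server. Real infrastructure benchmarks would target
# RDMA/NIC paths, but the percentile/jitter methodology is the same.
import socket
import statistics
import threading
import time


def echo_server(srv: socket.socket) -> None:
    """Accept one connection and echo everything back until EOF."""
    conn, _ = srv.accept()
    with conn:
        while data := conn.recv(64):
            conn.sendall(data)


def measure_rtt(n: int = 200) -> tuple[int, int, float]:
    """Return (p50_ns, p99_ns, jitter_ns) over n ping/pong round trips."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))  # ephemeral port
    srv.listen(1)
    threading.Thread(target=echo_server, args=(srv,), daemon=True).start()

    cli = socket.create_connection(srv.getsockname())
    # Disable Nagle's algorithm so small writes are not batched,
    # which would otherwise dominate the latency measurement.
    cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    samples = []
    for _ in range(n):
        t0 = time.perf_counter_ns()
        cli.sendall(b"ping")
        cli.recv(64)
        samples.append(time.perf_counter_ns() - t0)
    cli.close()
    srv.close()

    samples.sort()
    p50 = samples[len(samples) // 2]
    p99 = samples[int(len(samples) * 0.99)]
    jitter = statistics.pstdev(samples)  # spread of RTTs = jitter proxy
    return p50, p99, jitter
```

Tail percentiles (p99, p999) rather than averages are what matter for distributed AI workloads, since collective operations run at the speed of the slowest participant.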

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • Ability to meet Microsoft, customer, and/or government security screening requirements

Nice to have

  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience coding in those languages; OR equivalent experience.
  • In-depth understanding of networking protocols (e.g., Ethernet, TCP/IP, RDMA, gRPC) and distributed systems.
  • Familiarity with network virtualization, software-defined networking (SDN), or network performance tuning.
  • Hands-on experience with networking technologies in AI-specific hardware (e.g., InfiniBand, RoCE, NVLink).
  • Familiarity with AI accelerators such as GPUs (NVIDIA, AMD) or TPUs, and how they interact with networking infrastructure.
  • Experience with telemetry and observability tools for network monitoring at scale.
  • Background in building scalable and fault-tolerant systems in large, distributed environments.
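The telemetry and observability bullet typically means aggregating latency distributions across thousands of hosts. A common primitive for this, sketched here purely as an illustration (a Prometheus-style bucketed histogram; the class and bucket bounds are hypothetical, not a real API), trades exact values for cheap, mergeable counters:

```python
# Hypothetical sketch of a bucketed latency histogram, the kind of
# observability primitive used for network monitoring at scale.
# Per-bucket counters are cheap to record and to sum across many hosts.
import bisect


class LatencyHistogram:
    """Fixed-bucket histogram over latencies in microseconds."""

    def __init__(self, buckets_us=(50, 100, 250, 500, 1000, 5000)):
        self.bounds = list(buckets_us)          # upper bounds per bucket
        self.counts = [0] * (len(self.bounds) + 1)  # last slot = +inf bucket
        self.total = 0

    def observe(self, latency_us: float) -> None:
        """Record one sample in the first bucket whose bound covers it."""
        self.counts[bisect.bisect_left(self.bounds, latency_us)] += 1
        self.total += 1

    def quantile(self, q: float) -> float:
        """Upper bound of the bucket containing quantile q (0 < q <= 1)."""
        target = q * self.total
        cum = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            cum += count
            if cum >= target:
                return bound
        return float("inf")
```

The design choice to return a bucket's upper bound (rather than an exact value) is what makes fleet-wide aggregation trivial: histograms from different hosts merge by element-wise addition of their counters.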

What the JD emphasized

  • high-performance
  • low-latency
  • scalability
  • reliability
  • observability
  • large-scale AI training
  • large-scale systems
  • low-latency systems
  • large, distributed environments

Other signals

  • AI supercomputer
  • networking infrastructure