Principal Software Engineer

Microsoft · Big Tech · Redmond, WA +1 · Software Engineering

A Principal Software Engineer role to advance ad-serving infrastructure, focused on the performance, efficiency, and scalability of next-generation model serving and inference platforms for Ads. The role designs and optimizes high-performance serving systems and GPU inference frameworks for deep learning and LLM workloads.

What you'd actually do

  1. Design and lead the development of large-scale, distributed online serving systems—including GPU-accelerated and CPU-based ranking/inference pipelines—to process millions of ad requests per second with ultra-low latency, high throughput, and solid reliability.
  2. Architect and optimize end-to-end inference infrastructure, including model serving, batching/streaming, caching, scheduling, and resource orchestration across heterogeneous hardware (GPU, CPU, and memory tiers); a minimal batching sketch follows this list.
  3. Profile and optimize performance across the full stack—from CUDA kernels and GPU pipelines to CPU threads and OS-level scheduling—identifying bottlenecks, tuning latency tails, and improving cost efficiency through advanced profiling and instrumentation.
  4. Own live-site reliability as a DRI: design telemetry, alerting, and fault-tolerance mechanisms; drive rapid diagnosis and mitigation of performance regressions or outages in globally distributed systems.
  5. Collaborate and mentor across teams—driving architecture reviews, enforcing engineering excellence, promoting system-level optimization practices, and mentoring others in deep debugging, profiling, and performance engineering.
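
The batching/streaming trade-off in item 2 is easiest to see in code. Below is a minimal, hypothetical C++ sketch of a dynamic-batching loop: requests accumulate until either a maximum batch size or a short deadline is hit, trading a bounded queueing delay for better accelerator utilization. The Request, DynamicBatcher, and run_model names are illustrative placeholders, not anything specified in the posting.

```cpp
// Hypothetical sketch only: a dynamic-batching loop that trades a small,
// bounded queueing delay for larger batches and better accelerator utilization.
// Request, DynamicBatcher, and run_model are illustrative placeholders.
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Request { int id; };

class DynamicBatcher {
public:
    DynamicBatcher(std::size_t max_batch, std::chrono::microseconds max_wait)
        : max_batch_(max_batch), max_wait_(max_wait) {}

    void submit(Request r) {
        { std::lock_guard<std::mutex> lk(mu_); queue_.push(r); }
        cv_.notify_one();
    }

    // Collect up to max_batch_ requests, waiting at most max_wait_ for more work
    // to arrive; an empty batch is returned only on shutdown.
    std::vector<Request> next_batch() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return !queue_.empty() || stopped_; });
        const auto deadline = Clock::now() + max_wait_;
        std::vector<Request> batch;
        while (batch.size() < max_batch_ && !stopped_) {
            if (!queue_.empty()) {
                batch.push_back(queue_.front());
                queue_.pop();
            } else if (!cv_.wait_until(lk, deadline,
                           [&] { return !queue_.empty() || stopped_; })) {
                break;  // deadline hit: ship a partial batch to protect tail latency
            }
        }
        return batch;
    }

    void stop() {
        { std::lock_guard<std::mutex> lk(mu_); stopped_ = true; }
        cv_.notify_all();
    }

private:
    std::size_t max_batch_;
    std::chrono::microseconds max_wait_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<Request> queue_;
    bool stopped_ = false;
};

// Stand-in for a batched model call (e.g. a TensorRT or Triton execution).
void run_model(const std::vector<Request>& batch) {
    std::printf("ran batch of %zu requests\n", batch.size());
}

int main() {
    DynamicBatcher batcher(/*max_batch=*/8, std::chrono::microseconds(500));
    std::thread server([&] {
        for (;;) {
            auto batch = batcher.next_batch();
            if (batch.empty()) break;  // shutdown
            run_model(batch);
        }
    });
    for (int i = 0; i < 20; ++i) {
        batcher.submit({i});
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(5));  // let the queue drain
    batcher.stop();
    server.join();
    return 0;
}
```

Tuning max_batch and max_wait is one concrete form of the latency-versus-throughput trade-off this role would own; a production system would layer per-request telemetry and admission control on top of a loop like this.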

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field
  • 6+ years technical engineering experience
  • Coding experience in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

Nice to have

  • Master's Degree in Computer Science or related technical field
  • 8+ years technical engineering experience
  • 12+ years technical engineering experience
  • Industry experience in advertising or search engine backend systems
  • large-scale ad ranking
  • real-time bidding (RTB)
  • relevance-serving infrastructure
  • real-time data streaming systems (Kafka, Flink, Spark Streaming)
  • feature-store integration
  • multi-region deployment
  • LLM inference optimization
  • model sharding
  • tensor/kv-cache parallelism
  • paged attention
  • continuous batching
  • quantization (AWQ/FP8)
  • hybrid CPU–GPU orchestration
  • SLA-based capacity forecasting
  • autoscaling
  • performance telemetry
  • cross-functional architecture initiatives
  • technical mentorship
  • observability
  • NVIDIA Triton Inference Server
  • CUDA
  • TensorRT
  • custom CUDA kernels
  • memory movement (H2D/D2H)
  • overlapping compute and I/O
  • GPU occupancy
  • kernel fusion
  • batching vs. streaming
  • latency vs. throughput
  • quantization (FP16/BF16/INT8)
  • dynamic batching
  • continuous model rollout
  • adaptive inference scheduling
  • tensor/memory alignment
  • compute–memory balancing
  • embedding table management
  • parameter servers
  • hierarchical caching
  • vectorized inference
  • transformer/LLM architectures
  • multi-threading
  • process scheduling
  • NUMA-aware memory allocation
  • lock-free data structures (see the sketch after this list)
  • context switching
  • I/O stack tuning (NVMe, RDMA)
  • kernel bypass (DPDK, io_uring)
  • CPU/GPU
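
To make one item from the list above concrete, here is a minimal, hypothetical C++ sketch of a bounded lock-free single-producer/single-consumer ring buffer, a common building block for handing requests from a network thread to an inference thread without lock contention. The capacity, payload type, and 64-byte alignment are assumptions for illustration, not requirements from the posting.

```cpp
// Hypothetical sketch only: a bounded lock-free single-producer/single-consumer
// ring buffer for passing work between threads without lock contention.
// Capacity, payload type, and alignment values are illustrative assumptions.
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <optional>
#include <thread>
#include <utility>

template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
public:
    // Producer side: returns false when full so the caller can apply backpressure.
    bool try_push(T value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;          // full
        slots_[head & (Capacity - 1)] = std::move(value);
        head_.store(head + 1, std::memory_order_release);   // publish to the consumer
        return true;
    }

    // Consumer side: returns an empty optional when no work is available.
    std::optional<T> try_pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;               // empty
        T value = std::move(slots_[tail & (Capacity - 1)]);
        tail_.store(tail + 1, std::memory_order_release);    // free the slot
        return value;
    }

private:
    std::array<T, Capacity> slots_{};
    // Keep the two indices on separate cache lines to avoid false sharing.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};

int main() {
    SpscRing<int, 1024> ring;
    std::thread producer([&] {
        for (int i = 0; i < 10000; ++i) {
            while (!ring.try_push(i)) std::this_thread::yield();  // backpressure
        }
    });
    long long sum = 0;
    for (int received = 0; received < 10000;) {
        if (auto v = ring.try_pop()) { sum += *v; ++received; }
        else std::this_thread::yield();
    }
    producer.join();
    std::printf("consumed 10000 items, sum=%lld\n", sum);
    return 0;
}
```

In a real serving path the payload would typically be a pointer or handle to a request object rather than an int, and a queue like this often sits in front of the batching loop sketched earlier.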

What the JD emphasized

  • ultra-low latency
  • low-latency
  • performance engineering
  • deep systems debugging
  • GPU inference frameworks
  • CUDA kernels
  • model serving trade-offs
  • optimize GPU and system workloads
  • low-level system and OS internals

Other signals

  • GPU inference
  • LLM inference optimization
  • low-latency serving