Principal Software Engineer

Microsoft · Big Tech · Bengaluru, KA, IN · Software Engineering

This Principal Software Engineer role focuses on building and operating large-scale, real-time data pipelines and feature/embedding materialization systems for Microsoft Ads. The work spans designing and implementing streaming ETL, operating messaging systems such as Kafka, enforcing data contracts, and integrating with ML inference serving, with success measured on freshness, correctness, latency, reliability, and cost. It calls for strong programming skills, distributed-systems experience, and observability expertise.

What you'd actually do

  1. Design and implement real-time streaming ETL / feature pipelines (e.g., Flink or Spark Structured Streaming) that meet strict freshness and correctness constraints (see the first sketch after this list).
  2. Build and operate reliable messaging and ingestion with Kafka/Pulsar (partitioning strategy, retries, ordering guarantees, DLQs, backpressure handling); the second sketch below illustrates the retry/DLQ pattern.
  3. Own data contracts between producers, pipelines, and consumers: schema evolution, versioning, compatibility, validation, and safe rollout (third sketch below).
  4. Implement production-grade backfill/replay workflows.
  5. Define and meet SLOs using OpenTelemetry/Prometheus/Grafana for metrics, tracing, dashboards, alerting, and incident response readiness.
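
To ground item 1, here is a minimal PySpark Structured Streaming sketch. The topic name `ad_events`, broker address, schema, sink/checkpoint paths, and window/watermark durations are all illustrative assumptions, not details from the JD:

```python
# Minimal streaming-ETL sketch: Kafka -> windowed features -> parquet sink.
# All names (topic, broker, paths) are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("feature-etl").getOrCreate()

schema = StructType([
    StructField("ad_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("bid", DoubleType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
    .option("subscribe", "ad_events")                   # assumed topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# The watermark bounds how late events may arrive before a window is
# finalized; this is where the freshness/correctness tradeoff lives.
features = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "1 minute"), "ad_id")
    .agg(F.avg("bid").alias("avg_bid_1m"), F.count("*").alias("events_1m"))
)

query = (
    features.writeStream
    .outputMode("append")                               # emit only finalized windows
    .format("parquet")
    .option("path", "/data/features/ad_bid")            # assumed sink path
    .option("checkpointLocation", "/chk/ad_bid")        # enables restart/replay
    .trigger(processingTime="30 seconds")
    .start()
)
query.awaitTermination()
```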
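
For item 2, a sketch of the bounded-retry plus dead-letter pattern with confluent-kafka; the topic names, group id, and three-attempt policy are assumptions:

```python
# At-least-once consumption with bounded retries and a dead-letter topic.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",   # assumed broker
    "group.id": "feature-ingest",         # assumed consumer group
    "enable.auto.commit": False,          # commit only after success or DLQ
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "broker:9092"})
consumer.subscribe(["ad_events"])

MAX_ATTEMPTS = 3

def process(payload: dict) -> None:
    ...  # hypothetical per-record business logic; raises on failure

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue  # a real consumer would log/alert on broker errors here
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(json.loads(msg.value()))
            break
        except Exception:
            if attempt == MAX_ATTEMPTS:
                # Keep the original key so DLQ records retain partition affinity.
                producer.produce("ad_events.dlq", key=msg.key(), value=msg.value())
                producer.flush()
    # Commit only after the record is handled (or dead-lettered): at-least-once.
    consumer.commit(message=msg)
```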
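
For item 3, production systems typically lean on a schema registry; as a sketch of the underlying rule (a new reader must still decode old producers' records), here is a hand-rolled backward-compatibility check. The dict-based schema shape is purely illustrative:

```python
# Hand-rolled backward-compatibility check for record schemas.
# Schema shape {field: {"type": ..., "default": ...}} is illustrative only.

def backward_violations(old_fields: dict, new_fields: dict) -> list[str]:
    """Return violations that would break a new reader on old data."""
    violations = []
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            violations.append(f"added field '{name}' has no default")
        elif name in old_fields and old_fields[name]["type"] != spec["type"]:
            violations.append(f"field '{name}' changed type")
    return violations

v1 = {"ad_id": {"type": "string"}, "bid": {"type": "double"}}
v2 = {
    "ad_id": {"type": "string"},
    "bid": {"type": "double"},
    "region": {"type": "string", "default": "unknown"},  # safe: has a default
}

assert backward_violations(v1, v2) == []                    # compatible evolution
assert backward_violations(v2, {"ad_id": {"type": "int"}})  # type change flagged
```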

Skills

Required

  • Bachelor’s or Master’s degree in Computer Science, Electrical/Computer Engineering, or a related field, with 8+ years of related experience.
  • Strong programming skills in C++, C#, or Python (at least one required).
  • Building and operating streaming data pipelines in production (Flink or Spark Structured Streaming).
  • Distributed systems engineering with strong reliability and operational rigor.
  • Messaging systems such as Kafka/Pulsar.
  • Operating services with Kubernetes/containers and production readiness practices (deployments, scaling, rollbacks).
  • Observability stacks such as OpenTelemetry, Prometheus, Grafana (a minimal instrumentation sketch follows this list).
  • Ability to debug complex production issues using logs/metrics/traces and performance profiling.
  • Strong communication and collaboration skills.
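
On the observability bullet: a minimal instrumentation sketch with prometheus_client, exposing the counter/histogram/gauge that error-rate, latency, and freshness SLOs would be built on. The metric names, port, and the event_ts field are assumptions:

```python
# Minimal SLO instrumentation for one pipeline stage with prometheus_client.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

RECORDS = Counter("pipeline_records_total", "Records processed", ["outcome"])
LATENCY = Histogram("pipeline_process_seconds", "Per-record processing time")
FRESHNESS = Gauge("pipeline_freshness_seconds", "Now minus last record's event time")

def handle(record: dict) -> None:
    start = time.time()
    try:
        ...  # hypothetical transform/write
        RECORDS.labels(outcome="ok").inc()
    except Exception:
        RECORDS.labels(outcome="error").inc()
        raise
    finally:
        LATENCY.observe(time.time() - start)
        FRESHNESS.set(time.time() - record["event_ts"])  # assumed field

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrape endpoint (assumed port)
```

Alert rules and dashboards (e.g., burn-rate alerts on the error-labeled counter) would then live in Prometheus/Grafana.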

Nice to have

  • Experience with feature stores, embedding pipelines, and online/offline consistency (freshness guarantees, correctness validation).
  • Experience with data lakehouse/table formats and their optimizations, e.g., partitioning, compaction, and incremental processing.
  • Experience with GPU inference serving (Triton, ONNX Runtime/TensorRT) and performance techniques such as batching, request shaping, and tail-latency reduction (first sketch after this list).
  • Understanding of pipeline correctness patterns: idempotency, deduplication, watermarking, late-data handling, and exactly-once vs. at-least-once tradeoffs (second sketch after this list).
  • Background in cost/performance modeling, capacity planning, and reliability improvements for high-scale data platforms.
  • Experience in Ads/search/recommendations or other high-scale systems where freshness, latency, and cost are jointly optimized.
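
On the GPU inference bullet, micro-batching is the standard throughput/tail-latency lever. A sketch with ONNX Runtime, where the model path, single-output assumption, and input name `input` are all hypothetical:

```python
# Micro-batched ONNX Runtime inference: one GPU round trip per batch.
import numpy as np
import onnxruntime as ort

# Prefer CUDA, fall back to CPU if no GPU is available.
sess = ort.InferenceSession(
    "model.onnx",                                    # hypothetical model file
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

def infer_batch(requests: list[np.ndarray]) -> list[np.ndarray]:
    """Batching amortizes launch/transfer overhead: a small queueing
    delay is traded for much higher throughput per GPU."""
    batch = np.stack(requests).astype(np.float32)    # [B, ...] single tensor
    (out,) = sess.run(None, {"input": batch})        # assumes one named input/output
    return [row for row in out]                      # split results per request

# Usage: size the batch so the queueing delay fits the latency budget.
outputs = infer_batch([np.random.rand(128) for _ in range(32)])
```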
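
And for the correctness-patterns bullet: under at-least-once delivery, retries and replays redeliver records, so sinks must be idempotent. A sketch using a deterministic event ID and an atomic set-if-absent; the Redis store, key scheme, and 24h TTL are assumptions:

```python
# Idempotent at-least-once processing: duplicates are dropped at the sink.
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)   # assumed dedup store

def event_id(event: dict) -> str:
    # Deterministic ID: the same payload always hashes to the same key,
    # so redelivery after a crash or replay becomes a no-op.
    return hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

def write_to_feature_store(event: dict) -> None:
    ...  # hypothetical downstream write

def process_once(event: dict) -> bool:
    """Return True if processed, False if the event was a duplicate."""
    key = f"seen:{event_id(event)}"
    # SET NX is atomic: only the first delivery wins; the TTL bounds memory.
    if not r.set(key, 1, nx=True, ex=86_400):
        return False
    write_to_feature_store(event)
    return True
```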

What the JD emphasized

  • real-time data
  • low-latency serving
  • ML models
  • massive scale
  • strict freshness, cost, and reliability requirements
  • real-time data pipelines
  • feature/embedding materialization systems
  • ML inference serving
  • robust streaming + ETL systems
  • owning SLOs
  • strong observability
  • operational maturity
  • optimizing end-to-end performance and cost
  • freshness, correctness, latency, reliability, and cost in production
  • strict freshness and correctness constraints
  • reliable messaging and ingestion
  • production-grade backfill/replay workflows
  • define and meet SLOs
  • production readiness practices
  • debug complex production issues
  • feature stores, embedding pipelines, and online/offline consistency

Other signals

  • ML models
  • feature stores
  • embedding pipelines
  • LLM API calls
  • inference serving