Senior Software Engineer, Big Data

Zillow · Consumer · United States · Remote

This role focuses on designing, building, and operating large-scale Kafka and Flink streaming infrastructure. The engineer will lead platform modernization initiatives, develop control planes and tooling, improve reliability, and integrate real-time AI patterns. The role emphasizes system design, operational excellence, and mentoring.

What you'd actually do

  1. Design, build, and operate large‑scale Kafka and Flink infrastructure supporting tier‑0 and tier‑1 workloads.
  2. Lead critical initiatives in our streaming platform modernization, including platform architecture evolution.
  3. Develop and enhance streaming control planes, APIs, CLIs, and provisioning systems that standardize how teams create and operate streaming resources across Zillow.
  4. Improve platform reliability through SLO definition, monitoring, alerting, incident response, and automation.
  5. Evaluate and integrate modern streaming ecosystem capabilities, including managed Kafka offerings, serverless stream processing, and real‑time AI integration patterns.

Skills

Required

  • 5+ years of experience building and operating large‑scale distributed systems
  • Significant production experience with Kafka and/or Flink
  • Proficiency in at least one programming language such as Python, Java, or Scala
  • Experience operating services in cloud environments (for example, AWS)
  • Experience working with container orchestration platforms like Kubernetes
  • Experience designing scalable, multi-tenant systems
  • Experience defining and operating against SLOs
  • Experience participating in on‑call rotations
  • Experience leading incident response efforts
  • Strong systems design skills

Nice to have

  • Experience working with streaming vendors (for example, Confluent, MSK, Redpanda)
  • Experience modernizing legacy Kafka/Flink infrastructure
  • Demonstrated experience leading system design efforts for complex, multi‑team platform initiatives
  • Experience integrating streaming systems with analytics platforms such as Databricks
  • Experience building real‑time context engineering capabilities for AI systems
  • Background in reliability engineering or platform engineering
  • Familiarity with infrastructure‑as‑code tooling such as Terraform
  • Familiarity with CI/CD systems

What the JD emphasized

  • independently owning critical production systems end to end
  • significant production experience with Kafka and/or Flink
  • designing scalable, multi-tenant systems with reliability, cost efficiency, and observability in mind
  • defining and operating against SLOs, participating in on‑call rotations, and leading incident response efforts