Senior Software Engineer - Data Lake & BI

Weights & Biases · Data AI · Bellevue, WA +1 · Technology

The role is for a Senior Software Engineer focused on building and evolving a planet-scale performance data warehouse for an AI cloud provider. The engineer will own the architecture for ingesting, storing, transforming, and surfacing performance data, turning raw events into actionable insights for engineering and business decisions. Key responsibilities include data lake architecture, schema design, time-series metrics infrastructure, BI/visualization, and query optimization. Required experience spans distributed systems, data platforms, Python/Go, Kubernetes, data lake architectures, columnar and time-series databases, and BI tools; experience with MLPerf or benchmarking GPU fleets is preferred.

What you'd actually do

  1. Design and build our core performance data lake on columnar storage foundations.
  2. Define and govern schemas for performance telemetry: latency distributions, throughput metrics, GPU utilization, cost-per-token, and hardware health signals.
  3. Own and extend our time-series database (TSDB) layer.
  4. Build compelling, self-service BI views and dashboards (Grafana, Looker, or similar) that translate raw performance data into clear stories for engineers, product managers, and executives.
  5. Profile and tune query engines against columnar and time-series stores; reduce scan times, optimize join strategies, and introduce materialized views or pre-aggregations where they matter most.
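To make item 5 concrete, here is a minimal, hypothetical Python sketch of a pre-aggregation: rolling raw latency events up into per-minute summaries with a nearest-rank P99, so dashboards query a small table instead of scanning raw events. Function names and the event shape are illustrative assumptions, not the team's actual pipeline.

```python
import math
from collections import defaultdict


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered)) - 1
    return ordered[rank]


def preaggregate(events):
    """Roll raw (timestamp_s, latency_ms) events into per-minute buckets.

    Returns {minute_epoch: {"count": n, "p99_ms": x}} -- the kind of
    materialized summary a dashboard can read without touching raw data.
    """
    buckets = defaultdict(list)
    for ts, latency_ms in events:
        # Truncate the timestamp to the start of its minute.
        buckets[ts // 60 * 60].append(latency_ms)
    return {
        minute: {"count": len(vals), "p99_ms": p99(vals)}
        for minute, vals in buckets.items()
    }


# Hypothetical raw events: three in minute 0, one in minute 1.
events = [(0, 10.0), (1, 12.0), (2, 250.0), (61, 8.0)]
summary = preaggregate(events)
```

In a real warehouse the same idea would live as a materialized view or an incremental rollup job keyed by time bucket and dimension columns; the sketch only shows the aggregation step.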

Skills

Required

  • 5+ years of experience building distributed systems, data platforms, or cloud services.
  • Strong coding skills in Python or Go; deep familiarity with networked systems and performance.
  • Hands-on experience with Kubernetes at production scale, CI/CD, and observability stacks (Prometheus, Grafana, OpenTelemetry).
  • Demonstrated expertise with data lake architectures, columnar databases, and modern table formats (Iceberg, Parquet, Avro).
  • Practical experience designing and managing hot/cold storage tiers for large-scale analytical workloads.
  • Strong schema design instincts.
  • Working knowledge of time-series databases and fluency in PromQL or MetricsQL.
  • Experience building BI views, visualizations, and data-driven playbooks.
  • Strong communicator comfortable collaborating with cross-functional teams and external partners.
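The hot/cold tiering requirement above can be sketched with a toy age-based routing rule in Python. This is a simplified stand-in for real table-format lifecycle policies (e.g. Iceberg snapshot expiry plus object-storage lifecycle rules); the function name and the retention threshold are assumptions for illustration only.

```python
def choose_tier(partition_age_days, hot_retention_days=7):
    """Route an analytical partition to a storage tier by age.

    Recent partitions stay on fast (hot) storage for interactive
    queries; older partitions move to cheaper (cold) object storage.
    Real systems layer in access frequency, size, and SLA signals.
    """
    return "hot" if partition_age_days <= hot_retention_days else "cold"


# Partitions that are 1, 7, and 30 days old.
tiers = [choose_tier(age) for age in (1, 7, 30)]
```

The design point is that tiering is a policy applied per partition, so the decision function stays pure and testable while the actual data movement is handled by a separate compaction or lifecycle job.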

Nice to have

  • Experience with time-series database internals, LSM-based storage engines, or custom data pipelines.
  • Experience running MLPerf submissions or similar large-scale audited benchmarks.
  • Contributions to OSS projects such as Apache Iceberg, Apache Spark, Trino, llm-d, vLLM, or PyTorch.
  • Exposure to benchmarking large GPU fleets or multi-region clusters.
  • Experience with CUDA kernels, NCCL/SHARP, RDMA/NUMA, or GPU interconnect topologies.
  • Familiarity with data cataloging, lineage tools, or data governance frameworks.

What the JD emphasized

  • planet-scale performance data warehouse
  • ingest, store, transform, and surface performance data
  • billions of raw events into trusted, queryable insights
  • data foundations that underpin industry-leading benchmark publications
  • internal performance SLAs
  • executive-level reporting
  • authoritative, reproducible, and actionable
  • performance data warehouse
  • performance telemetry
  • time-series database (TSDB) layer
  • performance data
  • performance claim is backed by a reproducible query and a versioned dataset
  • query engines
  • P99 latency and freshness SLAs
  • running MLPerf submissions or similar large-scale audited benchmarks