Senior Data Engineer, Telemetry and Inf… at NVIDIA

What you'd actually do

Define and execute the group’s data technical roadmap, aligning with R&D, hardware, and DevOps teams

Design and maintain flexible ETL/ELT frameworks for ingesting, transforming, and classifying telemetry and performance data

Build and optimize streaming pipelines using Apache Spark, Kafka, and Databricks, ensuring high throughput, reliability, and adaptability to evolving data schemas

Implement and maintain observability and data quality standards, including schema validation, lineage tracking, and metadata management

Deliver reliable insights for cluster performance analysis, telemetry visibility, and end-to-end test coverage

Skills

Required

5+ years of hands-on experience in data engineering or backend development
Strong practical experience with Apache Spark (PySpark or Scala) and Databricks
Expertise with Apache Kafka, including stream ingestion, schema registry, and event processing
Proficiency in Python and SQL for data transformation, automation, and pipeline logic
Familiarity with ETL orchestration tools (Airflow, Prefect, or Dagster)
Experience with schema evolution, data versioning, and validation frameworks (Delta Lake, Iceberg, or Great Expectations)
Solid understanding of cloud environments (AWS preferred; GCP or Azure also relevant)
Knowledge of streaming and telemetry data architectures in large-scale, distributed systems

Nice to have

Exposure to hardware, firmware, or embedded telemetry environments
Experience with real-time analytics frameworks (Spark Structured Streaming, Flink, Kafka Streams)
Experience with data cataloging or governance tools (DataHub, Collibra, or Alation)
Familiarity with CI/CD for data pipelines and infrastructure-as-code (Terraform, GitHub Actions)
Experience designing performance metrics data systems (latency, throughput, resource utilization) that support high-volume, high-frequency telemetry at scale

What the JD emphasized

massive volumes of real-time telemetry data

large-scale AI and HPC clusters

high throughput, reliability, and adaptability to evolving data schemas

observability and data quality standards

schema validation, lineage tracking, and metadata management

streaming and telemetry data architectures in large-scale, distributed systems

We’re looking for an experienced Data Engineer to join our Networking Cluster Solutions (NCS) group and help shape the data backbone behind NVIDIA’s advanced R&D telemetry and performance analytics ecosystem.

In this role, you’ll design, build, and maintain scalable, high-performance data pipelines that handle massive volumes of real-time telemetry data from hardware, communication modules, firmware, and large-scale AI and HPC clusters. You’ll work in a cutting-edge environment that combines software, hardware, and data infrastructure — driving the next generation of NVIDIA’s cluster performance and monitoring technologies.

The NCS group builds and operates some of NVIDIA Israel’s most advanced AI and HPC clusters, optimizing networking and compute performance at scale.

What you’ll be doing:

Define and execute the group’s data technical roadmap, aligning with R&D, hardware, and DevOps teams
Design and maintain flexible ETL/ELT frameworks for ingesting, transforming, and classifying telemetry and performance data
Build and optimize streaming pipelines using Apache Spark, Kafka, and Databricks, ensuring high throughput, reliability, and adaptability to evolving data schemas
Implement and maintain observability and data quality standards, including schema validation, lineage tracking, and metadata management
Develop monitoring and alerting for pipeline health using Prometheus, Grafana, or Datadog
Support self-service analytics for engineers and researchers via Databricks notebooks, APIs, and curated datasets
Promote best practices in data modeling, code quality, security, and operational excellence across the organization
Deliver reliable insights for cluster performance analysis, telemetry visibility, and end-to-end test coverage

What we need to see:

B.Sc. or M.Sc. in Computer Science, Computer Engineering, or a related field
5+ years of hands-on experience in data engineering or backend development
Strong practical experience with Apache Spark (PySpark or Scala) and Databricks
Expertise with Apache Kafka, including stream ingestion, schema registry, and event processing
Proficiency in Python and SQL for data transformation, automation, and pipeline logic
Familiarity with ETL orchestration tools (Airflow, Prefect, or Dagster)
Experience with schema evolution, data versioning, and validation frameworks (Delta Lake, Iceberg, or Great Expectations)
Solid understanding of cloud environments (AWS preferred; GCP or Azure also relevant)
Knowledge of streaming and telemetry data architectures in large-scale, distributed systems

Ways to stand out from the crowd:

Exposure to hardware, firmware, or embedded telemetry environments
Experience with real-time analytics frameworks (Spark Structured Streaming, Flink, Kafka Streams)
Experience with data cataloging or governance tools (DataHub, Collibra, or Alation)
Familiarity with CI/CD for data pipelines and infrastructure-as-code (Terraform, GitHub Actions)
Experience designing performance metrics data systems (latency, throughput, resource utilization) that support high-volume, high-frequency telemetry at scale

With competitive salaries and a generous benefits package, NVIDIA is widely considered one of the technology world’s most desirable employers. Our team comprises some of the most forward-thinking and hardworking individuals in the industry. Due to unprecedented growth, our exclusive engineering teams are rapidly expanding. If you're a creative engineer with a real passion for technology, we want to hear from you.