Senior AI and HPC Observability Engineer

NVIDIA · Semiconductors · Santa Clara, CA +1

NVIDIA is seeking a Senior AI and HPC Observability Engineer to design and scale observability platforms and telemetry pipelines for AI/ML systems. The role involves building high-performance backend services, developing OpenTelemetry components, optimizing metrics pipelines, and ensuring platform reliability in distributed environments, with a focus on AI and HPC workloads.

What you'd actually do

  1. Design and scale observability platforms handling high-volume metrics, logs, and traces across distributed environments
  2. Build high-performance backend services for telemetry ingestion, processing, and routing
  3. Develop and extend OpenTelemetry collectors, processors, exporters, and instrumentation libraries
  4. Build and optimize metrics pipelines using large-scale time-series storage systems
  5. Design and operate real-time and batch telemetry pipelines using streaming and distributed data technologies

Skills

Required

  • Python, Go, or Java
  • modern observability architectures
  • PromQL
  • time-series data systems
  • distributed data pipelines (Kafka, Spark, or Flink)
  • Kubernetes
  • cloud-native infrastructure
  • distributed systems
  • concurrency
  • fault-tolerant system design
  • debugging
  • performance tuning
  • production operations

Nice to have

  • AI, GPU, or HPC environments
  • OpenTelemetry
  • Prometheus
  • Kafka
  • data engineering
  • time-series data modeling
  • real-time performance tuning
  • integrating observability with AI/ML pipelines
  • GPU workload monitoring
  • intelligent alerting
  • statistical or machine learning techniques for anomaly detection, correlation, or predictive insights

What the JD emphasized

  • production environments
  • production-quality software
  • production operations skills
  • AI, GPU, or HPC environments
  • high-volume distributed telemetry pipelines

Other signals

  • observability platforms
  • telemetry pipelines
  • high-throughput
  • distributed systems
  • production-grade coding
  • operational excellence
  • AI & HPC environments