Senior System Software Engineer - Data Platform Observability

NVIDIA · Semiconductors · Santa Clara, CA +2 · Remote

NVIDIA is seeking a Senior System Software Engineer to lead the evolution of its next-generation Data & Observability Platform, which serves AI, HW, and SW engineering teams. The role involves architecting high-performance ingestion pipelines, building policy enforcement systems, developing a unified user interface and APIs, optimizing storage, and driving platform automation.

What you'd actually do

  1. Architect High-Performance Ingestion: Design and build centralized telemetry pipelines capable of handling massive scale. You will solve global latency challenges by implementing modern, push-based edge collection architectures to replace legacy proxy models.
  2. Build Policy Enforcement Systems: Design and implement the technical infrastructure for data governance, policy engines, access control enforcement points, secure credential management, and audit logging. The team is looking for someone who has built governance controls into a platform, not just administered them.
  3. Focus on User Experience: Develop a modern web interface and APIs that unify distinct observability signals into a seamless, consolidated user experience.
  4. Optimize Storage & Cost: Implement cost-effective tiered storage architectures. You will define strategies for routing high-volume data to cold storage solutions to reduce costs while maintaining multi-year data retention.
  5. Drive Platform Automation: Architect workflow orchestration systems to automate platform maintenance, data lifecycle management, and complex pipeline operations.

Skills

Required

  • BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience)
  • 8+ years of full-stack software development experience with a focus on Data Platforms or Infrastructure Tools
  • Strong Full-Stack Fluency: Proficiency in high-performance backend systems programming and modern frontend web frameworks for building responsive user interfaces (Python, JS, Java, Rust, Go, React, or similar)
  • Observability Expertise: Experience with observability platforms such as Apache Spark, Elasticsearch/OpenSearch, Grafana, Prometheus, and similar open-source tools. Hands-on experience operating and extending the Grafana Ecosystem or ELK stack at scale. You understand the internals of time-series databases and inverted indexes.
  • Infrastructure-as-Code: Experience deploying complex stateful services on Kubernetes using Helm, Terraform, or Ansible.
  • Streaming & Storage: Familiarity with event streaming and modern data lake formats

Nice to have

  • Experience writing custom Grafana data source plugins or backend plugins in Go.
  • Background in migrating legacy monoliths to microservices or Vector-based pipelines.
  • Experience with OpenTelemetry (OTEL) collector configuration, writing custom processors, or instrumentation SDKs.
  • Background in Data Governance, including implementation of Policy-as-Code or compliance frameworks in a regulated environment.

What the JD emphasized

  • built governance controls into a platform, not just administered them
  • Observability Expertise
  • Hands-on experience operating and extending the Grafana Ecosystem or ELK stack at scale
  • Experience with OpenTelemetry (OTEL) collector configuration, writing custom processors, or instrumentation SDKs
  • Background in Data Governance, including implementation of Policy-as-Code or compliance frameworks in a regulated environment