Principal Data Platform Architect

NVIDIA · Semiconductors · Santa Clara, CA

This role is for a Principal Data Platform Architect at NVIDIA, responsible for defining and implementing a vision for the distributed data platform and observability systems that serve large-scale AI and HPC clusters. The architect will design systems for data collection, aggregation, storage, retrieval, and visualization to improve the efficiency and performance of AI/HPC workloads, and will lead teams in developing and deploying observability solutions.

What you'd actually do

  1. Collaborate with AI, HW, and SW engineering and research teams to define a vision and roadmap for AI/HPC cluster observability.
  2. Architect and lead teams to develop, test, and deploy data collectors, pipelines, and visualization and retrieval services.
  3. Define data collection and retention policies to balance network bandwidth, system load, and storage capacity costs with data analysis requirements.
  4. Work in a diverse team to provide operational and strategic data to empower our engineers and researchers to improve performance, productivity, and efficiency.
  5. Continuously improve quality, workloads, and processes through better observability.

Skills

Required

  • Python
  • JS
  • Java
  • databases (relational and non-relational)
  • Apache Spark
  • Elastic/OpenSearch
  • Grafana
  • Prometheus

Nice to have

  • computer science
  • machine learning
  • deep learning
  • open-source software
  • infrastructure technologies
  • GPU technology
  • infrastructure software
  • production application software development
  • software development
  • release and support methodology
  • devops
  • management of datacenters
  • large-scale distributed computing
  • AI researchers
  • EDA developers
  • driving process improvements
  • measuring efficiency
  • sharing knowledge and experience
  • driving complex projects end-to-end

What the JD emphasized

  • Experience designing and building large-scale, distributed observability systems.
  • Experience with observability platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and other similar open-source tools.
  • 15+ years of relevant experience.