Senior Software Engineer, at Scale Compute Analysis

NVIDIA · Semiconductors · Santa Clara, CA +4 · Remote

This role analyzes large-scale datacenter workloads on GPU-accelerated clusters, turning telemetry and workload data into findings and visualizations. You would partner with engineering teams and apply ML/DL techniques for categorization and forecasting, integrating them into the tools the team already uses. The emphasis is on analyzing complex datasets, debugging data issues, communicating trends clearly, and building practical visualizations and lightweight ML/DL implementations within existing workflows.

What you'd actually do

  1. Analyze large-scale workloads and infrastructure signals to find application and platform improvement opportunities.
  2. Work with high-dimensional data: spot trends, tie changes to known events, summarize conclusions, and communicate results to engineers and leadership.
  3. Partner with the team to clarify questions, scope analyses, and document methods so others can extend your work.
  4. Build and maintain practical visualizations and lightweight implementations (e.g. ML/DL for classification/prediction) inside existing software workflows.

Skills

Required

  • Python
  • JavaScript
  • telemetry / observability stacks
  • core ML concepts
  • analytical and problem-solving skills
  • collaboration and communication

Nice to have

  • TensorFlow
  • PyTorch
  • Linux
  • HPC / large-scale or performance-sensitive environments
  • visualizing high-dimensional problems

What the JD emphasized

  • 5+ years analyzing complex datasets, debugging data issues, and communicating trends clearly.
  • Hands-on use of telemetry / observability stacks (e.g. Grafana, Elasticsearch, Splunk).
  • Demonstrated grasp of core ML concepts.

Other signals

  • Apply machine learning and deep learning techniques for categorization and forecasting