Senior Software Engineer, AIOps

NVIDIA · Semiconductors · Raanana, Israel +1

NVIDIA is seeking a Senior Software Engineer for its AIOps platform team to build core distributed systems that ingest telemetry from GPU clusters and operationalize predictive AI models. The role involves architecting an agentic AIOps system, handling high-scale data engineering, and building model-serving infrastructure for SaaS and on-premises deployments.

What you'd actually do

  1. Architect and build an agentic AIOps system that autonomously monitors GPU fleet health, aggregates and correlates massive telemetry streams, surfaces intelligent alerts, and orchestrates multi-step diagnostic workflows and corrective actions - powering real-time dashboards, automated root-cause analysis, and proactive incident response.
  2. Research, evaluate, and prototype data storage strategies and data representations across diverse database technologies and modalities, ensuring AI models are trained on high-quality, well-structured data that improves predictive accuracy and generalization.
  3. High-Scale Engineering: Design distributed systems to handle the extreme telemetry density of large-scale AI clusters, ensuring efficient data ingestion, processing, and real-time analysis.
  4. Instrument services with deep observability (metrics, logs, traces) to support rapid debugging and continuous performance improvement.
  5. Build and own the model-serving infrastructure that operationalizes predictive algorithms at scale - packaging, versioning, deploying, and monitoring AI models in both SaaS and on-premises environments.

Skills

Required

  • B.Sc./M.Sc. in Computer Science, Computer Engineering, or a related technical field
  • 8+ years of software engineering experience building production distributed systems
  • Expert-level proficiency in languages such as Go, C++, or Rust, with a focus on high-performance, concurrent architectures
  • Solid understanding of Kubernetes and container-based deployments for production services
  • Experience deploying, monitoring, and maintaining ML models or data-intensive services in a production environment
  • Comfort working in ambiguous, fast-moving environments where the product is still being shaped

Nice to have

  • Experience building ML model-serving platforms or MLOps tooling (model registries, A/B rollout frameworks, feature stores) at scale
  • A track record of taking systems from prototype to stable, production-grade platform serving real enterprise customers
  • A "Systems" Thinker: You don't just write software; you understand the full stack, from how data moves across the wire to how it's processed in a distributed cluster.
  • Practical Innovation: The ability to simplify complex problems and build internal tools or frameworks that empower other engineering teams to move faster.

What the JD emphasized

  • mission-critical
  • operationalize predictive AI models at scale
  • agentic AIOps system
  • multi-step diagnostic workflows
  • High-Scale Engineering
  • extreme telemetry density
  • real-time dashboards
  • automated root-cause analysis
  • proactive incident response
  • production distributed systems
  • high-performance, concurrent architectures
  • production environment
  • ambiguous, fast-moving environments
  • ML model-serving platforms
  • MLOps tooling
  • stable, production-grade platform
  • enterprise customers

Other signals

  • operationalize predictive AI models at scale
  • agentic AIOps system
  • model-serving infrastructure