Senior Software Engineer, AIOps

NVIDIA · Semiconductors · Raanana, Israel +1

NVIDIA is seeking a Senior Software Engineer for its AIOps platform team to build core distributed systems for ingesting telemetry from GPU clusters and operationalizing predictive AI models. The role involves architecting an agentic AIOps system, handling high-scale data engineering, and building model-serving infrastructure for SaaS and on-premises deployments.

What you'd actually do

Architect and build an agentic AIOps system that autonomously monitors GPU fleet health, aggregates and correlates massive telemetry streams, surfaces intelligent alerts, and orchestrates multi-step diagnostic workflows and corrective actions - powering real-time dashboards, automated root-cause analysis, and proactive incident response.
Research, evaluate, and prototype data storage strategies and data representations across diverse database technologies and modalities, ensuring AI models are trained on high-quality, well-structured data that improves predictive accuracy and generalization.
High-Scale Engineering: Design distributed systems to handle the extreme telemetry density of large-scale AI clusters, ensuring efficient data ingestion, processing, and real-time analysis.
Instrument services with deep observability (metrics, logs, traces) to support rapid debugging and continuous performance improvement.
Build and own the model-serving infrastructure that operationalizes predictive algorithms at scale - packaging, versioning, deploying, and monitoring AI models in both SaaS and on-premises environments.

Skills

Required

B.Sc./M.Sc. in Computer Science, Computer Engineering, or a related technical field
8+ years of software engineering experience building production distributed systems
Expert-level proficiency in languages such as Go, C++, or Rust, with a focus on high-performance, concurrent architectures
Solid understanding of Kubernetes and container-based deployments for production services
Experience deploying, monitoring, and maintaining ML models or data-intensive services in a production environment
Comfort working in ambiguous, fast-moving environments where the product is still being shaped

Nice to have

Experience building ML model-serving platforms or MLOps tooling (model registries, A/B rollout frameworks, feature stores) at scale
A track record of taking systems from prototype to a stable, production-grade platform serving real enterprise customers
A "Systems" Thinker: You don't just write software; you understand the full stack, from how data moves across the wire to how it’s processed in a distributed cluster.
Practical Innovation: The ability to simplify complex problems and build internal tools or frameworks that empower other engineering teams to move faster.

What the JD emphasized

mission-critical
operationalize predictive AI models at scale
agentic AIOps system
multi-step diagnostic workflows
High-Scale Engineering
extreme telemetry density
real-time dashboards
automated root-cause analysis
proactive incident response
production distributed systems
high-performance, concurrent architectures
production environment
ambiguous, fast-moving environments
ML model-serving platforms
MLOps tooling
stable, production-grade platform
enterprise customers

Other signals

operationalize predictive AI models at scale
agentic AIOps system
model-serving infrastructure

NVIDIA is powering the world's most advanced AI Factories. To ensure their seamless operation, we are building a mission-critical Observability and Prediction platform - delivered as both a high-scale SaaS solution and a robust on-premises deployment for our largest enterprise customers.

We are looking for a Senior Software Engineer to join the AIOps platform team and help build the core distributed systems that ingest massive telemetry streams from GPU clusters and operationalize predictive AI models at scale. You will work at the intersection of high-performance data engineering and production ML, turning research algorithms into reliable, mission-critical software.

What you'll be doing:

Architect and build an agentic AIOps system that autonomously monitors GPU fleet health, aggregates and correlates massive telemetry streams, surfaces intelligent alerts, and orchestrates multi-step diagnostic workflows and corrective actions - powering real-time dashboards, automated root-cause analysis, and proactive incident response.
Research, evaluate, and prototype data storage strategies and data representations across diverse database technologies and modalities, ensuring AI models are trained on high-quality, well-structured data that improves predictive accuracy and generalization.
High-Scale Engineering: Design distributed systems to handle the extreme telemetry density of large-scale AI clusters, ensuring efficient data ingestion, processing, and real-time analysis.
Instrument services with deep observability (metrics, logs, traces) to support rapid debugging and continuous performance improvement.
Build and own the model-serving infrastructure that operationalizes predictive algorithms at scale - packaging, versioning, deploying, and monitoring AI models in both SaaS and on-premises environments.
Contribute to the platform's core libraries and abstractions that accelerate development across the broader AIOps engineering team.

What we need to see:

B.Sc./M.Sc. in Computer Science, Computer Engineering, or a related technical field.
8+ years of software engineering experience building production distributed systems.
Core Systems Programming: Expert-level proficiency in languages such as Go, C++, or Rust, with a focus on high-performance, concurrent architectures.
Solid understanding of Kubernetes and container-based deployments for production services.
Experience deploying, monitoring, and maintaining ML models or data-intensive services in a production environment.
Comfort working in ambiguous, fast-moving environments where the product is still being shaped.

Ways to stand out from the crowd:

Experience building ML model-serving platforms or MLOps tooling (model registries, A/B rollout frameworks, feature stores) at scale.
A track record of taking systems from prototype to a stable, production-grade platform serving real enterprise customers.
A "Systems" Thinker: You don't just write software; you understand the full stack, from how data moves across the wire to how it’s processed in a distributed cluster.
Practical Innovation: The ability to simplify complex problems and build internal tools or frameworks that empower other engineering teams to move faster.

With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you are passionate about building mission-critical systems at the frontier of AI infrastructure, we want to hear from you.
