Software Manager, Infra Tools AI Team

NVIDIA · Semiconductors · Raanana, Israel

This role is for a Software Engineering Manager to lead a team focused on building AI/LLM-powered infrastructure and tooling for the SONiC Network OS development lifecycle. The team will develop LLM-based tools for failure analysis and test coverage generation, with the aim of raising product quality and improving regression analysis and test efficiency.

What you'd actually do

  1. Lead and mentor a team of infrastructure and tooling engineers; set technical direction, define priorities, and grow team capabilities
  2. Design, build, and maintain scalable infrastructure for development, integration, and test environments supporting SONiC OS
  3. Architect and deliver LLM-based tools for intelligent regression analysis — failure classification, root cause clustering, anomaly detection, and test flakiness prediction
  4. Lead efforts to reduce regression runtime through parallelization, smart test selection, and dependency-aware scheduling
  5. Develop deep technical knowledge of SONiC Network OS internals, including its subsystem architecture, SAI/ASIC abstraction layer, and management plane

Skills

Required

  • 8+ years of overall software engineering experience
  • At least 3 years of experience in a leadership role managing software development teams
  • Proven ability to lead technical teams: hiring, mentoring, technical roadmapping, and cross-team influence
  • Experience developing software testing tools and test infrastructure
  • Strong Python programming skills
  • Experience building production-quality automation frameworks and tooling
  • Demonstrated experience designing and operating CI/CD systems at scale (Jenkins, GitLab CI, GitHub Actions, or equivalent)
  • Hands-on experience with LLMs or AI-assisted developer tooling — building, integrating, or productizing AI capabilities in an engineering workflow
  • Strong analytical and problem-solving skills with a bias toward measurable outcomes and data-driven decisions

Nice to have

  • Deep Linux expertise: system internals, networking stack, process management, and scripting
  • Prior experience building LLM-powered test analysis pipelines or AI-enhanced DevOps tooling in a real production environment
  • Knowledge of networking protocols and hardware: Ethernet switching, L2/L3 protocols, QoS, VLANs, high-performance data center networking
  • Experience with code coverage instrumentation in large-scale C/Python codebases and using coverage data for test prioritization
  • Track record of measurably improving regression runtime, test reliability, or CI throughput in a complex embedded or systems software environment

What the JD emphasized

  • Proven ability to lead technical teams: hiring, mentoring, technical roadmapping, and cross-team influence
  • Hands-on experience with LLMs or AI-assisted developer tooling — building, integrating, or productizing AI capabilities in an engineering workflow
  • Prior experience building LLM-powered test analysis pipelines or AI-enhanced DevOps tooling in a real production environment

Other signals

  • LLM-based tools for intelligent regression analysis
  • AI and LLM capabilities to transform how we analyze failures
  • LLM-powered test analysis pipelines or AI-enhanced DevOps tooling