Senior System Software Engineer - AI Data Platform - Inference Factory Optimization

NVIDIA · Semiconductors · Hanoi, Vietnam +1 · Remote

Senior software engineering role focused on building and optimizing the infrastructure that automates deployment and performance tuning of NVIDIA's AI software offerings, with impact on inference applications across a range of hardware platforms.

What you'd actually do

  1. Develop efficient infrastructure and tools for automating complex software processes.
  2. Implement advanced test harnesses, benchmarking frameworks, and analytical tools to rigorously characterize and optimize the performance and efficiency of our software and hardware platforms.
  3. Apply deep knowledge of operating systems, kernel internals, device drivers, memory management, storage, networking, and high-speed interconnects to build and troubleshoot highly performant systems.
  4. Work with engineering teams to understand needs, define requirements, and deliver efficient solutions.
  5. Set performance goals, monitor feedback, analyze data, and make continuous improvements to system reliability.

Skills

Required

  • C++
  • Python
  • Go
  • operating system internals
  • device drivers
  • memory management
  • distributed systems
  • networking protocols
  • cluster management
  • high-performance interconnects
  • automation
  • CI/CD
  • performance engineering

Nice to have

  • AI/machine learning workload optimization
  • inference application optimization
  • containerization
  • Kubernetes
  • performance profiling tools

What the JD emphasized

  • 5+ years of industry experience in software development, focusing on infrastructure, distributed systems, automation, and/or performance engineering.
  • Proven ability to develop robust tools and automation using programming languages such as C++, Python, or Go.
  • Experience with operating system internals, device drivers, memory management, and debugging performance issues in complex compute applications.
  • Experience in designing, building, and operating large-scale distributed systems, with knowledge of networking protocols, cluster management, and high-performance interconnects.
  • Experience building and maintaining automated testing, benchmarking, and continuous integration/continuous deployment pipelines.

Other signals

  • optimizing inference performance
  • building foundational infrastructure for AI
  • automating AI model deployment