Senior MLOps Engineer, GenAI Framework

NVIDIA · Semiconductors · Santa Clara, CA

This role focuses on building and maintaining CI/CD pipelines and release processes for NVIDIA's GenAI frameworks (Megatron-LM, NeMo). It involves implementing scalable DevOps solutions, managing infrastructure (Kubernetes, Docker, Slurm), automating tasks for research and development cycles, and developing quality control measures. The goal is to enable efficient work for GenAI software engineers, DL algorithm engineers, and research scientists, optimizing performance and ensuring high-quality software delivery.

What you'd actually do

  1. Develop and maintain the continuous integration pipelines and release processes of our Generative AI framework and libraries related to Megatron-LM and NeMo Framework.
  2. Implement efficient and scalable DevOps solutions to allow our fast-growing team to release software more frequently while maintaining high quality and maximum performance.
  3. Work with industry-standard tools (Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, Jira) in hybrid on-premises and cloud environments.
  4. Assist with cluster operations and system administration (managing servers, team accounts, and clusters).
  5. Accelerate research and development cycles by automating recurring tasks such as accuracy and performance regression detection.

Skills

Required

  • BS or MS degree in Computer Science, Computer Architecture or related technical field (or equivalent experience) and 3+ years of industry experience in DevOps and infrastructure engineering.
  • Strong system-level programming skills in languages such as Python and shell scripting.
  • Experience with build/release systems and CI/CD solutions such as GitLab, GitHub, Jenkins, etc.
  • Experience with Linux system administration.
  • Experience with containerization and cluster management technologies like Docker and Kubernetes.
  • Experience with build tools, including Make and CMake.
  • A strong background in source code management (SCM) solutions such as GitLab, GitHub, Perforce, etc.
  • Strong problem-solving and debugging skills.
  • Great teammate who can collaborate and influence others in a dynamic environment.
  • Excellent interpersonal and written communication skills.

Nice to have

  • Proven track record with GPU-accelerated systems at scale.
  • Well-versed in DL frameworks such as PyTorch, JAX, or TensorFlow.
  • Expertise in cluster and cloud compute technologies such as Slurm, Lustre, and Kubernetes.
  • Software and hardware benchmarking on high-performance computing systems.

What the JD emphasized

  • GenAI framework software engineers
  • deep learning algorithm engineers
  • research scientists
  • performance optimization
  • accuracy and performance regression detection

Other signals

  • GenAI Frameworks
  • LLM
  • Multimodal
  • Video Generation
  • end-to-end model training
  • deployment
  • performance optimization
  • DevOps tools
  • CI/CD