What you'd actually do

Integrate TensorRT-LLM for BioNeMo models (Boltz1–2, OpenFold2–3) and upcoming structural biology models (RFDiffusion, DiffDock, ProteinMPNN, Evo2, ESM3).

Optimize models for low-latency, high-throughput inference using parallelism, quantization (FP8/INT8), and sparsity/pruning.

Profile and debug deep learning workloads on GPUs, resolving kernel/graph bottlenecks in training/inference, including custom operators.

Develop and validate custom GPU kernels (CUDA, Triton) for hot paths, memory-bound ops, and non-standard blocks in structural biology models.

Collaborate with research to align model architecture and training with deployment constraints for smooth production transition.

Skills

Required

MS/PhD in CS, EE, Comp. Eng., or equivalent practical experience.
5+ years of professional experience in deep learning/applied ML.
Strong foundation in transformer/diffusion architectures.
Proficient in PyTorch (and/or TensorFlow) for production-grade model building, debugging, and deployment.
Strong Python/C++ skills.
Practical experience with TensorRT/TensorRT-LLM.
Familiarity with GPU performance engineering.

Nice to have

Direct experience with LLMs, VLMs, or large biology models (e.g., structure prediction).
Ability to read and modify performance-critical C++/CUDA code for inference stacks and custom ops.
Experience with profiling (Nsight), roofline analysis, and optimization of kernels and memory access.
Experience writing or extending custom GPU kernels for model hot paths.

Join NVIDIA as a Senior Deep Learning Algorithms Engineer to optimize cutting-edge biology and structural biology models, including LLMs and VLMs, for maximum performance and efficiency on NVIDIA GPUs. Focus on world-class inference for workloads like protein structure prediction and design.

As part of BioNeMo, you will collaborate across teams to move next-gen AI models (e.g., Boltz1/2, OpenFold2/3) from research to production serving via TensorRT-LLM and related stacks, ensuring industry-leading, scalable performance for scientists and developers.

What you will be doing:

Integrate TensorRT-LLM for BioNeMo models (Boltz1–2, OpenFold2–3) and upcoming structural biology models (RFDiffusion, DiffDock, ProteinMPNN, Evo2, ESM3).
Optimize models for low-latency, high-throughput inference using parallelism, quantization (FP8/INT8), and sparsity/pruning.
Profile and debug deep learning workloads on GPUs, resolving kernel/graph bottlenecks in training/inference, including custom operators.
Develop and validate custom GPU kernels (CUDA, Triton) for hot paths, memory-bound ops, and non-standard blocks in structural biology models.
Collaborate with research to align model architecture and training with deployment constraints for smooth production transition.

What we want to see:

MS/PhD in CS, EE, Comp. Eng., or equivalent practical experience.
5+ years professional experience in deep learning/applied ML, with a track record of deploying optimized models/inference paths in production (not research prototypes).
Strong foundation in transformer/diffusion architectures; direct experience with LLMs, VLMs, or large biology models (e.g., structure prediction).
Proficient in PyTorch (and/or TensorFlow) for production-grade model building, debugging, and deployment.
Strong Python/C++; ability to read/modify performance-critical C++/CUDA code for inference stacks and custom ops.
Practical experience with TensorRT/TensorRT-LLM: model conversion, optimization, deployment, and performance measurement (latency/throughput) under realistic conditions.
Familiarity with GPU performance engineering: profiling (Nsight), roofline analysis, and optimization of kernels/memory access; experience writing/extending custom GPU kernels for model hot paths is required.

Ways to stand out from the crowd:

Led or significantly contributed to large-scale LLM/VLM/biology model serving (strict SLOs, high QPS, multi-GPU/node inference, cost/perf ownership).
Deep customization of, or substantial contributions to, TensorRT-LLM, vLLM, SGLang, or comparable stacks, including debugging and extending for novel architectures.
End-to-end ownership of FP8/INT8 (or other formats), including calibration, regression testing, and documenting accuracy vs. speed tradeoffs on biology workloads.
Strong familiarity with protein structure, docking, or diffusion-based design and model families (e.g., OpenFold, Boltz, ESM, RFDiffusion, DiffDock)—demonstrated by benchmarks, publications, or open-source work.
Repeated success taking non-text architectures (geometric, multimodal, structure-centric) from research/checkpoint to optimized, production-ready inference with clear metrics. Examples of writing, maintaining, or upstreaming custom kernels or fused ops that produced measurable gains on real models or hardware.

Senior Deep Learning Algorithms Engineer - BioNeMo