What you'd actually do

Developing, improving and optimizing scalable infrastructure for handling and deploying security and networking AI models and agents in production, ensuring high availability, scalability, reproducibility, and performance.

Optimizing AI models and agents for performance, scalability, and resource utilization, considering factors such as latency, efficiency, and cost.

Monitoring and deploying agentic systems, LLMs, and ML models in production.

Designing and implementing frameworks/pipelines for AI training, inference, and experimentation.

Collaborating closely with data scientists, security architects and software engineers to operationalize and deploy AI models and agents, including packaging and integration with existing systems. Participating in developing and reviewing code and design documents, and in use case and test plan reviews.

Skills

Required

Python
Java
Scala
TensorFlow
PyTorch
microservices architecture
container orchestration
cloud platforms
scalable infrastructure
inference optimization
CI/CD tools
GitLab
GitHub Actions
Jenkins

Nice to have

network protocols
Linux internals
security protocols
network architectures
firewalls
intrusion detection systems
generative models
network security principles

What the JD emphasized

at least 5 years of experience

deploying and monitoring AI/ML models, LLMs and agents to production systems at scale

Proficiency in microservices architecture, container orchestration, cloud platforms, and scalable infrastructure for training and inference workloads

Knowledge of inference optimization techniques

We're looking for a Senior AI/MLOps Engineer to join a group that specializes in Security and Networking, and specifically ML, AI and agent development. As a Senior AI/MLOps Engineer, you’ll build and maintain the infrastructure, tools and processes necessary to support the AI lifecycle in a production environment. You will collaborate closely with data scientists, software engineers, security architects and DevOps teams to ensure smooth deployment, modeling and optimization of AI models. This role involves creative problem solving alongside engineering teams, and is pivotal for the continued success of AI networking security.

What you’ll be doing:

Developing, improving and optimizing scalable infrastructure for handling and deploying security and networking AI models and agents in production, ensuring high availability, scalability, reproducibility, and performance.
Optimizing AI models and agents for performance, scalability, and resource utilization, considering factors such as latency, efficiency, and cost.
Monitoring and deploying agentic systems, LLMs, and ML models in production.
Designing and implementing frameworks/pipelines for AI training, inference, and experimentation.
Collaborating closely with data scientists, security architects and software engineers to operationalize and deploy AI models and agents, including packaging and integration with existing systems. Participating in developing and reviewing code and design documents, and in use case and test plan reviews.
Collaborating with DevOps teams to integrate pipelines and workflows into the CI/CD process, ensuring flawless deployments and rollbacks.
Building and maintaining monitoring and alerting systems to proactively identify and resolve issues relating to quality, performance and infrastructure.
Implementing access controls, authentication mechanisms, and encryption standards for AI models and data.
Documenting guidelines and standard operating procedures for MLOps/AI processes and sharing knowledge with the wider team.
Developing proofs of concept for new features.

What we need to see:

BSc/MSc in CS/CE or a related field (or equivalent experience).
Strong background in AI with experience deploying and monitoring AI/ML models, LLMs and agents to production systems at scale, including distributed and multi-node environments - at least 5 years of experience.
Proficiency in programming languages such as Python, Java, or Scala, along with experience in using ML/AI frameworks and libraries (e.g. TensorFlow, PyTorch).
Proficiency in microservices architecture, container orchestration, cloud platforms, and scalable infrastructure for training and inference workloads.
Knowledge of inference optimization techniques.
Understanding of build infrastructure and CI/CD tools and practices (e.g. GitLab, GitHub Actions, Jenkins).
You are detail-oriented and care deeply about robust, well tested, high-performance code in production environments.
You are proactive, take full ownership of your deliverables, and have a can-do approach. You have excellent communication and collaboration skills and work effectively in multifunctional teams.

Ways to stand out from the crowd:

Knowledge of network protocols and Linux internals
Security and networking background, with knowledge of security protocols, network architectures, firewalls, intrusion detection systems, and other relevant security and networking concepts
Experience deploying and optimizing generative models and agents
Knowledge of network security principles and practices

NVIDIA has some of the most forward-thinking and hardworking people on the planet working for us and, due to unprecedented growth, our special engineering teams are growing fast. If you're a creative and autonomous engineer with a genuine passion for technology, we want to hear from you.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.