Build VLM-powered Visual AI Agents Using NVIDIA NIM and NVIDIA VIA Microservices

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA.

Traditional video analytics applications and their development workflow are typically built on fixed-function, limited models that are designed to detect and identify only a select set of predefined objects.

With generative AI, NVIDIA NIM microservices, and foundation models, you can now build applications with fewer models that have broad perception and rich contextual understanding.

The new class of generative AI models, vision language models (VLMs), powers visual AI agents that can understand natural language prompts and perform visual question answering. These agents unlock entirely new application possibilities for a wide range of industries. They significantly streamline app development workflows and deliver transformative new perception capabilities, such as image or video summarization, interactive visual Q&A, and visual alerts.

These visual AI agents will be deployed throughout factories, warehouses, retail stores, airports, traffic intersections, and more. They’ll help operations teams make better decisions using richer insights generated from natural interactions.

NVIDIA NIM and NVIDIA VIA microservices accelerate the development of visual AI agents. In this post, we show you how to build an AI agent with these two technologies, using a summarization microservice that processes large amounts of video with VLMs and NIM microservices and produces curated summaries.


Releasing NVIDIA VIA microservices


NVIDIA VIA microservices, an extension of NVIDIA Metropolis Microservices, are cloud-native building blocks that accelerate the development of visual AI agents powered by VLMs and NIM microservices, whether deployed at the edge or in the cloud. NVIDIA VIA microservices are available now for download in developer preview.

The opportunities for building visual AI agents for new use cases with NVIDIA VIA are endless. These modular microservices give you the flexibility to build visual AI agents and customize them to add sophisticated features.

NVIDIA VIA microservices provide a modular architecture and customizable model support. They support recorded videos as well as live streams, and they offer a REST API for easy integration into existing systems as well as a UI for quick tryouts.

Each NVIDIA VIA microservice is a single container with no dependencies on other containers or microservices. NVIDIA VIA can be easily deployed on stand-alone machines, on-premises servers, or any cloud service provider (CSP).

NVIDIA VIA microservices integration with NVIDIA NIM

NVIDIA VIA microservices can be easily integrated with NVIDIA NIM. You have the flexibility to use any LLM or VLM from the NVIDIA API Catalog, either through the model preview APIs or as downloadable NIM microservices.

By default, NVIDIA VIA uses the OpenAI GPT-4o model as the VLM. For this post, we used the NVIDIA VITA-2.0 model, available on NGC, as the VLM.

NVIDIA VIA uses the NVIDIA-hosted Llama 3 70B NIM microservice as the LLM for NVIDIA NeMo Guardrails and the Context-Aware RAG (CA-RAG) module. You can choose from a wide range of different LLMs and VLMs from the API Catalog, either NVIDIA-hosted or locally deployed.

NVIDIA NIM is a set of microservices that includes industry-standard APIs, domain-specific code, optimized inference engines, and enterprise runtime. It delivers multiple VLMs for building a visual AI agent that can process live or archived images or videos to extract actionable insight using natural language.

Visual AI agent for summarization built using VIA microservices

Most VLMs today accept only a limited number of frames, for example, 8, 10, or 100. They also can’t accurately generate captions for longer videos. For longer videos, such as hour-long videos, the sampled frames could be tens of seconds apart or even farther. This can result in some details being missed or actions not being recognized.

A solution to this problem is to split long videos into smaller chunks, analyze the chunks individually using VLMs, and then summarize and aggregate the results to generate a single summary for the entire file.
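As a rough mental model (not VIA’s actual implementation), the pattern looks like the following Python sketch, where caption_fn stands in for the VLM call on one chunk and summarize_fn for the LLM-based aggregation:

    # A minimal sketch of the chunk -> caption -> aggregate pattern. The two
    # callables are hypothetical placeholders for the VLM and LLM calls that
    # NVIDIA VIA makes internally; this is not VIA's implementation.
    import math
    from typing import Callable

    def summarize_video(
        video_path: str,
        video_duration_s: float,
        chunk_duration_s: float,
        caption_fn: Callable[[str, float, float], str],  # VLM: captions one chunk
        summarize_fn: Callable[[list[str]], str],        # LLM: merges the captions
    ) -> str:
        n_chunks = math.ceil(video_duration_s / chunk_duration_s)
        captions = []
        for i in range(n_chunks):
            start = i * chunk_duration_s
            end = min(start + chunk_duration_s, video_duration_s)
            captions.append(caption_fn(video_path, start, end))
        return summarize_fn(captions)

NVIDIA VIA implements this pipeline for you, with chunking, parallel VLM inference, and LLM aggregation handled by dedicated components.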


Figure 1. High-level architecture of the summarization vision AI agent

The summarization agent consists of the following components:

  • NVIDIA VIA stream handler: Manages the interaction and synchronization with the other components, such as NeMo Guardrails, CA-RAG, the VLM pipeline, chunking, and the Milvus vector DB.
  • NeMo Guardrails: Filters out invalid user prompts. It uses the REST API of an LLM NIM microservice.
  • VLM pipeline: Decodes the video chunks generated by the stream handler, generates embeddings for the chunks using an NVIDIA TensorRT-based visual encoder model, and then uses a VLM to generate a per-chunk response to the user query. It is based on the NVIDIA DeepStream SDK.
  • VectorDB: Stores the intermediate per-chunk VLM responses.
  • CA-RAG module: Extracts useful information from the per-chunk VLM responses and aggregates it to generate a single unified summary. CA-RAG (Context-Aware Retrieval-Augmented Generation) uses the REST API of an LLM NIM microservice; a rough sketch of this aggregation step follows the list.
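As a rough sketch of what that aggregation step does (not VIA’s actual CA-RAG code), the per-chunk captions can be condensed by an LLM NIM microservice. The example below assumes the NVIDIA-hosted Llama 3 70B endpoint from the API Catalog and uses an invented prompt:

    # Rough illustration of aggregating per-chunk captions with an LLM, here the
    # NVIDIA-hosted Llama 3 70B NIM through its OpenAI-compatible API. This is
    # not VIA's CA-RAG code; the prompt wording is invented for the example.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",  # NVIDIA API Catalog endpoint
        api_key=os.environ["NVIDIA_API_KEY"],            # key from build.nvidia.com
    )

    def aggregate_captions(captions: list[str]) -> str:
        joined = "\n".join(f"Chunk {i}: {c}" for i, c in enumerate(captions))
        response = client.chat.completions.create(
            model="meta/llama3-70b-instruct",
            messages=[
                {"role": "system",
                 "content": "Merge these per-chunk video captions into one concise summary."},
                {"role": "user", "content": joined},
            ],
            temperature=0.2,
        )
        return response.choices[0].message.content

In the real microservice, CA-RAG adds context-aware retrieval and configurable summarization methods on top of this basic aggregation idea.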

NVIDIA VIA microservices come with several features:

  • Summarization for videos and live streams
  • Optimal and highly scalable implementation on multiple GPUs
  • Better summarization with the CA-RAG module
  • Enabling summarization for any use case

Summarization for videos and live streams

With NVIDIA VIA, you can easily summarize long video files as well as live streams through the REST API. NVIDIA VIA takes care of the heavy lifting while exposing many configurable parameters.

For file summarization, NVIDIA VIA splits the input file into chunks based on the user-configured chunk duration, chunk overlap duration, and file duration.

For example, an hour-long file with a one-minute chunk duration yields 60 chunks. These chunks are processed in parallel by the VLM pipeline. When all the chunk captions are available, CA-RAG summarizes and aggregates them to generate a single final summary for the file.
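A client request for file summarization might look like the following sketch. The endpoint paths, port, and JSON field names are illustrative placeholders only; check the VIA API reference for the exact schema.

    # Illustrative only: the URL, paths, and JSON fields below are placeholders,
    # not the documented VIA API schema.
    import requests

    VIA_URL = "http://localhost:8000"  # assumed local VIA deployment

    # 1. Upload the video file.
    with open("warehouse.mp4", "rb") as f:
        file_id = requests.post(f"{VIA_URL}/files", files={"file": f}).json()["id"]

    # 2. Request a summary with explicit chunking parameters.
    resp = requests.post(
        f"{VIA_URL}/summarize",
        json={
            "id": file_id,
            "prompt": "Summarize the notable events in this warehouse video.",
            "chunk_duration": 60,          # seconds per chunk
            "chunk_overlap_duration": 10,  # seconds of overlap between chunks
        },
    )
    print(resp.json())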

For live streams, a streaming pipeline receives streaming data from the RTSP server. The NVIDIA VIA microservice continuously generates video-chunk segments based on the user-configured chunk duration. The VLM pipeline then generates the captions for these chunks.

The NVIDIA VIA engine keeps gathering the captions from the VLM pipeline. When enough chunks have been processed, based on the user-configured summary duration, the gathered captions are sent to CA-RAG for summarization and aggregation. The VIA engine then continues processing the next chunks. The summaries are streamed to the client using HTTP server-sent events.
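A client can consume those server-sent events with a streaming HTTP request, roughly as in this sketch (the URL is again a placeholder; the SSE parsing itself is standard):

    # Reading HTTP server-sent events (SSE) from a live-stream summarization
    # endpoint. The URL is a placeholder for an actual VIA deployment.
    import requests

    with requests.get(
        "http://localhost:8000/live-stream/summaries",  # placeholder URL
        stream=True,
    ) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if line and line.startswith("data:"):
                print("New summary:", line[len("data:"):].strip())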

Optimal and highly scalable implementation on multiple GPUs

You can configure the chunking parameters based on the use case (video files or live streams), the video content, and the VLM model.

Video files

chunk_duration: The entire video is divided into segments of chunk_duration length. N frames (the number is VLM-dependent) are sampled from each chunk and sent to the VLM for inference. The chunk duration should be small enough that the N frames can capture the event.

chunk_overlap: If an event occurs at the boundary between two chunks, the sampled frames might not capture the complete event and the model can’t detect it. VIA alleviates this problem by using a sliding-window approach, where chunk_overlap is the overlap duration between consecutive chunks (default: 0).
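The sliding window produces chunk boundaries like the ones computed in this small sketch (parameter names mirror the ones above; this only illustrates the arithmetic and is not VIA code):

    # Sketch of sliding-window chunk boundaries with overlap (all values in seconds).
    def chunk_boundaries(video_duration, chunk_duration, chunk_overlap=0.0):
        """Yield (start, end) pairs; consecutive chunks overlap by chunk_overlap."""
        step = chunk_duration - chunk_overlap
        start = 0.0
        while start < video_duration:
            yield (start, min(start + chunk_duration, video_duration))
            start += step

    # A 10-minute video with 2-minute chunks and 15 seconds of overlap:
    print(list(chunk_boundaries(600, 120, 15)))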

Streams

chunk_duration: As with video files, the live stream is divided into segments of chunk_duration length and sent to the VLM for inference. The chunk duration should be small enough that the N sampled frames can capture the event.

summary_duration: The duration of the stream covered by each summary. This lets the user control how much of the stream a single summary should describe. For instance, if chunk_duration is 1 min and summary_duration is 30 min, the stream is divided into 1-min chunks for VLM inference, and the VLM output of 30 chunks is aggregated to provide the user with a concise 30-min summary.

Example chunking configurations

Example chunking configurations for some sample video file use cases:

tail-gating detection:
     chunk_duration: 2 min
     chunk_overlap: 15 sec

traffic violation (such as a wrong turn):
     chunk_duration: 30 sec
     chunk_overlap: 15 sec

Example chunking configurations for some sample live-stream use cases:

sports summarization:
     chunk_duration: 2 min
     summary_duration: 15 min

robot control:
     chunk_duration: 5 sec
     summary_duration: 5 sec

These are only guidelines; the actual parameters must be tuned by the user for their use case. It’s a tradeoff between accuracy and performance: smaller chunk sizes result in better descriptions but take longer to process.

NVIDIA VIA supports multiple GPUs on a single node. It can efficiently scale by distributing the chunks across multiple GPUs and processing these chunks in parallel.

For VITA-2.0, NVIDIA VIA uses TensorRT-LLM acceleration for better performance. It can also make use of multiple NVDEC engines on a single GPU, accelerating the decoding of the video file. With this scaling, NVIDIA VIA can process an hour-long file in just a few minutes, depending on the system and GPU configuration.
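Conceptually, the chunks are spread across the available GPUs and processed in parallel; a toy version of that distribution is sketched below (VIA’s actual scheduler and TensorRT-LLM integration are more sophisticated):

    # Toy round-robin assignment of chunks to GPUs. This only illustrates why
    # throughput scales with GPU count; it is not VIA's internal scheduler.
    from collections import defaultdict

    def assign_chunks(num_chunks: int, num_gpus: int) -> dict[int, list[int]]:
        plan = defaultdict(list)
        for chunk_id in range(num_chunks):
            plan[chunk_id % num_gpus].append(chunk_id)
        return dict(plan)

    # 60 one-minute chunks spread across an 8-GPU node:
    print(assign_chunks(60, 8))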

Better summarization with the Context-Aware RAG module

NVIDIA VIA includes the CA-RAG module for better summarization results. CA-RAG is responsible for extracting useful information from the per-chunk VLM captions and aggregating and summarizing them.

You can configure various aspects of CA-RAG:

  • Summarization methods
  • The LLM model to use and its parameters
  • LLM prompts to change the response format
  • And more

CA-RAG is based on LangChain and can be extended.
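Purely as an illustration of the kinds of options involved (the key names below are assumptions for this example, not the documented CA-RAG schema), a configuration might group the summarization method, LLM parameters, and prompts roughly like this:

    # Illustrative only: these keys mirror the configurable aspects listed above,
    # not the documented CA-RAG configuration schema.
    ca_rag_config = {
        "summarization": {
            "method": "batch",  # how per-chunk captions are combined (assumed value)
            "llm": {
                "model": "meta/llama3-70b-instruct",
                "temperature": 0.2,
                "max_tokens": 512,
            },
            "prompts": {
                "summary": "Combine the chunk captions into a concise incident report.",
            },
        },
    }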

Enabling summarization for any use case

NVIDIA VIA summarization microservices can be adapted to a wide variety of use cases by modifying prompts, models, chunk parameters, and more. The prompts fall into two categories:

  • VLM prompt: The user can specify what details, events, or actions to extract from the video chunks.
  • LLM prompt: The user can specify how the generated VLM responses should be combined to create the final summary (an example of both prompt types follows this list).
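For instance, a warehouse safety agent might pair the two prompt types along these lines; the wording is invented for illustration:

    # Invented example prompts for a hypothetical warehouse safety use case.
    vlm_prompt = (
        "Describe any forklift movement, people entering restricted areas, "
        "or packages being dropped in this video segment."
    )
    llm_prompt = (
        "Combine the chunk descriptions into a short report that lists each "
        "safety-relevant event with its approximate time."
    )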

We encourage you to experiment with different prompts and chunk lengths to optimize performance and achieve the best results.

Performance

Figure 2 shows the end-to-end summarization time, measured after the video is uploaded, for various video lengths. The four plots correspond to the different chunk sizes used for captioning (lower is better). Summarizing a 50-minute video takes about 50 seconds with a 60-second captioning chunk size. The summarization application here uses the NVIDIA VITA-2.0 model, which is available on NGC.

Figure 2. Performance of a summarization agent using NVIDIA VIA microservices on an 8x H100 GPU system

Visual AI agents using NVIDIA VIA microservices have been verified on the following NVIDIA GPUs:

  • A100
  • H100
  • L40 and L40s
  • A6000

They can also be deployed on other GPU platforms with the NVIDIA Ampere, NVIDIA Hopper, and NVIDIA Ada Lovelace architectures.

Getting started with NVIDIA VIA microservices

Build powerful VLM-based AI agents using NVIDIA VIA microservices and NVIDIA NIM. The REST APIs make it easy to integrate this workflow and VLMs into existing customer applications.


Shaunak Gupte
Senior Software Engineer, Intelligent Video Analytics group, NVIDIA

Shivam Lakhotia
Senior Software Engineer, Intelligent Video Analytics group, NVIDIA

Unnikrishnan Sreekumar
Senior Software Engineer, DeepStream SDK team, NVIDIA

Tushar Khinvasara
Principal Software Engineer, Intelligent Video Analytics group, NVIDIA

Ashwani Agarwal
Senior Software Engineer, Intelligent Video Analytics group, NVIDIA

Prashant Gaikwad
Senior System Software Manager, NVIDIA

Bhushan Rupde
Manager, Intelligent Video Analytics group, NVIDIA

Kaustubh Purandare
Senior Director, Software Engineering, NVIDIA
