This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA.
Vision language models (VLMs) are evolving at a breakneck speed. In 2020, the first VLMs revolutionized the generative AI landscape by bringing visual understanding to large language models (LLMs) through the use of a vision encoder. These initial VLMs were limited in their abilities, able to understand only text and single-image inputs.
Fast-forward a few years and VLMs are now capable of understanding multi-image and video inputs to perform advanced vision-language tasks such as visual question-answering (VQA), captioning, search, and summarization.
Figure 1. Suite of model customization tools in NVIDIA NeMo
VLM accuracy can be improved for specific use cases by tuning the prompts and model weights. Fine-tuning the model weights can be done with efficient techniques such as parameter-efficient fine-tuning (PEFT), but it still requires sufficient data and compute.
However, prompt engineering and in-context learning can also be used to improve output quality, and they are as simple as adjusting the text input to the model at runtime.
This post walks you through the evolution of VLMs and how to effectively prompt a VLM for single image, multi-image, and video understanding use cases. For more information about VLMs and visual AI agents, register for the Vision for All: Unlocking Video Analytics with AI Agents webinar.
Single-image understanding
From a single image, a VLM is capable of identifying, describing, classifying, and reasoning over what is in the image. Some VLMs have advanced capabilities, such as providing bounding box coordinates for certain objects and translating text in the image. This works best for use cases where an image needs to be classified, captioned, or described in text.
Single-image capability can also be used to detect basic events occurring in a livestream. This can be accomplished by sampling one frame at a time from the livestream and providing it to the VLM along with a prompt describing what to look for. This works especially well when the events to detect are large, clearly visible, and slow-moving.
The VLM can also be prompted to output in a structured format such as JSON so that the response can easily be parsed and sent to a database or a notification service.
For example, we provided Llama-3.2-90b-vision-instruct with the following prompt and image (Figure 2).
Figure 2. Input image to the VLM for question and answering
Prompt: Is there a fire truck? Is there a fire? Are there firefighters? Output the answer to each question in JSON format.
VLM response:
[ { "question": "Is there a fire truck?", "answer": "True" },
{ "question": "Is there a fire?", "answer": "True" },
{ "question": "Are there firefighters?", "answer": "True" }
]
For each of the questions, the VLM responded accurately and output the answer in a structured format that can be used for downstream tasks.
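To make this concrete, the following is a minimal sketch of a single-frame request with a structured-output prompt, assuming an OpenAI-compatible chat endpoint such as those hosted on build.nvidia.com. The base URL, model identifier, image path, and message schema are illustrative assumptions; some hosted VLMs expect images to be passed differently, so check the documentation for the model you use.

```python
import base64
import json
import os

from openai import OpenAI  # pip install openai

# Assumption: the VLM is served behind an OpenAI-compatible endpoint.
# The base URL and model identifier below are illustrative; adjust for your deployment.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

def to_data_url(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

prompt = (
    "Is there a fire truck? Is there a fire? Are there firefighters? "
    "Output the answer to each question in JSON format."
)

response = client.chat.completions.create(
    model="meta/llama-3.2-90b-vision-instruct",  # illustrative model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": to_data_url("frame.jpg")}},
        ],
    }],
)

# Parse the structured answer so it can be forwarded to a database or alert service.
# In practice you may need to strip markdown code fences from the reply first.
answers = json.loads(response.choices[0].message.content)
for item in answers:
    print(item["question"], "->", item["answer"])
```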
While this method enables the VLM to operate on video input by sampling one frame at a time, the VLM itself does not understand the full context of the video, limiting it to simple use cases. Sampling one frame at a time does not give it the temporal understanding needed to detect actions or trends that occur over time. This can be improved with multi-image and video VLMs.
Multi-image understanding
Many VLMs can accept multiple images as input, allowing the model to compare, contrast, and learn from the input images.
VLMs generally have good zero-shot capability across a wide range of use cases. However, when trying to obtain more precise answers from a VLM on domain-specific use cases, VLMs can struggle because they lack the full context required to accurately respond to the user prompts.
To better understand this, we used a VLM to estimate the stock level of products. This is an important retail use case, as it could autonomously monitor stock and alert employees when products are running low in a store, avoiding lost revenue from stock-outs.
We provided Cosmos Nemotron 34B with a single image of a table that should be fully stocked with several snack products (Figure 3) and asked it to estimate the stock level.
Figure 3. Single-image VLM input for retail stock level detection
Prompt: Estimate the stock level of the snack table on a scale of 0–100.
VLM response:
The snack table appears to be well-stocked, with a variety of snacks and treats. The shelves are filled with an assortment of items, including chips, cookies, and other snacks. The table is also neatly organized, with each item placed in a designated spot. Based on the visual, I would rate the stock level at around 80-90, indicating that the table is well-stocked and ready to cater to a crowd.
The model rates it as 80–90% stocked even though it’s quite clear to the human eye that the stock level is low. With only a single image, the model has no way of knowing what the table looked like beforehand.
Using the Cosmos Nemotron 34B multi-image capability, we provided another image of the snack table at a 100% stock level (Figure 4), giving the model extra context to produce a more accurate estimate.
Figure 4. In-context learning input to VLM for retail stock level detection
Prompt: First compare and contrast the stock level of the two images. Then generate an estimate for each image of the stock level on a scale of 0–100%.
VLM response:
The first image shows a well-stocked display with a variety of snacks, while the second image shows a more sparse display with only a few items. Based on the visual comparison, I estimate that the first image has a stock level of around 80-90%, while the second image has a stock level of around 20-30%.
By providing the additional image as a reference, the VLM now estimates the stock level of the second image at 20–30%, which is significantly more accurate.
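As a rough sketch of how such an in-context reference image can be packed into a single request alongside the query image, the snippet below again assumes an OpenAI-compatible endpoint; the model identifier, file names, and message schema are illustrative assumptions.

```python
import base64
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # illustrative endpoint
    api_key=os.environ["NVIDIA_API_KEY"],
)

def to_data_url(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

prompt = (
    "First compare and contrast the stock level of the two images. Then generate "
    "an estimate for each image of the stock level on a scale of 0-100%."
)

response = client.chat.completions.create(
    model="nvidia/cosmos-nemotron-34b",  # illustrative model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Reference image: the fully stocked table (extra context).
            {"type": "image_url", "image_url": {"url": to_data_url("table_full.jpg")}},
            # Query image: the table in its current state.
            {"type": "image_url", "image_url": {"url": to_data_url("table_now.jpg")}},
        ],
    }],
)
print(response.choices[0].message.content)
```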
Multi-image input can also be used in more complex multimodal RAG pipelines, where example images are dynamically added to the prompt based on the user’s input. Multi-image VLMs can also be used for basic video understanding by sampling a small set of frames from a video and providing them to the VLM, as sketched below.
However, many VLMs that accept multi-image input have limited context windows (10–20 frames) and may not have been trained on video data. To truly understand video input, we recommend using a VLM trained on video-specific datasets and with a long context window.
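The frame-sampling step itself is straightforward. Below is a minimal sketch that uniformly samples frames from a video file with OpenCV and encodes them as base64 data URLs, ready to be attached as images to a multi-image prompt like the ones above; the file name and frame count are illustrative.

```python
import base64

import cv2  # pip install opencv-python

def sample_frames(video_path: str, num_frames: int = 8) -> list[str]:
    """Uniformly sample frames from a video, in temporal order, and return
    them as base64 JPEG data URLs for use as image inputs to a VLM."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    urls = []
    for i in range(num_frames):
        # Seek to the i-th uniformly spaced frame index.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if not ok:
            break
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            urls.append(
                "data:image/jpeg;base64," + base64.b64encode(buf.tobytes()).decode()
            )
    cap.release()
    return urls

# Each returned data URL can be added as a separate image in the request payload.
frames = sample_frames("warehouse.mp4", num_frames=8)
```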
Video understanding
The latest VLMs have full video understanding capabilities, with large context lengths, frame compression, and video training data. This enables them to ingest longer videos and gives them improved temporal understanding: the ability to understand actions and trends over time.
The first example showed how to use a single frame from a livestream to detect a fire, but that same method couldn’t tell you whether the fire is growing or shrinking because it lacks temporal context.
A VLM with long context and video understanding processes many frames across time to understand whether the fire is growing or not. Here are some examples from the Cosmos Nemotron family of models.
Sequential visual understanding
VLMs with sequential understanding rely on processing frames in a sequence, enabling them to capture the progression of actions and events over time. Unlike multi-image models, which process a limited number of frames without explicit temporal connections, sequential video models incorporate temporal dynamics by analyzing frames in order.
This is helpful in situations where you want to capture the causality of events in a video, as it enables the model to track the progression of actions over time and establish cause-effect relationships.
The next few examples are based on the following video, in which a worker drops a box in a warehouse. We discuss a few prompts and responses.
Figure 5. Input warehouse video to the VLM for scene understanding
Prompt: What happened in this video?
VLM response:
The video shows a warehouse worker in a yellow safety vest and hard hat performing various tasks. The worker is seen walking through the warehouse, carrying boxes, and interacting with the shelving units. The warehouse has green metal shelving with labels like "C", "D", "E", and "F" on the aisles.
The worker appears to be in the process of organizing or retrieving items from the shelves. At one point, they're seen bending down to pick up a box from the floor. The worker's movements suggest they're actively engaged in their duties, possibly sorting, stocking, or preparing items for shipment.
The warehouse environment is well-lit, with fluorescent lighting illuminating the space. The worker's safety gear, including the reflective vest and hard hat, indicates adherence to safety protocols in this industrial setting.
Overall, the video captures a typical day in the life of a warehouse worker, showcasing the physical nature of the job and the organized layout of the storage facility.
The model gave a detailed response with specifics about the environment, such as aisle labels and lighting conditions, but failed to identify that the worker dropped a box. Now try asking this question directly.
Prompt: Did the worker drop any box?
VLM response:
Yes, the worker did drop a box. In the video, we can see a box lying on the floor in front of the worker. This suggests that the worker has accidentally dropped the box while handling it.
The VLM is now able to deduce that the worker might have dropped the box. This suggests that generic prompts might not capture all the nuances correctly unless you direct the VLM to look for something specific. Modify the prompt a little.
Prompt: Elaborate on the visual and narrative elements of this warehouse video in detail. Highlight all anomalies or hazards if present.
VLM response:
The video captures a warehouse with tall metal shelving units labeled with letters 'C', 'D', 'E', and 'F'. The shelves are stocked with various items, including boxes and packages. The floor is concrete, and the lighting is artificial, typical of an indoor industrial setting. A worker wearing a high-visibility vest, a hard hat, and dark pants is seen walking down the aisle between the shelves. The worker is carrying a cardboard box and then drops the box on the floor.
The model now captures all the important details of the scene along with important events, such as the worker dropping a box. The detailed response is useful if you want to perform Q&A with the video later. VLMs with video understanding are also able to answer directional questions, as they can understand actions over time.
Prompt: From which direction did the worker enter the scene?
VLM response:
The worker entered the scene from the right side.
However, these models often lack precise temporal localization, which can limit their ability to pinpoint when specific events occur.
Prompt: When did the worker drop the box?
VLM response:
The worker dropped the box in the second image.
Temporal localization
VLMs incorporating LITA (Language Instructed Temporal-Localization Assistant) or similar temporal localization techniques elevate video understanding by explicitly learning when and where critical events occur. These models achieve this by introducing temporal attention mechanisms that focus on key segments of a video, aligning visual and textual modalities more effectively over time. This helps answer the “when” questions about a video.
Prompt: When did the worker drop the box?
VLM response:
0:04-0:06 The worker dropped the box between 4s and 6s. During this time, the worker is seen dropping the box on the ground.
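If you want to act on such a response downstream, for example to clip the relevant segment of the video, the timestamp range can be pulled out programmatically. The following is a small illustrative helper that assumes the M:SS-M:SS format shown in the response above; real responses may vary, so treat the pattern as an assumption.

```python
import re

def parse_time_range(text: str) -> tuple[int, int] | None:
    """Extract the first M:SS-M:SS range from a temporal-localization response
    and return it as (start_seconds, end_seconds)."""
    match = re.search(r"(\d+):(\d{2})\s*-\s*(\d+):(\d{2})", text)
    if match is None:
        return None
    m1, s1, m2, s2 = (int(g) for g in match.groups())
    return m1 * 60 + s1, m2 * 60 + s2

reply = "0:04-0:06 The worker dropped the box between 4s and 6s."
print(parse_time_range(reply))  # -> (4, 6)
```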
The optimal prompt format for a VLM depends on the model’s architecture and the nature of the caption pairs used during training. Different training datasets influence how a VLM interprets the prompt.
Conclusion
This post walked through how VLMs have evolved from supporting only single-image input to being capable of complex temporal reasoning on long video inputs. To get started with VLMs, visit build.nvidia.com and try out some of the prompts shown in this post. For technical questions, see the Visual AI Agent forum.
VLMs can be used to build various visual AI agents. Explore one of the sample vision AI workflows on the /NVIDIA/metropolis-nim-workflows GitHub repo with no GPU required. To build a video analytics AI agent, try the AI blueprint for video search and summarization.
For more information about VLMs and visual AI agents, register for the upcoming Vision for All: Unlocking Video Analytics with AI Agents webinar.
For more information about LLM prompting, see An Introduction to Large Language Models: Prompt Engineering and P-Tuning.
Related resources
- DLI course: Building Real-Time Video AI Applications
- GTC session: Designing VLM-Based AI Agents for Large-Scale Video Analysis
- GTC session: Vision Language Models at NVIDIA
- GTC session: Build Next-Gen Agents With Large Vision Language Models
- NGC Containers: VLM Inference Service (Jetson)
- Webinar: Vision for All: Unlocking Video Analytics With AI Agents
Shubham Agrawal
AI Developer Technology Engineer, NVIDIA
Samuel Ochoa
Technical Marketing Engineer, Metropolis Team, NVIDIA