This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA.
Building a multimodal retrieval-augmented generation (RAG) system is challenging. The difficulty comes from capturing and indexing information across multiple modalities, including text, images, tables, audio, video, and more. In our previous post, An Easy Introduction to Multimodal Retrieval-Augmented Generation, we discussed how to tackle text and images. This post extends the conversation to audio and video. Specifically, we explore how to build a multimodal RAG pipeline to search for information in videos.
Building RAG for text, images, and videos
Building on first principles, we can say that there are three approaches for building a RAG pipeline that works across multiple modalities, as detailed below and in Figure 1.
Using a common embedding space
The first approach for building a RAG pipeline that works across multiple modalities is using a common embedding space. This approach relies on a single model to project representations of information stored across different modalities into the same embedding space. Using models like CLIP that have encoders for both images and text falls into this category. The upside of this approach is reduced architectural complexity. Depending on the diversity of the data used to train the model, it can also offer flexibility across use cases.
The downside of this approach is that it’s difficult to tune a model that can handle more than two modalities, or even a multitude of submodalities. For example, CLIP performs well with natural images and matching them to text descriptions, but it does not handle text-only content or synthetic images well. Fine-tuning can improve model performance, but creating a single embedding model that encodes all forms of information is not an easy task.
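To make the common-embedding-space idea concrete, below is a minimal sketch, not part of the original pipeline, that embeds an image and two text queries into CLIP's shared space using the open `openai/clip-vit-base-patch32` checkpoint from Hugging Face transformers; the image path and query strings are placeholders.

```python
# Minimal sketch: embedding an image and text into CLIP's shared space.
# Assumes the "openai/clip-vit-base-patch32" checkpoint and the Hugging Face
# transformers library; not the pipeline described later in this post.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("frame.jpg")                    # hypothetical local image
texts = ["a speaker presenting a slide", "a dog playing in a park"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Both embeddings live in the same space, so cosine similarity is meaningful.
img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
print(img_emb @ txt_emb.T)                         # similarity of the image to each text
```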
Building N parallel retrieval pipelines (brute force)
A second method is to build a native search pipeline per modality (or even per submodality) and query all of the pipelines. This results in multiple sets of chunks spread across different modalities, and two issues arise. First, the number of tokens that need to be ingested by a large language model (LLM) to generate an answer increases massively, which increases the cost of running the RAG pipeline. Second, an LLM that can absorb information across multiple modalities is needed. This approach simply moves the problem from the retrieval phase to the generation phase and increases the cost, but in turn simplifies the ingestion process and infrastructure.
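As an illustration of this fan-out pattern (a sketch on our part, not code from the post), the snippet below queries one pipeline per modality in parallel and collects every chunk; the per-modality search functions are trivial stand-ins for real retrievers.

```python
# Sketch of the brute-force fan-out: one retrieval pipeline per modality, all
# queried in parallel, with every result handed to a (multimodal) LLM.
from concurrent.futures import ThreadPoolExecutor

def search_text(query, top_k=5):   return [f"text chunk for '{query}'"]
def search_images(query, top_k=5): return [f"image match for '{query}'"]
def search_video(query, top_k=5):  return [f"video clip match for '{query}'"]

pipelines = {"text": search_text, "image": search_images, "video": search_video}

def retrieve_all(query: str, top_k: int = 5):
    """Query every modality-native pipeline and collect all chunks."""
    with ThreadPoolExecutor() as pool:
        futures = {m: pool.submit(fn, query, top_k) for m, fn in pipelines.items()}
        return {m: f.result() for m, f in futures.items()}

chunks = retrieve_all("How do I install the CUDA toolkit?")
# Every modality's chunks must now fit into the (multimodal) LLM's context,
# which is where the token count and the cost balloon.
print(sum(len(v) for v in chunks.values()), "chunks across", len(chunks), "modalities")
```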
Grounding in a common modality
Lastly, information can be ingested from all modalities and grounded in one common modality, such as text. This means that all the key information from images, PDFs, videos, audio, and so on is converted into text when setting up the pipeline. This approach incurs some ingestion cost, and the conversion can be lossy, but it effectively unifies all modalities for both retrieval and generation.
Figure 1. Three different approaches that can be adopted to build a multimodal retrieval pipeline
Text grounding provides the flexibility to handle multiple submodalities and enables targeted model tuning, simplifying both search and answer generation at the price of a one-time ingestion cost. It serves as an excellent architectural backbone for building reliable RAG pipelines.
The following sections explore how to use this philosophy to build a multimodal RAG pipeline to search for information in videos.
Figure 2. Text grounding involves transforming information from multiple modalities into one. This transformation happens as part of file ingestion using modality-specific ingestion algorithms and models
Complexities with retrieving videos
Video content comes in all shapes and sizes—short clips on social media, lengthy tutorials, educational series, entertainment, and even surveillance footage. Each type of content holds information in its own unique way, making video retrieval a bit of a balancing act.
Think of this variety on a spectrum (Figure 3). The horizontal axis shows video content ranging from highly unstructured (such as real-world footage) to structured (such as tutorials). The vertical axis captures how information is conveyed: at one end, videos are rich in action sequences, where temporal information is critical to understanding the video; at the other end, videos consist of frames that can each stand alone without relying on sequence for understanding.
Figure 3. Different types of videos require different architectural considerations
Another point to note is that information is spread across both the audio and the visual tracks. While we discussed grounding in a common modality like text as one reliable method for setting up a RAG pipeline, we also need to ensure that the information extracted as text from these two modalities is aligned appropriately.
There are essentially two ways in which information is encoded in audio:
- Emotive: This information generally carries the underlying context about how the information is delivered. For instance, it may be the emotive tone of the speaker or a soaring background score meant to evoke emotions in the viewer.
- Objective: The linguistic content itself (the words spoken) is considered objective information, as it leaves little room for subjectivity.
To process visuals, there are three challenges to consider:
- Cost of processing: Processing a video is computationally intensive. Each second of video typically contains 30 or 60 frames, depending on the frame rate, which quickly adds up in terms of storage and processing. For example, a 10-minute video at 30 frames per second (FPS) contains 18,000 frames, each requiring some level of processing to extract content and retrieve accurately.
- Extracting information from frames: As discussed in previous posts, it is difficult to extract and represent information from images because of the density of information they contain.
- Preserving “actions” spread across multiple frames: Successive frames can capture specific actions that carry significant information. Recognizing actions in the larger context and representing that information with the correct weight is hard.
Building RAG for video
To provide an example use case, imagine that we want to ask questions about a repository of video explainers in which a concept is being explained or a method is being demonstrated. These videos include lecture recordings, meeting recordings, keynote presentations, and educational how-to and step-by-step videos. Because this use case is primarily focused on communicating and explaining information, we don’t need to focus on emotive information from the audio or on preserving actions from the visuals.
Figure 4 shows a high-level architecture of the pipeline with five major segments:
- Audio ingestion
- Video ingestion
- Blending audio and video information
- Setting up the retriever
- Generating answers
Figure 4. System architecture for video ingestion and retrieval
Audio ingestion
For this use case, no emotive connotation needs to be encoded, so the audio just needs to be transcribed. To that end, we employ an automatic speech recognition (ASR) pipeline built using the NVIDIA Parakeet-CTC-0.6B-ASR model. Using the model, we transcribe the audio and create interim text chunks along with word-level timestamps for the utterances. For cases that do require emotive information, speech emotion recognition models or an audio language model can be used to produce text-based descriptions.
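As a rough sketch of this step, the code below groups word-level timestamps into interim text chunks. The `transcribe_with_word_timestamps` function is a hypothetical placeholder for the actual Parakeet/Riva ASR call, stubbed with example output so the chunking logic runs on its own.

```python
# Sketch: turning word-level ASR timestamps into interim text chunks.
def transcribe_with_word_timestamps(audio_path):
    # Placeholder for the real ASR invocation, which would return
    # [{"word": ..., "start": seconds, "end": seconds}, ...]
    return [
        {"word": "CUDA", "start": 0.0, "end": 0.4},
        {"word": "toolkit", "start": 0.4, "end": 0.9},
        {"word": "installation", "start": 0.9, "end": 1.6},
    ]

def chunk_words(words, max_gap=1.0, max_duration=30.0):
    """Group words into interim chunks, splitting on long pauses or max duration."""
    chunks, current = [], []
    for w in words:
        if current and (w["start"] - current[-1]["end"] > max_gap
                        or w["end"] - current[0]["start"] > max_duration):
            chunks.append(current)
            current = []
        current.append(w)
    if current:
        chunks.append(current)
    return [{"text": " ".join(w["word"] for w in c),
             "start": c[0]["start"], "end": c[-1]["end"]} for c in chunks]

words = transcribe_with_word_timestamps("talk.wav")   # hypothetical audio file
print(chunk_words(words))
```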
Video ingestion
Given the use case, we will focus on reducing the cost of processing the video and extracting information from frames.
Most videos are recorded and stored at either 30 or 60 FPS, which means that 1 minute of video can include up to 3,600 frames. While this high frame rate makes for a pleasant viewing experience, there is usually very little difference in information between successive frames. The brute-force method would be to process every frame, but that would be extremely expensive. Therefore, we need to reduce the number of candidate frames to be processed.
A simple first step is to downsample the video to 4 FPS, drastically reducing the number of frames to be processed to 240 per minute. Even then, successive frames usually differ only marginally because the time between them is small, so most frames still contain overlapping information.
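A minimal sketch of this downsampling step, assuming OpenCV and a hypothetical input file, is shown below; it keeps every Nth frame to approximate 4 FPS and records a timestamp for each kept frame.

```python
# Sketch: downsampling a video to ~4 FPS by keeping every Nth frame.
import cv2

def sample_frames(video_path, target_fps=4.0):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0     # fall back if FPS is unknown
    step = max(int(round(native_fps / target_fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            timestamp = idx / native_fps               # seconds into the video
            frames.append((timestamp, frame))
        idx += 1
    cap.release()
    return frames

frames = sample_frames("tutorial.mp4")                 # ~240 frames per minute at 4 FPS
print(len(frames), "candidate frames")
```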
The natural next step is to identify the key frames that are local maxima for the amount of information they carry, which involves three steps.
First, chapterize the video by identifying shot boundaries. This can be done using classical computer vision techniques that track changes in the colorspace, or by using a shot detection model. Chapterization creates a local context for judging the information present in the frames.
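One possible classical implementation of chapterization (a sketch, not the exact method used in the post) compares HSV color histograms of successive downsampled frames and marks a shot boundary when the histogram distance spikes; the threshold is a heuristic that would need tuning per corpus.

```python
# Sketch: classical shot-boundary detection via HSV histogram differences.
import cv2

def hsv_hist(frame, bins=32):
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def detect_shot_boundaries(frames, threshold=0.4):
    """frames: list of (timestamp, frame) pairs. Returns boundary timestamps."""
    boundaries, prev = [], None
    for ts, frame in frames:
        hist = hsv_hist(frame)
        if prev is not None:
            # Bhattacharyya distance: 0 = identical histograms, 1 = completely different.
            dist = cv2.compareHist(prev, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > threshold:
                boundaries.append(ts)
        prev = hist
    return boundaries

# boundaries = detect_shot_boundaries(frames)   # frames from the 4 FPS sampler above
```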
Second, identify “key clips” from each chapter. This requires identifying all the clips where visuals are perceptually unique or capture some unique activity. This can be done using either an image encoder or a signal processing filter that can capture the difference between two successive frames.
For the sake of simplicity, suppose we use the Structural Similarity Index (SSIM) to calculate the difference between successive frames (Figure 5). A clip starts at the frame where dissimilarity exceeds one standard deviation from the mean for that scene, and it ends at the next local maximum of the similarity metric. The one-standard-deviation threshold flags genuinely unique changes, while waiting until similarity peaks again captures the bulk of the motion, as successive structurally similar frames drive the similarity score back up.
Figure 5. Structural Similarity Index plotted across a scene
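A sketch of this key-clip logic, assuming scikit-image's `structural_similarity` and the `(timestamp, frame)` pairs produced earlier, might look like the following.

```python
# Sketch: key-clip detection from an SSIM series within one scene.
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def ssim_series(frames):
    """frames: list of (timestamp, frame). SSIM of each frame vs. the previous one."""
    scores, prev = [], None
    for _, frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            scores.append(ssim(prev, gray))
        prev = gray
    return np.array(scores)

def find_key_clips(scores):
    mean, std = scores.mean(), scores.std()
    clips, start = [], None
    for i in range(1, len(scores) - 1):
        if start is None and scores[i] < mean - std:
            start = i                                   # unusually dissimilar frame
        elif start is not None and scores[i] >= scores[i - 1] and scores[i] >= scores[i + 1]:
            clips.append((start, i))                    # next local maximum of similarity
            start = None
    return clips

# scores = ssim_series(scene_frames)   # frames belonging to one scene/chapter
# clips = find_key_clips(scores)
```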
Finally, once these clips are detected, we reject all blurred frames and duplicates. We then select the frames with high entropy, as they carry significantly more information.
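A simple way to implement the blur and entropy filters (a sketch with heuristic thresholds, not the post's exact implementation) uses the variance of the Laplacian for sharpness and the grayscale histogram entropy for information density.

```python
# Sketch: rejecting blurred frames and keeping high-entropy frames.
import cv2
import numpy as np

def is_blurred(frame, threshold=100.0):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

def entropy(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_frames(frames, min_entropy=4.0):
    """frames: list of (timestamp, frame). Keep sharp, information-dense frames."""
    return [(ts, f) for ts, f in frames
            if not is_blurred(f) and entropy(f) >= min_entropy]
```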
Altogether, these steps reduce the number of frames to process to 40 per minute, considerably fewer than the original 3,600. Note that we are using classical computer vision algorithms for this explanation. Using fine-tuned models for frame selection and deduplication will yield even fewer selected frames, further reducing the processing required.
Now that we have the representative frames extracted, the next step is to extract all possible information from them. For this, we use a Llama-3-90B VLM NIM. We prompt the VLM to transcribe all the text and information on screen, as well as to generate a semantic description of the frame.
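A hedged sketch of this call is shown below, assuming an OpenAI-compatible chat completions endpoint; the base URL, model name, and API key are placeholders, and the exact invocation for the VLM NIM should be taken from the NIM documentation.

```python
# Sketch: asking a VLM to transcribe on-screen text and describe a key frame.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://your-nim-endpoint/v1", api_key="YOUR_API_KEY")  # placeholders

def describe_frame(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="your-vlm-model",                        # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all text visible on screen, then give a "
                         "concise semantic description of what the frame shows."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=512,
    )
    return response.choices[0].message.content
```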
Blending audio and video information
Textual information extracted from the audio and the visual content is stitched together to get a consolidated extraction from the video. NVIDIA Riva ASR transcribes with word-level timestamps, making it easy to correlate the transcript back to the selected key frames.
Remember that the information from the previously selected key frames must align with the audio playing around each frame; this gives the key frame temporal context. One straightforward approach is to blend the information at the scene level using the scene timestamps generated earlier. We pull out the audio transcript within the scene’s timestamps and append the text extracted from its key frames to the audio chunk, creating a blob of text for the scene.
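A minimal sketch of scene-level blending, reusing the word-timestamp and frame-description structures from the earlier sketches, could look like this.

```python
# Sketch: blending audio and visual text at the scene level. Each scene's blob
# is the transcript words in its time window plus the VLM descriptions of the
# key frames selected from that scene.
def blend_scene(scene_start, scene_end, words, frame_descriptions):
    """
    words: [{"word": str, "start": float, "end": float}, ...] from ASR
    frame_descriptions: [(timestamp, description_text), ...] from the VLM
    """
    audio_text = " ".join(
        w["word"] for w in words if scene_start <= w["start"] < scene_end
    )
    visual_text = "\n".join(
        f"[frame @ {ts:.1f}s] {desc}"
        for ts, desc in frame_descriptions
        if scene_start <= ts < scene_end
    )
    return {
        "start": scene_start,
        "end": scene_end,
        "text": f"Narration: {audio_text}\nVisuals:\n{visual_text}",
    }
```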
It’s also possible to blend the visual content with the audio content at the timestamp level instead of the scene level. Note that either approach can produce duplicate information when the visual content is appended directly to the audio. For example, if the presenter is talking over a slide, the audio and visual content will overlap heavily, generating extra tokens and adding latency and cost.
Alternatively, we can use a smaller LLM to help fold the extracted visual context into the audio transcript. This additional LLM call reduces tokens during ingestion, which saves time and cost during real-time retrieval and answer generation.
Setting up the retriever
After text-grounded audio-visual blending, we have a coherent text description of the video. We also retain the timestamps for word-level utterances and frames, along with file-level metadata such as the file name. Using this information, we create chunks augmented with the metadata and generate embeddings using an embedding model. These embeddings are stored in a vector database along with chunk-level timestamps as metadata.
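As an illustrative stand-in for the embedding model and vector database (a sketch under those assumptions, not the production setup), the snippet below embeds the scene-level chunks with sentence-transformers and indexes them in FAISS, keeping the chunk dictionaries around as the metadata store.

```python
# Sketch: embedding scene-level chunks and indexing them with their metadata.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")      # assumed embedding model

def build_index(chunks):
    """chunks: [{"text": ..., "start": ..., "end": ..., "file": ...}, ...]"""
    vectors = embedder.encode([c["text"] for c in chunks], normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])          # inner product == cosine here
    index.add(np.asarray(vectors, dtype="float32"))
    return index, chunks                                  # chunks double as the metadata store
```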
Generating answers
With the vector store set up, it’s now possible to talk to your videos. When a user query arrives, embed the query, retrieve candidate chunks, and then rerank them to surface the most relevant ones. These chunks are served to the LLM as context to generate an answer. The metadata attached to the chunks provides the referenced videos and the timestamps that were used to answer the question.
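A sketch of the query path, reusing the `embedder` and FAISS index from the previous snippet and calling an LLM through an OpenAI-compatible endpoint with a placeholder model name and URL (reranking is omitted for brevity), might look like this.

```python
# Sketch: retrieve relevant scene chunks and generate a cited answer.
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="https://your-llm-endpoint/v1", api_key="YOUR_API_KEY")  # placeholders

def answer(query, index, chunks, top_k=4):
    # embedder and index come from the indexing sketch above
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), top_k)
    hits = [chunks[i] for i in ids[0]]
    context = "\n\n".join(
        f"[{c['file']} {c['start']:.0f}-{c['end']:.0f}s]\n{c['text']}" for c in hits
    )
    response = client.chat.completions.create(
        model="your-llm-model",                           # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer using only the provided video context. "
                        "Cite the file and timestamps you used."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```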
Figure 6. A RAG pipeline with CUDA Setup Tutorials as the knowledge base
Get started
Ready to start building multimodal RAG pipelines like the one discussed in this post? Use the NVIDIA NIM microservices and NVIDIA Blueprint examples in the NVIDIA API catalog.
Related resources
- GTC session: Retrieval Augmented Generation: Overview of Design Systems, Data, and Customization
- GTC session: Generative AI Theater: Addressing Challenges of Unstructured Enterprise Data With Multimodal Retrieval-Augmented Generation
- GTC session: Generative AI Theater: Building a Multimodal Retrieval Augmented Generation Chatbot
- SDK: NeMo Retriever
- Webinar: Achieve World-Class Text Retrieval Accuracy for Production-Ready Generative AI
- Webinar: Building Intelligent AI Chatbots Using RAG
Tanay Varshney
Senior Developer Advocate Engineer, NVIDIA
Annie Surla
Developer Advocate Engineer, NVIDIA