Multimodal Large Language Models

LLMs and MLLMs

The past decade-plus has seen incredible progress in practical computer vision. Thanks to deep learning, computer vision is dramatically more robust and accessible, and has enabled compelling capabilities in thousands of applications, from automotive safety to healthcare. But today’s widely used deep learning techniques suffer from serious limitations. Often, they struggle when confronted with ambiguity (e.g., are those people fighting or dancing?) or with challenging imaging conditions (e.g., is that shadow in the fog a person or a shrub?). And, for many product developers, computer vision remains out of reach due to the cost and complexity of obtaining the necessary training data, or due to lack of necessary technical skills.

Recent advances in large language models (LLMs) and their variants such as vision language models (VLMs, which comprehend both images and text), hold the key to overcoming these challenges. VLMs are an example of multimodal large language models (MLLMs), which integrate multiple data modalities such as language, images, audio, and video to enable complex cross-modal understanding and generation tasks. MLLMs represent a significant evolution in AI by combining the capabilities of LLMs with multimodal processing to handle diverse inputs and outputs.

The purpose of this portal is to facilitate awareness of, and education regarding, the challenges and opportunities in using LLMs, VLMs, and other types of MLLMs in practical applications — especially applications involving edge AI and machine perception. The content that follows (which is updated regularly) discusses these topics. As a starting point, we encourage you to watch the recording of the symposium “Your Next Computer Vision Model Might be an LLM: Generative AI and the Move From Large Language Models to Vision Language Models“, sponsored by the Edge AI and Vision Alliance. A preview video of the symposium introduction by Jeff Bier, Founder of the Alliance, follows:

If there are topics related to LLMs, VLMs or other types of MLLMs that you’d like to learn about and don’t find covered below, please email us at [email protected] and we’ll consider adding content on these topics in the future.

View all LLM and MLLM Content

Smart Glasses for the Consumer Market

October 15, 2024

There are currently about 250 companies in the head mounted wearables category and these companies in aggregate have received over $5B in funding. $700M has been invested in this category just since the beginning of the year. On the M&A front, there have already been a number of significant acquisitions in the space, notably the

“Data-efficient and Generalizable: The Domain-specific Small Vision Model Revolution,” a Presentation from Pixel Scientia Labs

October 10, 2024

Heather Couture, Founder and Computer Vision Consultant at Pixel Scientia Labs, presents the “Data-efficient and Generalizable: The Domain-specific Small Vision Model Revolution” tutorial at the May 2024 Embedded Vision Summit. Large vision models (LVMs) trained on a large and diverse set of imagery are revitalizing computer vision, just as LLMs… “Data-efficient and Generalizable: The Domain-specific

Accelerating LLMs with llama.cpp on NVIDIA RTX Systems

October 8, 2024

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. The NVIDIA RTX AI for Windows PCs platform offers a thriving ecosystem of thousands of open-source models for application developers to leverage and integrate into Windows applications. Notably, llama.cpp is one popular tool, with over 65K GitHub

“Bridging Vision and Language: Designing, Training and Deploying Multimodal Large Language Models,” a Presentation from Meta Reality Labs

October 7, 2024

Adel Ahmadyan, Staff Engineer at Meta Reality Labs, presents the “Bridging Vision and Language: Designing, Training and Deploying Multimodal Large Language Models” tutorial at the May 2024 Embedded Vision Summit. In this talk, Ahmadyan explores the use of multimodal large language models in real-world edge applications. He begins by explaining… “Bridging Vision and Language: Designing,

Qualcomm Partners with Meta to Support Llama 3.2. Why This is a Big Deal for On-device AI

October 4, 2024

This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. On-device artificial intelligence (AI) is critical to making your everyday AI experiences fast and security-rich. That’s why it’s such a win that Qualcomm Technologies and Meta have worked together to support the Llama 3.2 large language models (LLMs)

Deploying Accelerated Llama 3.2 from the Edge to the Cloud

October 2, 2024

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Expanding the open-source Meta Llama collection of models, the Llama 3.2 collection includes vision language models (VLMs), small language models (SLMs), and an updated Llama Guard model with support for vision. When paired with the NVIDIA accelerated

BrainChip Demonstration of LLM-RAG with a Custom Trained TENNs Model

September 30, 2024

Kurt Manninen, Senior Solutions Architect at BrainChip, demonstrates the company’s latest edge AI and vision technologies and products at the September 2024 Edge AI and Vision Alliance Forum. Specifically, Manninen demonstrates his company’s Temporal Event-Based Neural Network (TENN) foundational large language model with 330M parameters, augmented with a Retrieval-Augmented Generative (RAG) output to replace user

How AI and Smart Glasses Give You a New Perspective on Real Life

September 27, 2024

This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. When smart glasses are paired with generative artificial intelligence, they become the ideal way to interact with your digital assistant They may be shades, but smart glasses are poised to give you a clearer view of everything

Using Generative AI to Enable Robots to Reason and Act with ReMEmbR

September 25, 2024

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Vision-language models (VLMs) combine the powerful language understanding of foundational LLMs with the vision capabilities of vision transformers (ViTs) by projecting text and images into the same embedding space. They can take unstructured multimodal data, reason over

“Entering the Era of Multimodal Perception,” a Presentation from Connected Vision Advisors

September 16, 2024

Simon Morris, Serial Tech Entrepreneur and Start-Up Advisor at Connected Vision Advisors, presents the “Entering the Era of Multimodal Perception” tutorial at the May 2024 Embedded Vision Summit. Humans rely on multiple senses to quickly and accurately obtain the most important information we need. Similarly, developers have begun using multiple types of sensors to improve

SiMa.ai Expands ONE Platform for Edge AI with MLSoC Modalix, a New Product Family for Generative AI

September 10, 2024

Industry’s first multi-modal, software-centric edge AI platform supports any edge AI model from CNNs to multi-modal GenAI and everything in between with scalable performance per watt SAN JOSE, Calif.–(BUSINESS WIRE)–SiMa.ai, the software-centric, embedded edge machine learning system-on-chip (MLSoC) company, today announced MLSoC™ Modalix, the industry’s first multi-modal edge AI product family. SiMa.ai MLSoC Modalix supports

“Unveiling the Power of Multimodal Large Language Models: Revolutionizing Perceptual AI,” a Presentation from BenchSci

September 10, 2024

István Fehérvári, Director of Data and ML at BenchSci, presents the “Unveiling the Power of Multimodal Large Language Models: Revolutionizing Perceptual AI” tutorial at the May 2024 Embedded Vision Summit. Multimodal large language models represent a transformative breakthrough in artificial intelligence, blending the power of natural language processing with visual understanding. In this talk, Fehérvári

Multimodal AI is Having Its Moment In the Sun. Here’s Why It’s So Important

September 5, 2024

This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. Multimodal AI takes in different inputs like text, images or video, allowing digital assistants to better understand the world and you, and gets supercharged when it’s able to run on your device As smart as generative artificial

“Multimodal LLMs at the Edge: Are We There Yet?,” An Embedded Vision Summit Expert Panel Discussion

September 3, 2024

Sally Ward-Foxton, Senior Reporter at EE Times, moderates the “Multimodal LLMs at the Edge: Are We There Yet?” Expert Panel at the May 2024 Embedded Vision Summit. Other panelists include Adel Ahmadyan, Staff Engineer at Meta Reality Labs, Jilei Hou, Vice President of Engineering and Head of AI Research at Qualcomm Technologies, Pete Warden, CEO

May 2024 Embedded Vision Summit Opening Remarks (May 23)

September 2, 2024

Jeff Bier, Founder of the Edge AI and Vision Alliance, welcomes attendees to the May 2024 Embedded Vision Summit on May 23, 2024. Bier provides an overview of the edge AI and vision market opportunities, challenges, solutions and trends. He also introduces the Edge AI and Vision Alliance and the resources it offers for both

May 2024 Embedded Vision Summit Opening Remarks (May 22)