Multimodal Large Language Models

LLMs and MLLMs

The past decade-plus has seen incredible progress in practical computer vision. Thanks to deep learning, computer vision is dramatically more robust and accessible, and has enabled compelling capabilities in thousands of applications, from automotive safety to healthcare. But today’s widely used deep learning techniques suffer from serious limitations. Often, they struggle when confronted with ambiguity (e.g., are those people fighting or dancing?) or with challenging imaging conditions (e.g., is that shadow in the fog a person or a shrub?). And, for many product developers, computer vision remains out of reach due to the cost and complexity of obtaining the necessary training data, or due to lack of necessary technical skills.

Recent advances in large language models (LLMs) and their variants, such as vision language models (VLMs, which comprehend both images and text), hold the key to overcoming these challenges. VLMs are an example of multimodal large language models (MLLMs), which integrate multiple data modalities such as language, images, audio, and video to enable complex cross-modal understanding and generation tasks. MLLMs represent a significant evolution in AI by combining the capabilities of LLMs with multimodal processing to handle diverse inputs and outputs.

The purpose of this portal is to facilitate awareness of, and education regarding, the challenges and opportunities in using LLMs, VLMs, and other types of MLLMs in practical applications, especially applications involving edge AI and machine perception. The content that follows (which is updated regularly) discusses these topics. As a starting point, we encourage you to watch the recording of the symposium “Your Next Computer Vision Model Might be an LLM: Generative AI and the Move From Large Language Models to Vision Language Models,” sponsored by the Edge AI and Vision Alliance. A preview video of the symposium introduction by Jeff Bier, Founder of the Alliance, follows:

If there are topics related to LLMs, VLMs or other types of MLLMs that you’d like to learn about and don’t find covered below, please email us at [email protected] and we’ll consider adding content on these topics in the future.

BrainChip Demonstration of LLM-RAG with a Custom Trained TENNs Model

Kurt Manninen, Senior Solutions Architect at BrainChip, demonstrates the company’s latest edge AI and vision technologies and products at the September 2024 Edge AI and Vision Alliance Forum. Specifically, Manninen demonstrates his company’s Temporal Event-Based Neural Network (TENN) foundational large language model with 330M parameters, augmented with a Retrieval-Augmented Generation (RAG) output to replace user

How AI and Smart Glasses Give You a New Perspective on Real Life

This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. When smart glasses are paired with generative artificial intelligence, they become the ideal way to interact with your digital assistant. They may be shades, but smart glasses are poised to give you a clearer view of everything

Using Generative AI to Enable Robots to Reason and Act with ReMEmbR

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Vision-language models (VLMs) combine the powerful language understanding of foundational LLMs with the vision capabilities of vision transformers (ViTs) by projecting text and images into the same embedding space. They can take unstructured multimodal data, reason over

“Entering the Era of Multimodal Perception,” a Presentation from Connected Vision Advisors

Simon Morris, Serial Tech Entrepreneur and Start-Up Advisor at Connected Vision Advisors, presents the “Entering the Era of Multimodal Perception” tutorial at the May 2024 Embedded Vision Summit. Humans rely on multiple senses to quickly and accurately obtain the most important information we need. Similarly, developers have begun using multiple…

SiMa.ai Expands ONE Platform for Edge AI with MLSoC Modalix, a New Product Family for Generative AI

Industry’s first multi-modal, software-centric edge AI platform supports any edge AI model from CNNs to multi-modal GenAI and everything in between with scalable performance per watt. SAN JOSE, Calif.–(BUSINESS WIRE)–SiMa.ai, the software-centric, embedded edge machine learning system-on-chip (MLSoC) company, today announced MLSoC™ Modalix, the industry’s first multi-modal edge AI product family. SiMa.ai MLSoC Modalix supports

“Unveiling the Power of Multimodal Large Language Models: Revolutionizing Perceptual AI,” a Presentation from BenchSci

István Fehérvári, Director of Data and ML at BenchSci, presents the “Unveiling the Power of Multimodal Large Language Models: Revolutionizing Perceptual AI” tutorial at the May 2024 Embedded Vision Summit. Multimodal large language models represent a transformative breakthrough in artificial intelligence, blending the power of natural language processing with visual…

May 2024 Embedded Vision Summit Opening Remarks (May 23)

Jeff Bier, Founder of the Edge AI and Vision Alliance, welcomes attendees to the May 2024 Embedded Vision Summit on May 23, 2024. Bier provides an overview of the edge AI and vision market opportunities, challenges, solutions and trends. He also introduces the Edge AI and Vision Alliance and the…

May 2024 Embedded Vision Summit Opening Remarks (May 22)

Jeff Bier, Founder of the Edge AI and Vision Alliance, welcomes attendees to the May 2024 Embedded Vision Summit on May 22, 2024. Bier provides an overview of the edge AI and vision market opportunities, challenges, solutions and trends. He also introduces the Edge AI and Vision Alliance and the…

“Understand the Multimodal World with Minimal Supervision,” a Keynote Presentation from Yong Jae Lee

Yong Jae Lee, Associate Professor in the Department of Computer Sciences at the University of Wisconsin-Madison and CEO of GivernyAI, presents the “Learning to Understand Our Multimodal World with Minimal Supervision” tutorial at the May 2024 Embedded Vision Summit. The field of computer vision is undergoing another profound change. Recently,…

Snapdragon Powers the Future of AI in Smart Glasses. Here’s How

This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. A Snapdragon Insider chats with Qualcomm Technologies’ Said Bakadir about the future of smart glasses and Qualcomm Technologies’ role in turning it into a critical AI tool. Artificial intelligence (AI) is increasingly winding its way through our

Build VLM-powered Visual AI Agents Using NVIDIA NIM and NVIDIA VIA Microservices

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Traditional video analytics applications and their development workflow are typically built on fixed-function, limited models that are designed to detect and identify only a select set of predefined objects. With generative AI, NVIDIA NIM microservices, and foundation

Quantization: Unlocking Scalability for Large Language Models

This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. Find out how LLM quantization solves the challenges of making AI work on device. In the rapidly evolving world of artificial intelligence (AI), the growth of large language models (LLMs) has been nothing short of astounding. These

Nota AI Demonstration of Elevating Traffic Safety with Vision Language Models

Tae-Ho Kim, CTO and Co-founder of Nota AI, demonstrates the company’s latest edge AI and vision technologies and products at the 2024 Embedded Vision Summit. Specifically, Kim demonstrates his company’s Vision Language Model (VLM) solution, designed to elevate vehicle safety. Advanced models analyze and interpret visual data to prevent accidents and enhance driving experiences. The

Develop Generative AI-powered Visual AI Agents for the Edge

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. An exciting breakthrough in AI technology—Vision Language Models (VLMs)—offers a more dynamic and flexible method for video analysis. VLMs enable users to interact with image and video input using natural language, making the technology more accessible and

Here you’ll find a wealth of practical technical insights and expert advice to help you bring AI and visual intelligence into your products without flying blind.

Contact

Address

Berkeley Design Technology, Inc.
PO Box #4446
Walnut Creek, CA 94596

Phone

+1 (925) 954-1411