Multimodal Large Language Models

LLMs and MLLMs

The past decade-plus has seen incredible progress in practical computer vision. Thanks to deep learning, computer vision is dramatically more robust and accessible, and has enabled compelling capabilities in thousands of applications, from automotive safety to healthcare. But today’s widely used deep learning techniques suffer from serious limitations. Often, they struggle when confronted with ambiguity (e.g., are those people fighting or dancing?) or with challenging imaging conditions (e.g., is that shadow in the fog a person or a shrub?). And, for many product developers, computer vision remains out of reach due to the cost and complexity of obtaining the necessary training data, or due to lack of necessary technical skills.

Recent advances in large language models (LLMs) and their variants such as vision language models (VLMs, which comprehend both images and text), hold the key to overcoming these challenges. VLMs are an example of multimodal large language models (MLLMs), which integrate multiple data modalities such as language, images, audio, and video to enable complex cross-modal understanding and generation tasks. MLLMs represent a significant evolution in AI by combining the capabilities of LLMs with multimodal processing to handle diverse inputs and outputs.

The purpose of this portal is to facilitate awareness of, and education regarding, the challenges and opportunities in using LLMs, VLMs, and other types of MLLMs in practical applications — especially applications involving  edge AI and machine perception. The content that follows (which is updated regularly) discusses these topics. As a starting point, we encourage you to watch the recording of the symposium “Your Next Computer Vision Model Might be an LLM: Generative AI and the Move From Large Language Models to Vision Language Models“, sponsored by the Edge AI and Vision Alliance. A preview video of the symposium introduction by Jeff Bier, Founder of the Alliance, follows:


If there are topics related to LLMs, VLMs or other types of MLLMs that you’d like to learn about and don’t find covered below, please email us at [email protected] and we’ll consider adding content on these topics in the future.

View all LLM and MLLM Content

“Vision Language Models for Regulatory Compliance, Quality Control and Safety Applications,” a Presentation from Camio

Carter Maslan, CEO of Camio, presents the “Vision Language Models for Regulatory Compliance, Quality Control and Safety Applications” tutorial at the December 2024 Edge AI and Vision Innovation Forum. In this presentation, you’ll learn how vision language models interpret policy text to enable much more sophisticated understanding of scenes and human behavior compared with current-generation

Read More »

Snapdragon Summit’s AI Highlights: A Look at the Future of On-device AI

This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. Qualcomm Technologies sets new standards in AI performance for its latest mobile, automotive and Qualcomm AI Hub advancements Our annual Snapdragon Summit wrapped up with exciting new announcements centered on the future of on-device artificial intelligence (AI).

Read More »

How to Accelerate Larger LLMs Locally on RTX With LM Studio

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. GPU offloading makes massive models accessible on local RTX AI PCs and workstations. Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware,

Read More »

Introducing the First AMD 1B Language Models: AMD OLMo

This blog post was originally published at AMD’s website. It is reprinted here with the permission of AMD. In recent years, the rapid development of artificial intelligence technology, especially the progress in large language models (LLMs), has garnered significant attention and discussion. From the emergence of ChatGPT to subsequent models like GPT-4 and Llama, these

Read More »

Qualcomm and Mistral AI Partner to Bring New Generative AI Models to Edge Devices

Highlights: Qualcomm announces collaboration with Mistral AI to bring Mistral AI’s models to devices powered by Snapdragon and Qualcomm platforms. Mistral AI’s new state-of-the-art models, Ministral 3B and Ministral 8B, are being optimized to run on devices powered by the new Snapdragon 8 Elite Mobile Platform, Snapdragon Cockpit Elite and Snapdragon Ride Elite, and Snapdragon

Read More »

Qualcomm Announces Multi-year Strategic Collaboration with Google to Deliver Generative AI Digital Cockpit Solutions

Highlights: Qualcomm and Google will leverage Snapdragon Digital Chassis and Google’s in-vehicle technologies to produce a standardized reference framework for development of generative AI-enabled digital cockpits and software-defined vehicles (SDV). Qualcomm to lead go-to-market efforts for scaling and customization of joint solution with the broader automotive ecosystem. Companies’ collaboration demonstrates power of co-innovation, empowering automakers

Read More »

“Using Vision Systems, Generative Models and Reinforcement Learning for Sports Analytics,” a Presentation from Sportlogiq

Mehrsan Javan, Chief Technology Officer at Sportlogiq, presents the “Using Vision Systems, Generative Models and Reinforcement Learning for Sports Analytics” tutorial at the May 2024 Embedded Vision Summit. At a high level, sport analytics systems can be broken into two components: sensory data collection and analytical models that turn sensory… “Using Vision Systems, Generative Models

Read More »

Smart Glasses for the Consumer Market

There are currently about 250 companies in the head mounted wearables category and these companies in aggregate have received over $5B in funding. $700M has been invested in this category just since the beginning of the year. On the M&A front, there have already been a number of significant acquisitions in the space, notably the

Read More »

“Data-efficient and Generalizable: The Domain-specific Small Vision Model Revolution,” a Presentation from Pixel Scientia Labs

Heather Couture, Founder and Computer Vision Consultant at Pixel Scientia Labs, presents the “Data-efficient and Generalizable: The Domain-specific Small Vision Model Revolution” tutorial at the May 2024 Embedded Vision Summit. Large vision models (LVMs) trained on a large and diverse set of imagery are revitalizing computer vision, just as LLMs… “Data-efficient and Generalizable: The Domain-specific

Read More »

Accelerating LLMs with llama.cpp on NVIDIA RTX Systems

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. The NVIDIA RTX AI for Windows PCs platform offers a thriving ecosystem of thousands of open-source models for application developers to leverage and integrate into Windows applications. Notably, llama.cpp is one popular tool, with over 65K GitHub

Read More »

“Bridging Vision and Language: Designing, Training and Deploying Multimodal Large Language Models,” a Presentation from Meta Reality Labs

Adel Ahmadyan, Staff Engineer at Meta Reality Labs, presents the “Bridging Vision and Language: Designing, Training and Deploying Multimodal Large Language Models” tutorial at the May 2024 Embedded Vision Summit. In this talk, Ahmadyan explores the use of multimodal large language models in real-world edge applications. He begins by explaining… “Bridging Vision and Language: Designing,

Read More »

Deploying Accelerated Llama 3.2 from the Edge to the Cloud

This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Expanding the open-source Meta Llama collection of models, the Llama 3.2 collection includes vision language models (VLMs), small language models (SLMs), and an updated Llama Guard model with support for vision. When paired with the NVIDIA accelerated

Read More »

Here you’ll find a wealth of practical technical insights and expert advice to help you bring AI and visual intelligence into your products without flying blind.

Contact

Address

Berkeley Design Technology, Inc.
PO Box #4446
Walnut Creek, CA 94596

Phone
Phone: +1 (925) 954-1411
Scroll to Top