LLMs and MLLMs
The past decade-plus has seen incredible progress in practical computer vision. Thanks to deep learning, computer vision is dramatically more robust and accessible, and has enabled compelling capabilities in thousands of applications, from automotive safety to healthcare. But today’s widely used deep learning techniques suffer from serious limitations. Often, they struggle when confronted with ambiguity (e.g., are those people fighting or dancing?) or with challenging imaging conditions (e.g., is that shadow in the fog a person or a shrub?). And, for many product developers, computer vision remains out of reach due to the cost and complexity of obtaining the necessary training data, or due to lack of necessary technical skills.
Recent advances in large language models (LLMs) and variants such as vision language models (VLMs), which comprehend both images and text, hold the key to overcoming these challenges. VLMs are an example of multimodal large language models (MLLMs), which integrate multiple data modalities such as language, images, audio, and video to enable complex cross-modal understanding and generation tasks. MLLMs represent a significant evolution in AI, combining the capabilities of LLMs with multimodal processing to handle diverse inputs and outputs.
The purpose of this portal is to facilitate awareness of, and education regarding, the challenges and opportunities in using LLMs, VLMs, and other types of MLLMs in practical applications — especially applications involving edge AI and machine perception. The content that follows (which is updated regularly) discusses these topics. As a starting point, we encourage you to watch the recording of the symposium “Your Next Computer Vision Model Might be an LLM: Generative AI and the Move From Large Language Models to Vision Language Models”, sponsored by the Edge AI and Vision Alliance. A preview video of the symposium introduction by Jeff Bier, Founder of the Alliance, follows:
If there are topics related to LLMs, VLMs or other types of MLLMs that you’d like to learn about and don’t find covered below, please email us at [email protected] and we’ll consider adding content on these topics in the future.
View all LLM and MLLM Content
BrainChip Demonstration of LLM-RAG with a Custom Trained TENNs Model
Kurt Manninen, Senior Solutions Architect at BrainChip, demonstrates the company’s latest edge AI and vision technologies and products at the September 2024 Edge AI and Vision Alliance Forum. Specifically, Manninen demonstrates his company’s Temporal Event-Based Neural Network (TENN) foundational large language model with 330M parameters, augmented with Retrieval-Augmented Generation (RAG) output to replace user
How AI and Smart Glasses Give You a New Perspective on Real Life
This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. When smart glasses are paired with generative artificial intelligence, they become the ideal way to interact with your digital assistant. They may be shades, but smart glasses are poised to give you a clearer view of everything
Using Generative AI to Enable Robots to Reason and Act with ReMEmbR
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Vision-language models (VLMs) combine the powerful language understanding of foundational LLMs with the vision capabilities of vision transformers (ViTs) by projecting text and images into the same embedding space. They can take unstructured multimodal data, reason over
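The idea of projecting text and images into the same embedding space, as described above, can be illustrated with a minimal sketch. The dimensions, random projection matrices, and feature vectors below are all hypothetical stand-ins for a trained ViT image encoder and LLM text encoder; a real VLM learns these projections during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes: the image encoder (ViT) and text encoder
# emit features of different widths; learned projection matrices map
# both into one shared d-dimensional embedding space.
D_IMG, D_TXT, D_SHARED = 768, 1024, 512
W_img = rng.standard_normal((D_IMG, D_SHARED)) / np.sqrt(D_IMG)
W_txt = rng.standard_normal((D_TXT, D_SHARED)) / np.sqrt(D_TXT)

def embed(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project encoder features into the shared space and L2-normalize."""
    z = features @ W
    return z / np.linalg.norm(z)

image_feat = rng.standard_normal(D_IMG)  # stand-in for ViT output
text_feat = rng.standard_normal(D_TXT)   # stand-in for text-encoder output

z_img = embed(image_feat, W_img)
z_txt = embed(text_feat, W_txt)

# Because both embeddings live in the same space, cross-modal similarity
# is simply a dot product (cosine similarity after normalization).
similarity = float(z_img @ z_txt)
```

Once image and text occupy one space, tasks like retrieval or zero-shot classification reduce to comparing these vectors.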
“Entering the Era of Multimodal Perception,” a Presentation from Connected Vision Advisors
Simon Morris, Serial Tech Entrepreneur and Start-Up Advisor at Connected Vision Advisors, presents the “Entering the Era of Multimodal Perception” tutorial at the May 2024 Embedded Vision Summit. Humans rely on multiple senses to quickly and accurately obtain the most important information we need. Similarly, developers have begun using multiple…
SiMa.ai Expands ONE Platform for Edge AI with MLSoC Modalix, a New Product Family for Generative AI
Industry’s first multi-modal, software-centric edge AI platform supports any edge AI model from CNNs to multi-modal GenAI and everything in between with scalable performance per watt
SAN JOSE, Calif.–(BUSINESS WIRE)–SiMa.ai, the software-centric, embedded edge machine learning system-on-chip (MLSoC) company, today announced MLSoC™ Modalix, the industry’s first multi-modal edge AI product family. SiMa.ai MLSoC Modalix supports
“Unveiling the Power of Multimodal Large Language Models: Revolutionizing Perceptual AI,” a Presentation from BenchSci
István Fehérvári, Director of Data and ML at BenchSci, presents the “Unveiling the Power of Multimodal Large Language Models: Revolutionizing Perceptual AI” tutorial at the May 2024 Embedded Vision Summit. Multimodal large language models represent a transformative breakthrough in artificial intelligence, blending the power of natural language processing with visual…
Multimodal AI is Having Its Moment In the Sun. Here’s Why It’s So Important
This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. Multimodal AI takes in different inputs like text, images or video, allowing digital assistants to better understand the world and you, and gets supercharged when it’s able to run on your device. As smart as generative artificial
“Multimodal LLMs at the Edge: Are We There Yet?,” An Embedded Vision Summit Expert Panel Discussion
Sally Ward-Foxton, Senior Reporter at EE Times, moderates the “Multimodal LLMs at the Edge: Are We There Yet?” Expert Panel at the May 2024 Embedded Vision Summit. Other panelists include Adel Ahmadyan, Staff Engineer at Meta Reality Labs, Jilei Hou, Vice President of Engineering and Head of AI Research at…
May 2024 Embedded Vision Summit Opening Remarks (May 23)
Jeff Bier, Founder of the Edge AI and Vision Alliance, welcomes attendees to the May 2024 Embedded Vision Summit on May 23, 2024. Bier provides an overview of the edge AI and vision market opportunities, challenges, solutions and trends. He also introduces the Edge AI and Vision Alliance and the…
May 2024 Embedded Vision Summit Opening Remarks (May 22)
Jeff Bier, Founder of the Edge AI and Vision Alliance, welcomes attendees to the May 2024 Embedded Vision Summit on May 22, 2024. Bier provides an overview of the edge AI and vision market opportunities, challenges, solutions and trends. He also introduces the Edge AI and Vision Alliance and the…
“Understand the Multimodal World with Minimal Supervision,” a Keynote Presentation from Yong Jae Lee
Yong Jae Lee, Associate Professor in the Department of Computer Sciences at the University of Wisconsin-Madison and CEO of GivernyAI, presents the “Learning to Understand Our Multimodal World with Minimal Supervision” tutorial at the May 2024 Embedded Vision Summit. The field of computer vision is undergoing another profound change. Recently,…
Snapdragon Powers the Future of AI in Smart Glasses. Here’s How
This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. A Snapdragon Insider chats with Qualcomm Technologies’ Said Bakadir about the future of smart glasses and Qualcomm Technologies’ role in turning it into a critical AI tool. Artificial intelligence (AI) is increasingly winding its way through our
Build VLM-powered Visual AI Agents Using NVIDIA NIM and NVIDIA VIA Microservices
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Traditional video analytics applications and their development workflow are typically built on fixed-function, limited models that are designed to detect and identify only a select set of predefined objects. With generative AI, NVIDIA NIM microservices, and foundation
Quantization: Unlocking Scalability for Large Language Models
This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. Find out how LLM quantization solves the challenges of making AI work on device. In the rapidly evolving world of artificial intelligence (AI), the growth of large language models (LLMs) has been nothing short of astounding. These
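The core idea behind the quantization discussed above can be sketched in a few lines. This is a simplified illustration of symmetric per-tensor int8 weight quantization on a toy tensor, not Qualcomm’s actual on-device pipeline; production toolchains use more sophisticated schemes (per-channel scales, calibration, quantization-aware training).

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: store weights as int8
    plus a single float scale, cutting memory ~4x vs. float32."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # toy weight tensor

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

mem_ratio = w.nbytes / q.nbytes           # float32 -> int8: 4x smaller
max_err = float(np.abs(w - w_hat).max())  # rounding error <= scale / 2
```

The memory saving is what makes multi-billion-parameter models feasible on edge devices, at the cost of a small, bounded rounding error per weight.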
Nota AI Demonstration of Elevating Traffic Safety with Vision Language Models
Tae-Ho Kim, CTO and Co-founder of Nota AI, demonstrates the company’s latest edge AI and vision technologies and products at the 2024 Embedded Vision Summit. Specifically, Kim demonstrates his company’s Vision Language Model (VLM) solution, designed to elevate vehicle safety. Advanced models analyze and interpret visual data to prevent accidents and enhance driving experiences. The
Develop Generative AI-powered Visual AI Agents for the Edge
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. An exciting breakthrough in AI technology—Vision Language Models (VLMs)—offers a more dynamic and flexible method for video analysis. VLMs enable users to interact with image and video input using natural language, making the technology more accessible and