LLMs and MLLMs
The past decade-plus has seen incredible progress in practical computer vision. Thanks to deep learning, computer vision is dramatically more robust and accessible, and has enabled compelling capabilities in thousands of applications, from automotive safety to healthcare. But today’s widely used deep learning techniques suffer from serious limitations. Often, they struggle when confronted with ambiguity (e.g., are those people fighting or dancing?) or with challenging imaging conditions (e.g., is that shadow in the fog a person or a shrub?). And for many product developers, computer vision remains out of reach due to the cost and complexity of obtaining the necessary training data, or due to a lack of the necessary technical skills.
Recent advances in large language models (LLMs) and their variants, such as vision language models (VLMs, which comprehend both images and text), hold the key to overcoming these challenges. VLMs are an example of multimodal large language models (MLLMs), which integrate multiple data modalities such as language, images, audio, and video to enable complex cross-modal understanding and generation tasks. MLLMs represent a significant evolution in AI by combining the capabilities of LLMs with multimodal processing to handle diverse inputs and outputs.
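To make the cross-modal idea concrete, here is a minimal sketch of visual question answering using the Hugging Face transformers library and the openly available LLaVA 1.5 checkpoint; the image file, question, and generation settings are illustrative assumptions rather than a recommended configuration:

```python
# A minimal visual question answering sketch: pose a language question
# about an image and get a free-form answer from a vision language model.
# Assumes the transformers library; "street_scene.jpg" is a placeholder.
from transformers import pipeline

vlm = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")

# LLaVA 1.5 uses a USER/ASSISTANT chat template with an <image> token.
prompt = "USER: <image>\nAre the people in this scene fighting or dancing?\nASSISTANT:"

result = vlm("street_scene.jpg", prompt=prompt,
             generate_kwargs={"max_new_tokens": 60})
print(result[0]["generated_text"])
```

The same prompt-plus-pixels pattern generalizes: an MLLM can interleave text with images (and, in some models, audio or video) in a single input and reason over all of the modalities jointly.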
The purpose of this portal is to facilitate awareness of, and education regarding, the challenges and opportunities in using LLMs, VLMs, and other types of MLLMs in practical applications — especially applications involving edge AI and machine perception. The content that follows (which is updated regularly) discusses these topics. As a starting point, we encourage you to watch the recording of the symposium “Your Next Computer Vision Model Might be an LLM: Generative AI and the Move From Large Language Models to Vision Language Models”, sponsored by the Edge AI and Vision Alliance. A preview video of the symposium introduction by Jeff Bier, Founder of the Alliance, follows:
If there are topics related to LLMs, VLMs or other types of MLLMs that you’d like to learn about and don’t find covered below, please email us at [email protected] and we’ll consider adding content on these topics in the future.
View all LLM and MLLM Content
“Vision Language Models for Regulatory Compliance, Quality Control and Safety Applications,” a Presentation from Camio
Carter Maslan, CEO of Camio, presents the “Vision Language Models for Regulatory Compliance, Quality Control and Safety Applications” tutorial at the December 2024 Edge AI and Vision Innovation Forum. In this presentation, you’ll learn how vision language models interpret policy text to enable much more sophisticated understanding of scenes and human behavior compared with current-generation…
Snapdragon Summit’s AI Highlights: A Look at the Future of On-device AI
This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. Qualcomm Technologies sets new standards in AI performance for its latest mobile, automotive and Qualcomm AI Hub advancements. Our annual Snapdragon Summit wrapped up with exciting new announcements centered on the future of on-device artificial intelligence (AI).
How to Accelerate Larger LLMs Locally on RTX With LM Studio
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. GPU offloading makes massive models accessible on local RTX AI PCs and workstations. Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware…
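As background on the offloading technique the post describes: llama.cpp (the inference engine that LM Studio builds on) lets you place a chosen number of transformer layers on the GPU while the remainder run on the CPU, so models larger than available VRAM can still run locally. A minimal sketch using the llama-cpp-python bindings, with a placeholder model path and an arbitrary layer count:

```python
# Partial GPU offload with llama-cpp-python: n_gpu_layers controls how many
# of the model's layers live on the GPU; the rest execute on the CPU.
# The GGUF path below is a placeholder for any locally downloaded model.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder
    n_gpu_layers=20,  # number of layers resident on the GPU
    n_ctx=4096,       # context window size in tokens
)

out = llm("Explain GPU offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The usual tuning approach is to raise n_gpu_layers until GPU memory is nearly exhausted; any layers that do not fit simply fall back to CPU execution, trading speed for capacity.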
Introducing the First AMD 1B Language Models: AMD OLMo
This blog post was originally published at AMD’s website. It is reprinted here with the permission of AMD. In recent years, the rapid development of artificial intelligence technology, especially the progress in large language models (LLMs), has garnered significant attention and discussion. From the emergence of ChatGPT to subsequent models like GPT-4 and Llama, these…
Give AI a Look: Any Industry Can Now Search and Summarize Vast Volumes of Visual Data
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Accenture, Dell Technologies and Lenovo are among the companies tapping a new NVIDIA AI Blueprint to develop visual AI agents that can boost productivity, optimize processes and create safer spaces. Enterprises and public sector organizations around the…
“How Large Language Models Are Impacting Computer Vision,” a Presentation from Voxel51
Jacob Marks, Senior ML Engineer and Researcher at Voxel51, presents the “How Large Language Models Are Impacting Computer Vision” tutorial at the May 2024 Embedded Vision Summit. Large language models (LLMs) are revolutionizing the way we interact with computers and the world around us. However, in order to truly understand…
Qualcomm and Mistral AI Partner to Bring New Generative AI Models to Edge Devices
Highlights: Qualcomm announces collaboration with Mistral AI to bring Mistral AI’s models to devices powered by Snapdragon and Qualcomm platforms. Mistral AI’s new state-of-the-art models, Ministral 3B and Ministral 8B, are being optimized to run on devices powered by the new Snapdragon 8 Elite Mobile Platform, Snapdragon Cockpit Elite and Snapdragon Ride Elite, and Snapdragon…
Qualcomm Announces Multi-year Strategic Collaboration with Google to Deliver Generative AI Digital Cockpit Solutions
Highlights: Qualcomm and Google will leverage Snapdragon Digital Chassis and Google’s in-vehicle technologies to produce a standardized reference framework for development of generative AI-enabled digital cockpits and software-defined vehicles (SDV). Qualcomm to lead go-to-market efforts for scaling and customization of joint solution with the broader automotive ecosystem. Companies’ collaboration demonstrates power of co-innovation, empowering automakers…
“Using Vision Systems, Generative Models and Reinforcement Learning for Sports Analytics,” a Presentation from Sportlogiq
Mehrsan Javan, Chief Technology Officer at Sportlogiq, presents the “Using Vision Systems, Generative Models and Reinforcement Learning for Sports Analytics” tutorial at the May 2024 Embedded Vision Summit. At a high level, sport analytics systems can be broken into two components: sensory data collection and analytical models that turn sensory…
Exploring the Next Frontier of AI: Multimodal Systems and Real-time Interaction
This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. Discover the state of the art in large multimodal models with Qualcomm AI Research. In the realm of artificial intelligence (AI), the integration of senses — seeing, hearing and interacting — represents a frontier that is rapidly…
Smart Glasses for the Consumer Market
There are currently about 250 companies in the head-mounted wearables category, and these companies in aggregate have received over $5B in funding. $700M has been invested in this category just since the beginning of the year. On the M&A front, there have already been a number of significant acquisitions in the space, notably the…
“Data-efficient and Generalizable: The Domain-specific Small Vision Model Revolution,” a Presentation from Pixel Scientia Labs
Heather Couture, Founder and Computer Vision Consultant at Pixel Scientia Labs, presents the “Data-efficient and Generalizable: The Domain-specific Small Vision Model Revolution” tutorial at the May 2024 Embedded Vision Summit. Large vision models (LVMs) trained on a large and diverse set of imagery are revitalizing computer vision, just as LLMs…
Accelerating LLMs with llama.cpp on NVIDIA RTX Systems
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. The NVIDIA RTX AI for Windows PCs platform offers a thriving ecosystem of thousands of open-source models for application developers to leverage and integrate into Windows applications. Notably, llama.cpp is one popular tool, with over 65K GitHub stars.
“Bridging Vision and Language: Designing, Training and Deploying Multimodal Large Language Models,” a Presentation from Meta Reality Labs
Adel Ahmadyan, Staff Engineer at Meta Reality Labs, presents the “Bridging Vision and Language: Designing, Training and Deploying Multimodal Large Language Models” tutorial at the May 2024 Embedded Vision Summit. In this talk, Ahmadyan explores the use of multimodal large language models in real-world edge applications. He begins by explaining…
Qualcomm Partners with Meta to Support Llama 3.2. Why This is a Big Deal for On-device AI
This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm. On-device artificial intelligence (AI) is critical to making your everyday AI experiences fast and security-rich. That’s why it’s such a win that Qualcomm Technologies and Meta have worked together to support the Llama 3.2 large language models (LLMs)…
Deploying Accelerated Llama 3.2 from the Edge to the Cloud
This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA. Expanding the open-source Meta Llama collection of models, the Llama 3.2 collection includes vision language models (VLMs), small language models (SLMs), and an updated Llama Guard model with support for vision. When paired with the NVIDIA accelerated…