István Fehérvári, Director of Data and ML at BenchSci, presents the “Unveiling the Power of Multimodal Large Language Models: Revolutionizing Perceptual AI” tutorial at the May 2024 Embedded Vision Summit.
Multimodal large language models represent a transformative breakthrough in artificial intelligence, blending the power of natural language processing with visual understanding. In this talk, Fehérvári delves into the essence of these models. He begins by explaining how large language models (LLMs) work at a fundamental level. He then explores how LLMs have evolved to integrate visual understanding, explains how these models bridge the language and vision domains, and shows how they are trained.
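The language-vision bridge mentioned above can be sketched in a few lines. The following is a minimal illustration, not the talk's actual method: it assumes a LLaVA-style design in which a linear adapter projects vision-encoder patch features into the language model's token embedding space, so image "tokens" can be concatenated with text tokens and consumed by the LLM as one sequence. All dimensions, names, and weights here are hypothetical toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a vision encoder emitting 16 patch features of
# size 512, projected into a language model's 768-dim token embedding space.
NUM_PATCHES, VISION_DIM, TEXT_DIM = 16, 512, 768


def project_image_features(patch_features, w, b):
    """Linear projection mapping vision features into the LLM's token
    embedding space (the kind of adapter used in LLaVA-style models)."""
    return patch_features @ w + b


# Toy inputs standing in for real encoder outputs and learned weights.
patches = rng.standard_normal((NUM_PATCHES, VISION_DIM))
w = rng.standard_normal((VISION_DIM, TEXT_DIM)) * 0.02
b = np.zeros(TEXT_DIM)

image_tokens = project_image_features(patches, w, b)

# Text tokens (e.g. an embedded prompt) from the LLM's own embedding table.
text_tokens = rng.standard_normal((8, TEXT_DIM))

# The multimodal input: image tokens prepended to text tokens, then
# processed by the language model like any other token sequence.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)  # (24, 768)
```

During training, only this small adapter (and later the LLM) is updated, while the vision encoder typically stays frozen, which is part of what makes such models practical to build on top of existing LLMs.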
Next, Fehérvári examines the current landscape of multimodal LLMs, including open solutions such as LLaVA and BLIP. Finally, he explores the applications that deploying these large models at the edge will enable, identifies the key challenges standing in the way of edge deployment and highlights what is needed to address them.
See here for a PDF of the slides.