Sally Ward-Foxton, Senior Reporter at EE Times, moderates the “Multimodal LLMs at the Edge: Are We There Yet?” Expert Panel at the May 2024 Embedded Vision Summit. Other panelists include Adel Ahmadyan, Staff Engineer at Meta Reality Labs; Jilei Hou, Vice President of Engineering and Head of AI Research at Qualcomm Technologies; Pete Warden, CEO of Useful Sensors; and Yong Jae Lee, Associate Professor in the Department of Computer Sciences at the University of Wisconsin-Madison and CEO of GivernyAI.
Large language models (LLMs) are fueling a revolution in AI. While chatbots are the most visible manifestation of LLMs, multimodal LLMs for visual perception, such as vision-language models like LLaVA that understand both text and images, may ultimately have greater impact, given that so many AI use cases require an understanding of both language concepts and visual data rather than language alone.
To what extent—and how quickly—will multimodal LLMs change how we do computer vision and other types of machine perception? Are they needed for real-world applications, or are they a solution looking for a problem? If they are needed, are they needed at the edge? What will be the main challenges in running them there? Is it the nature of the computation, the amount of computation, memory bandwidth, ease of development, or some other factor? Is today’s edge hardware up to the task? If not, what will it take to get there?
This lively and insightful panel discussion answers these and many other questions around the rapidly evolving role of multimodal LLMs in machine perception applications. The panelists have firsthand experience with these models and the challenges associated with implementing them at the edge.