This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm.
Multimodal AI takes in different inputs like text, images or video, allowing digital assistants to better understand the world and you, and it gets supercharged when it’s able to run on your device.
As smart as generative artificial intelligence (AI) can be, its capabilities are limited by how well it understands everything around it. That’s where large multimodal models (LMMs) come in: they allow AI to analyze voice queries, text, images, videos and even radio frequency and sensor data to provide more accurate and relevant answers.
LMMs are a critical part of the evolution of generative AI, following the now-popular large language models (LLMs), such as the one behind the original version of ChatGPT, which were only able to handle text. That enhanced ability to understand what you see and hear will supercharge devices like your smartphone or PC, and make digital assistants and productivity apps much more useful. And being able to handle these operations on the device will make the process faster, more private and more power efficient.
Qualcomm Technologies is committed to enabling on-device multimodal AI. Back in February, we were the first to show off Large Language and Vision Assistant (LLaVA), a community-driven LMM with more than 7 billion parameters, running on an Android phone based on the Snapdragon 8 Gen 3 Mobile Platform. In this demo, the phone was able to “recognize” images, like a dog in an open landscape or a platter of fruits and vegetables, and engage in a conversation. A person could ask it to create a recipe using the items from the platter, and even to estimate the total calories in the recipe.
The future of AI is multimodal
This work is critical as the noise around multimodal AI has gotten louder. Last week, Microsoft introduced its Phi-3.5 family of models, which includes multilingual and visual support. That followed Google talking up LMMs at its Made by Google event, which included Gemini Nano, a model for multimodal inputs. In May, OpenAI introduced its own multimodal model with GPT-4 Omni. These announcements follow similar work from Meta and on community-developed models such as LLaVA.
Taken together, these advances shine a light on the path that AI is heading down, one which goes beyond you typing out questions at a prompt. We are committed to bringing these AI experiences to billions of handsets around the world.
Qualcomm Technologies’ efforts include work with a wide array of companies developing LMMs and LLMs, including Meta with its Llama series, and we are working with Google to enable the next generation of Gemini on Snapdragon. These models run smoothly on Snapdragon, and along with our partners, we are looking forward to delighting consumers with new on-device AI features throughout this year and next.
And while an Android phone is a natural starting point to take advantage of multimodal inputs, the benefits will quickly extend to other categories, from smart glasses able to scan what you eat and provide nutritional information, to cars able to understand your voice commands and assist you on the road.
Multimodal AI can tackle a lot of complex tasks for you
These are just the first steps for multimodal AI, which could help automobiles recognize bored passengers in the back during a road trip and suggest fun activities to pass the time, using a combination of cameras, microphones and vehicle sensors. It could also enable a pair of smart glasses to recognize gym equipment at a health club and create a customized workout plan for you.
The level of accuracy enabled by multimodal AI will be critical for assisting a field technician troubleshooting what is wrong with your appliances at home, or helping a farmer identify the cause of issues with a crop.
The idea is that these devices — starting with phones, PCs, cars and smart glasses — can take advantage of cameras, microphones and other sensors to let the AI assistant “see” and “hear” so it offers more useful contextual answers.
Importance of on-device AI
All those extra capabilities work better if the AI operations happen on the device, meaning your phone or car needs to be powerful enough to handle those requests. Keeping things on your phone means that trillions of operations have to run fast and efficiently, because the battery needs to last all day long. Doing things on the device means you don’t need to ping the cloud and wait for servers when they are too busy to respond. On-device operations are also more private: your questions and the answers stay with you and your device.
That’s been a priority for Qualcomm Technologies. Its Snapdragon 8 Gen 3 processor with its Hexagon NPU enables handsets to handle much of the processing on the phone itself. Likewise, more than 20 Copilot+ PCs on the market today can handle sophisticated AI features on the device thanks to the Snapdragon X Elite and Snapdragon X Plus Platforms.
And we are not standing still. The world of AI is evolving quickly, and your next best opportunity to see where it’s all going will be at Snapdragon Summit in October.