This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm.
In the rapidly evolving world of generative artificial intelligence (GenAI), the focus has traditionally been on large, complex models that require significant computational resources. However, a new trend is emerging: the development and deployment of small, efficient and accurate models that run directly on devices rather than in the cloud. This shift toward on-device AI not only makes GenAI more accessible but also addresses critical concerns such as privacy, security, cost and safety.
In March 2023, OpenAI unveiled GPT-3.5 Turbo, a large language model (LLM) boasting 175 billion parameters and serving as the foundation for the original ChatGPT. Just over a year later, in April 2024, the Meta Llama 3 family made its debut, featuring an 8-billion-parameter model that is roughly 1/20th the size of GPT-3.5 Turbo but delivers comparable, if not superior, performance. This isn’t an isolated incident; it’s part of a broader trend of continuous improvement that is reshaping the AI landscape. For example, the Llama 3.3 70B model, released in December 2024, matches the performance of the Llama 3.1 405B model that came out in July 2024. In under 6 months, the model size has been reduced by a factor of nearly 6 while maintaining the same level of performance.
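The size ratios quoted above are rounded figures. As a quick sanity check, the short sketch below recomputes them from the parameter counts given in the text; the dictionary and the helper name `size_ratio` are illustrative assumptions, not part of any library.

```python
# Parameter counts in billions, taken from the figures quoted in the text.
PARAMS_B = {
    "GPT-3.5 Turbo": 175,
    "Llama 3 8B": 8,
    "Llama 3.1 405B": 405,
    "Llama 3.3 70B": 70,
}


def size_ratio(larger: str, smaller: str) -> float:
    """Return how many times more parameters `larger` has than `smaller`."""
    return PARAMS_B[larger] / PARAMS_B[smaller]


if __name__ == "__main__":
    pairs = [
        ("GPT-3.5 Turbo", "Llama 3 8B"),
        ("Llama 3.1 405B", "Llama 3.3 70B"),
    ]
    for larger, smaller in pairs:
        ratio = size_ratio(larger, smaller)
        print(f"{larger} is about {ratio:.1f}x the size of {smaller}")
```

The first ratio comes out slightly above 20 and the second slightly below 6, which is consistent with the rounded figures in the paragraph.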
Why does this matter? The decreasing size of AI models, coupled with their improving quality, opens a world of possibilities for on-device AI. Devices today are equipped with substantial computational power, making them capable of handling complex AI inference workloads in an energy-efficient and high-performance manner. This capability ensures that AI applications can operate seamlessly — supporting a wide variety of tasks such as text creation and summarization, photo and video editing, code generation, live translation, AI assistants and more.
A diverse ecosystem of models
The trend is evident across the broader ecosystem. Models like Meta’s Llama family, Google’s Gemma models, and Mistral’s Ministral models are all becoming smaller and more efficient. The quality of these models continues to improve, making them not just comparable but often superior to their larger predecessors.
This trend allows for the development of compelling use cases with small models on devices. Whether it’s personalized AI experiences, real-time language processing, or advanced image recognition, the capabilities of these models are expanding. By running AI locally, data remains within the device, addressing critical concerns about privacy and safety.
There’s another important factor to consider: cost. Running generative AI inference solely in the cloud is becoming expensive, making it difficult to scale. Third-party reports have highlighted this issue, indicating that the marginal cost of cloud-based inference is rising. Instead of relying solely on the cloud, a distributed approach in which inference runs across the network and on devices offers a much more cost-effective solution. In other words: train in the cloud, run inference on edge devices. This approach significantly reduces the marginal cost of running a large and diverse set of models.
Moreover, the progress of these smaller models is not just about size and cost. They are also becoming more sophisticated. Longer context lengths allow for more thoughtful and nuanced responses, enhancing the user experience. Additionally, the number of modalities is expanding beyond just text. These models can now handle voice, images and video, as well as data from radar, lidar and infrared sensors, and they can use agentic AI orchestration to tap into multiple models and a personal knowledge graph for a far more personalized and multi-faceted experience.
Safety is a critical component in the development of AI, which is why Qualcomm Technologies has been actively leading and contributing within the AI engineering consortium MLCommons. Our contribution to the newly released AILuminate v1.0 benchmark marks a significant step forward in assessing the safety of general-purpose AI chat models.
The importance of safe AI interactions
As AI models, particularly those involved in text-to-text interactions, become more ubiquitous in devices, the potential for misuse or harmful interactions increases. These models can inadvertently promote or engage in dangerous behaviors. MLCommons’ AILuminate v1.0 benchmark is specifically designed to assess the safety of general-purpose AI chat models, focusing on their responses to user prompts that may reflect malicious intent or come from vulnerable users.
The AILuminate benchmark evaluates AI models on their ability to handle a wide range of hazards, including physical hazards like violent crimes, non-physical hazards such as defamation and privacy violations, and contextual hazards like unqualified advice. The goal is to ensure that AI models do not perpetuate or escalate these risks when interacting with users.
According to MLCommons’ initial results, which covered an assortment of models ranging from large, cutting-edge models to smaller, efficient ones, some smaller models performed very well, scoring “Good” to “Very Good” in many instances.
Qualcomm Technologies is a founding member of MLCommons and contributes significantly to the AI Risk and Reliability working group through group leadership and through technical and financial participation. Our involvement with MLCommons underscores our commitment to responsible development of AI.
The release of the AILuminate v1.0 benchmark is a significant step towards guiding the development of AI models in a safer and more beneficial way for all users. This work sets a standard for the industry, promoting a future where AI enhances our lives without compromising on responsibility.
Durga Malladi
SVP and GM, Technology Planning and Edge Solutions, Qualcomm Technologies