This blog post was originally published by GMAC Intelligence. It is reprinted here with the permission of GMAC Intelligence.
Humans rely mainly on speech, vision, and touch to operate efficiently and effectively in the physical world. We also rely on smell and taste for our activities and survival, but for most of us the utility of those senses is limited to a few basic, repetitive, and intermittent tasks. Touch, or the ability to manipulate the physical world around us, needs stimulus from the other senses, such as vision and speech.
Vision and speech are independent and dynamic: we are both producers and consumers of speech (text or audio) and vision signals. Yet our capacity to understand these signals is limited, mainly in two ways: limited skills and limited attention. For example, most of us do not understand all languages (whether for speaking/hearing or reading/writing), and the fovea of our eyes gives us only a narrow field of sharp vision, so we can focus our attention on just a thin sliver of the reality around us. Our skills further limit our understanding of the small fraction we are able to pay attention to through our eyes and ears. We usually acquire new skills or perform new tasks by interacting with our surroundings through vision and speech: there is no other way today.
As the complexity of our world and work increases, we need ever-increasing attention to the vision and speech signals around us, but attention is in limited supply due to our biology. So how do we deal with this situation, when we are born with a fixed amount of attention? Here’s one way: tasks for which we have acquired a skill or developed an understanding can be offloaded to an assistant. The assistant pays attention to or infers from those speech and vision signals and performs the tasks we were supposed to do, e.g., driving cars, cooking food, watching over our house and assets, or managing the reception area of our business. The assistant may also seek our advice (a prompt) to achieve our objective, e.g., where to go, what to eat, etc. These assistants can be digital and can also help with more complex tasks, e.g., coding, preparing a PowerPoint slide on a topic, or generating new art or music based on a prompt we provide. By offloading in this way, we can focus our attention on more rewarding or productive activities.
These digital assistants need energy to pay attention to and infer from speech and vision signals, which corresponds to opex, and they also need to be embodied, either as a robot or a computer, which corresponds to capex. Both of these costs manifest as $/inference to the person being assisted. As a customer of this digital inference service, it makes sense to go for the lowest $/inference, because that is usually also the lowest energy/inference and the greener choice.
For the sake of this discussion, let’s assume a 10-second clip of HD video (capturing both vision and speech) that has to be processed by this digital assistant (e.g., one that manages a reception area and handles customers). This is roughly 10 MB of data. There are currently three ways to process this signal: on the cloud, on the edge, or hybrid cloud + edge. If these 10-second clips are already stored and available in the cloud (e.g., as text or video), then processing on the edge does not make sense. So, for the purposes of this comparison, we will only consider signals that are created in real time in our physical environment. To create this digitized vision or speech signal, some edge sensor hardware is needed in all of the above options. The additional cost of intelligence compute on the edge to process small- to medium-complexity tasks has come down to a few dollars. If all vision and speech is processed on the edge, we incur no costs for transporting data to the cloud, storing it there, or processing it there. The typical cost of transporting, storing (~2 years), and processing (decoding the video/audio + AI processing) this 10-second clip of compressed audio/video is about $0.01 ($0.005 + $0.0025 + $0.0025).
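The snippet below is a minimal sketch of that per-clip arithmetic, using only the illustrative unit costs quoted above; actual cloud pricing will vary by provider and region.

```python
# Back-of-the-envelope cloud cost for one 10-second HD clip (~10 MB),
# using the illustrative unit costs quoted above (actual prices vary).
CLIP_SECONDS = 10
TRANSPORT = 0.005     # $ to move the ~10 MB clip to the cloud
STORAGE = 0.0025      # $ to store it for ~2 years
PROCESSING = 0.0025   # $ to decode the clip and run AI inference

cost_per_clip = TRANSPORT + STORAGE + PROCESSING
cost_per_second = cost_per_clip / CLIP_SECONDS

print(f"Cloud cost per 10 s clip: ${cost_per_clip:.4f}")    # $0.0100
print(f"Cloud cost per second:    ${cost_per_second:.4f}")  # $0.0010
```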
At that rate, your cloud costs come to ~$86 a day (24 hours = 86,400 seconds). Why bleed dollars continuously when you can process on edge hardware for a few hundred dollars of one-time cost? Over a 10-year period, that is roughly a 1000x saving ($315K vs. $315).
However, there is a caveat. These cloud cost calculations assume that you are using cloud intelligence for all 86,400 seconds of the day. If you need intelligence only intermittently, say less than about 1.5 minutes a day, then cloud costs come in below the edge hardware investment.
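Here is a rough sketch of that break-even point, again using only the figures from this post ($0.01 per 10-second clip, a one-time edge device of roughly $315, and a 10-year horizon); treat it as an illustration rather than a pricing model.

```python
# Break-even between always-on cloud processing and a one-time edge device,
# using the post's figures: $0.01 per 10 s clip (= $0.001/s), ~$315 of
# edge hardware, and a 10-year horizon.
COST_PER_SECOND = 0.001
EDGE_HARDWARE = 315.0
DAYS = 365 * 10

cloud_per_day_continuous = COST_PER_SECOND * 86_400        # ~$86.40/day
cloud_10yr_continuous = cloud_per_day_continuous * DAYS    # ~$315K

# Daily seconds of cloud inference at which cloud spend matches the
# amortized edge hardware cost.
break_even_seconds = (EDGE_HARDWARE / DAYS) / COST_PER_SECOND

print(f"Continuous cloud over 10 years: ${cloud_10yr_continuous:,.0f}")
print(f"Break-even usage: {break_even_seconds / 60:.1f} minutes/day")  # ~1.4
```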
Decision makers should keep this 1.5-minute intelligence heuristic in mind when deciding between cloud and edge for AI processing of vision signals from a cost perspective. It’s also important to remember that emotional intelligence in digital assistants can only be unlocked via vision, so vision is a must-have component.
Another key takeaway is that if 10 seconds of speech is converted to text on the edge, the result is less than 1 KB, about 1/10,000th of the video data (10 MB). Cloud processing of this reduced text data to extract actionable intelligence is then very cost-effective, especially via LLMs, which are hard to deploy on the edge today. Thus, a hybrid edge + cloud solution for speech signals is more cost-effective without losing capability.
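As a quick illustration of that reduction, using only the payload sizes quoted above (~10 MB of video vs. under 1 KB of text):

```python
# Data reduction from on-edge speech-to-text, using the sizes quoted above.
VIDEO_BYTES = 10 * 1_000_000   # ~10 MB for a 10 s HD clip
TEXT_BYTES = 1_000             # <1 KB of transcribed text

reduction = VIDEO_BYTES / TEXT_BYTES
print(f"Payload reduction for the cloud hop: ~{reduction:,.0f}x")  # ~10,000x
```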
Vision AI and speech-to-text conversion on the edge, with language AI in the cloud, is a pragmatic “hybrid” solution for implementing intelligent digital assistants.
Other factors such as privacy, security, response time, and AI API costs are not considered here and are assumed to be equalized by proper design. Today’s mid-range Android smartphone, costing about $300, has sufficient AI compute to perform the tasks of a basic intelligent digital assistant.
Amit Mate
Founder and CEO, GMAC Intelligence