This blog post was originally published in the late May 2017 edition of BDTI's InsideDSP newsletter. It is reprinted here with the permission of BDTI.
Remember when mobile phones were for making phone calls? Given today’s reality, it can be difficult to recall the time, not so long ago, when that was their sole purpose. Today, the situation is very different; most people use their phones mainly for sending texts, reading email and news, social networking, navigating, shopping and watching videos. And maybe, rarely, making a phone call.
Video cameras are on a similar path: soon, most video cameras will not actually record or transmit video. That’s the bold prediction from Michael Tusch, former CEO of Apical (a mobile imaging company acquired last year by ARM), in his fascinating presentation at the recent Embedded Vision Summit.
Note that Michael is not referring to camcorders (which for consumer use have largely been displaced by mobile phones), but rather to “IP cameras” – that is, network-connected video cameras like the Nest Cam that are typically used for security and monitoring applications.
There are several compelling arguments in favor of Michael’s view. First, Michael points out that it’s simply not practical to transmit or store the video from all of these cameras. Let’s consider only the 120 million IP cameras placed into service just in 2015. If all of these cameras were connected to the Internet and operated around the clock like the Nest Cam, they would generate 400 exabytes (4 × 10²⁰ bytes) of data per month – roughly four times the volume of all Internet traffic today. Storing the video from these cameras would require a capacity of about 3,000 times the current size of YouTube. And keep in mind, these are just the cameras placed into service in 2015. We can expect many more to be deployed, considering that a name-brand IP camera can now be purchased for less than $50.
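For the curious, that figure survives a quick back-of-the-envelope check. Here’s a minimal sketch in Python, assuming an average per-camera bitrate of about 10 Mbit/s (my assumption, roughly a 1080p H.264 stream; it’s not a number from the talk):

```python
# Back-of-the-envelope check on the 400-exabytes-per-month figure.
# The ~10 Mbit/s per-camera bitrate is an assumption (roughly a
# 1080p H.264 stream), not a number from the presentation.

CAMERAS = 120e6               # IP cameras placed into service in 2015
BITRATE_BITS_PER_S = 10e6     # assumed average stream bitrate
SECONDS_PER_MONTH = 30 * 24 * 3600

total_bytes = CAMERAS * (BITRATE_BITS_PER_S / 8) * SECONDS_PER_MONTH
print(f"{total_bytes:.1e} bytes/month")  # ~3.9e20 bytes, i.e. ~400 exabytes
```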
Even if we were able to find a way to economically transmit and store all of this video, it’s clear that the vast majority of it would never be seen by human eyes; there simply aren’t enough human eyes to watch it. In another insightful presentation, Chris Rowen, founder of Tensilica and now CEO of Cognite Ventures, points out that by 2015, the number of deployed image sensors exceeded the world population – and cameras are multiplying much faster than people these days.
Finally, even if there were enough people available to watch all of the video from these cameras, employing them to do so would not be effective or economical. (It also might be considered a form of torture, but that’s just my opinion.) Not to mention the significant privacy concerns with such an approach.
The solution? Rather than transmitting and storing video for human consumption, cameras should process video locally with algorithms, passing along only the results of that analysis, whether for people to review or for machines to act on.
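To make the idea concrete, here’s a minimal sketch of that architecture, using simple frame differencing as a stand-in for a real vision algorithm. It assumes OpenCV is installed; send_event() and the event format are hypothetical placeholders for whatever a camera’s backend would actually expect:

```python
# Analyze video locally; transmit only a small metadata event when
# something changes, instead of streaming the video itself.
import json
import time
import cv2

def send_event(event):
    # Placeholder: a real camera would push this to a server or message queue.
    print(json.dumps(event))

cap = cv2.VideoCapture(0)   # the camera's local sensor
prev = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if prev is not None:
        diff = cv2.absdiff(gray, prev)
        _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        changed = cv2.countNonZero(mask)
        if changed > 5000:  # arbitrary motion threshold
            # A few dozen bytes of metadata instead of megabits of video.
            send_event({"time": time.time(), "changed_pixels": int(changed)})
    prev = gray
```

Each event here is a few dozen bytes, versus the megabits per second a raw or compressed video stream would consume.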
In addition to IP cameras, image sensors are being deployed in many other applications where this is the only sensible approach. Take cars, for example. According to market research firm Yole Développement, “From less than one camera per car on average in 2015, there will be more than three cameras per car by 2021.” Except for certain very specific use cases (such as parking), clearly we can’t have drivers watching video streams from cameras.
Fortunately, at the same time that image sensors and cameras are proliferating by the billions, our ability to process video locally to extract relevant information is advancing rapidly. This is due mainly to improvements in algorithms and processors.
Thanks to deep neural networks, the accuracy of algorithms has improved dramatically for a wide range of visual perception tasks, from face recognition to lip reading. And through the efforts of university researchers and engineers at dozens of companies, specialized vision and neural network processors are becoming available – and affordable. These processors are able to deliver the massive computational performance required for sophisticated vision algorithms while fitting into the tight cost, power and size budgets of devices like IP cameras.
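As a small illustration of the kind of compact network suited to such devices, here’s a sketch of classifying a single captured frame with a pretrained MobileNetV2 in TensorFlow. The model choice is mine for illustration, and “frame.jpg” is a placeholder for a frame grabbed from the camera:

```python
# Classify one frame with a compact, edge-friendly pretrained network.
# Assumes TensorFlow is installed; weights download on first use.
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights="imagenet")

img = tf.keras.preprocessing.image.load_img("frame.jpg", target_size=(224, 224))
x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis, ...]
x = tf.keras.applications.mobilenet_v2.preprocess_input(x)

preds = model.predict(x)
for _, label, score in tf.keras.applications.mobilenet_v2.decode_predictions(preds, top=3)[0]:
    print(label, round(float(score), 3))
```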
These rapid advances in robust algorithms and efficient processors mean that we can process video close to the image sensor, transmitting or storing only the key information extracted from the video – good news, since we don’t have a practical way to transmit or store the video itself. But more important is the value that visual perception is bringing to edge devices. Enabling machines to see allows them to be safer, more autonomous, easier to use, and more efficient. And as a bonus, processing video at the edge can enhance privacy.
So I think Michael Tusch is right: by 2030, most video cameras will not stream video – just like most phones aren’t really used to make phone calls anymore. What an amazing era we live in!
If you’re developing vision algorithms or applications, check out the new full-day, hands-on training class, “Deep Learning for Computer Vision with TensorFlow,” presented by the Embedded Vision Alliance in Santa Clara, California on July 13th, and in Hamburg, Germany on September 7th. For details, visit the event web page.
Jeff Bier
Co-Founder and President, BDTI
Founder, Embedded Vision Alliance