This blog post is an abridged version of one originally published at Cadence's website. It is reprinted here with the permission of Cadence.
I like to date technical transitions from specific events, even though realistically they take place over an extended period. For example, I think the modern era of IC design and EDA started with the publication of "Mead & Conway," which I wrote about in The Book that Changed Everything.
Today, the most important area of computer science, and of semiconductors too, is neural networks and artificial intelligence, where huge advances are being made almost daily. A decade ago, this was a sleepy backwater that had been studied for 50 years. Now it is a new paradigm of "programming" in which the system is trained rather than programmed algorithmically. The most visible area where this is being used is probably the drive(!) towards autonomous vehicles. However, neural networks are creeping into other, less visible areas, such as branch prediction in high-performance microprocessors, where they outperform traditional approaches.
For me, the watershed moment was Yann LeCun's keynote at the 2014 Embedded Vision Summit. He had a little handheld camera attached to his laptop on the podium, and he pointed it at things he had up there: the spacebar, a pen, a cup of coffee, his shoe, and so on. The neural network he was running identified what the camera was looking at, using the NVIDIA GPU in his laptop to power everything. I had never seen anything like it. Remember, this was not identifying static images: this was a low-quality camera, not held steady, pointed at real-world objects, identifying them in real time. I've seen similar demonstrations since. Indeed, at any trade show where Cadence is focusing on its Tensilica product line, such as the Consumer Electronics Show, we have a similar demonstration running some standard visual recognition algorithms, such as ResNet or Inception (running on a Tensilica processor, of course).
How did "standard vision algorithms" even come into existence?
The 2009 CVPR
The milestone event, in my mind, was a poster session at the 2009 CVPR, the Conference on Computer Vision and Pattern Recognition, by a group from the Princeton CS department. The undramatic paper was ImageNet: A Large-Scale Hierarchical Image Database by Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. I assume it was a poster session, rather than a full presentation, because it wasn't considered that important or ground-breaking. In some ways, it wasn't. But it changed everything.
ImageNet was and is a collection of annotated images. The images themselves are scattered all over the internet (lots on Instagram) and don't truly form part of ImageNet (the copyright on each image is owned by whoever put it on the net in the first place). ImageNet consists of the annotations and, in some cases, bounding boxes for the things of interest in the image. The identification was crowdsourced, much of it using Amazon's Mechanical Turk. Today there are over 14 million images. The annotations are basic, along the lines of "there is a cat in this image," and over 20,000 different categories are identified. One focus area is pictures of dogs, where the images are further identified by 120 different dog breeds ("there is a beagle in this image").
In fact, the classification is done using the WordNet hierarchy, which supplies some of the knowledge for free. So if a picture contains, say, a beagle, then ImageNet doesn't also need to explicitly identify that it contains a dog, since a beagle is already known to be a dog, and a dog is known to be a mammal.
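To make that structure concrete, here is a minimal sketch in Python of what an annotation record and the hierarchy lookup might look like. The field names, the toy hypernym table, and the expanded_labels helper are hypothetical illustrations, not the actual ImageNet metadata format (the real database keys everything off WordNet synset IDs rather than plain strings).

```python
# A hypothetical ImageNet-style annotation: a label plus an optional bounding box.
annotation = {
    "image_url": "http://example.com/some_photo.jpg",  # images live elsewhere on the net
    "label": "beagle",
    "bbox": (48, 32, 310, 275),  # x_min, y_min, x_max, y_max (not every image has one)
}

# A toy slice of a WordNet-style "is-a" hierarchy; the real chain has more levels.
HYPERNYMS = {
    "beagle": "dog",
    "dog": "mammal",
    "mammal": "animal",
}

def expanded_labels(label):
    """Return the label plus every ancestor implied by the hierarchy."""
    labels = [label]
    while label in HYPERNYMS:
        label = HYPERNYMS[label]
        labels.append(label)
    return labels

print(expanded_labels(annotation["label"]))  # ['beagle', 'dog', 'mammal', 'animal']
```

This is why the annotation only needs to say "beagle": the dog-ness and mammal-ness come along automatically from the hierarchy.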
ILSVRC
In 2010, the ILSVRC was launched, the ImageNet Large Scale Visual Recognition Challenge. Researchers competed to achieve the highest recognition accuracy on several tasks. It uses a subset of the whole database, with only 1,000 image categories, but including the dog breeds. In 2010, image recognition was algorithmically based, looking for features like eyes or whiskers. These approaches were not very good, and an error rate of 25% or more was normal. Then, starting in 2012, the winning teams were all using convolutional neural networks, and the error rates started to drop dramatically. Everyone switched, and the rates fell to a few percent. The details of ILSVRC change each year, but it has been run every year since, and continues today.
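For readers who haven't seen one, here is a minimal sketch, assuming PyTorch, of the kind of convolutional network involved. It is deliberately tiny and is not AlexNet or any actual ILSVRC winner; the layer sizes are illustrative only, but the shape of the idea is the same: stacked convolutions learn the features, instead of engineers hand-coding detectors for eyes or whiskers.

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Illustrative image classifier for 224x224 RGB input and 1,000 categories."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # learn low-level filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 224 -> 112
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # learn compound features
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 112 -> 56
        )
        self.classifier = nn.Linear(64 * 56 * 56, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)        # flatten feature maps per image
        return self.classifier(x)      # one score per category

# Usage: classify a batch of four random "images".
model = TinyConvNet()
images = torch.randn(4, 3, 224, 224)
logits = model(images)                 # shape: (4, 1000)
predictions = logits.argmax(dim=1)     # index of the most likely category
```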
The dog breeds turned out to be an area where the networks rapidly did better than humans. I can't track it down now, but I read about one researcher who actually trained himself to recognize the different breeds; even so, the neural networks still did better.
Having a huge dataset to use for driving algorithm development turned out to be the missing jigsaw piece that enabled the rapid and enormous advances in AI that have taken place, especially in the last 5 or 6 years. I've seen estimates that AI has advanced more in the last 3 years than in the decades since the ideas were first toyed with back in the 1950s.
Data
Test data is extraordinarily important. You've probably heard that "data is the new oil" but actually it is processed data that is valuable (like processed oil). There were already millions of pictures on the net, but classifying a few million of them suddenly made them useful.
In the area of automated driving and specialized image recognition, there is the GTSDB, the German Traffic Sign Detection Benchmark. Cadence has (or maybe had, these things change fast) the leading network for identifying the signs, and it performs better than humans. If you've not seen the database, you might wonder how a human would ever get anything wrong, since traffic signs aren't that hard to identify. But some of them are in fog, at dusk, or covered with dirt, and so on. Yes, the clearest ones are like identifying signs in a driving handbook, but the more obscure ones are barely identifiable at all.
Paul McLellan
Editor, Breakfast Bytes, Cadence Design Systems