This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm.
In their 1980 book “Metaphors We Live By,” cognitive linguists George Lakoff and Mark Johnson famously argue that practically all human thought is governed by metaphors – creative analogies that help explain one thing in terms of something else. By viewing, say, “time” as “money,” we implicitly and automatically allow concepts related to “saving,” “wasting,” or “investing” to govern our thoughts about it. Or by thinking of “theory” as a “building,” we allow concepts like “foundation,” “constructing,” and “support” to structure our discussions.
The influential scholar and author Douglas Hofstadter (an inspiration and role model for many of us over the past decades) has gone even further. Over many years, Hofstadter has argued that human thought is in fact nothing other than “a dozen analogies per second”; in other words, metaphors, in the widest sense, structure every aspect of cognition, all the way from simple everyday activities to deep scientific discoveries. Cognition is analogy making.
Despite its appeal, analogy making had only a limited influence on AI in its early years, mostly in the form of computer models that mimicked certain aspects of metaphoric thinking in toy examples, without much lasting impact. It has also served as a guiding principle in debates around cognition and AI, perhaps most famously by suggesting that embodiment is an important ingredient for advancing AI (by allowing an AI system to correctly interpret concepts that are metaphorically linked to physical ones, as in the sentence: “She is on top of the situation”).
Analogy making is weight sharing
A possible explanation why metaphors are so prevalent in human thought is that they allow us to share neural circuitry: by recruiting neural firing patterns that are commonly active when you think of “building,” metaphors allow us to share, and make readily available, all that we know about buildings when thinking about any metaphorically related concept, like “theory.” A very similar (arguably, “the same”) kind of sharing is prevalent in machine learning. In fact, one could argue that a variant of Hofstadter’s extreme reading of cognitive metaphor (“a dozen analogies per second”) has been governing almost all aspects of deep learning in the last few decades.
A key problem in deep learning is that models are data hungry. Statistical common sense prescribes that the more parameters a model has, the more data we need to train it. This is true of pretty much all kinds of learning, from supervised and self-supervised to reinforcement learning. The only way around the problem is to keep the number of training examples per parameter large, and neural network researchers have converged on a widely used way of doing so: weight sharing.
In fact, it is hard to find a neural network that does not make use of weight sharing in one form or another. Convolutional networks, for example, apply a single filter at multiple locations in an image, resulting in a parameter reduction of several orders of magnitude compared with a fully connected network. Recurrent networks share a single set of connections across timesteps. Transfer learning applies part of a network across multiple tasks. Weight sharing is so prevalent that it sometimes hides in plain sight: in any multi-layer neural network, neurons in higher layers share with their peers the activation patterns and synaptic connections of all the layers below. Even deep learning itself can therefore be thought of as a way to use weight sharing implicitly.
Convolution amounts to applying one filter to many locations, while metaphors amount to applying one concept in many contexts.
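To make the statistical benefit concrete, here is a minimal sketch in PyTorch (an illustration added here, not part of the original argument) that compares the parameter count of a small convolutional layer with that of a fully connected layer producing an output of the same size. The image resolution, channel counts, and filter size are illustrative choices, not taken from any particular model.

```python
# Minimal sketch: how much weight sharing a convolution buys over a fully
# connected layer on a 3x224x224 image, both producing 64 feature maps at
# the same spatial resolution.
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
conv_params = sum(p.numel() for p in conv.parameters())

# A fully connected layer computing an output of the same size would need
# one weight per (input pixel, output unit) pair, so we count its parameters
# analytically rather than allocating it (it would not fit in memory).
in_features = 3 * 224 * 224
out_features = 64 * 224 * 224
fc_params = in_features * out_features + out_features  # weights + biases

print(f"conv parameters:            {conv_params:,}")  # 1,792
print(f"fully connected parameters: {fc_params:,}")    # ~483 billion

# The convolution reuses the same 64 small filters at every spatial
# location; its parameter count does not grow with image size at all.
```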
In the same way that cognitive metaphors are prevalent in human cognition, weight sharing is prevalent in AI. And this may be no coincidence. In fact, we can think of both as one and the same thing, serving the same simple purpose: statistical efficiency to enable learning.
Machine learning: from feedforward specialists to recurrent generalists
The statistical benefits of sharing can push AI development in directions that are sometimes counterintuitive. Weight sharing, and its ability to improve the statistical efficiency of learning, pushes us toward holistic development and toward building increasingly generalist models. It also pushes us away from reductionist “divide-and-conquer” approaches, which are not only common but deeply ingrained in engineering culture. It shifts the challenge from analyzing, decomposing, and then building a model for a task to finding ways to generate the data that allows a network to learn any required components, and their integration, end-to-end by itself.
The trend toward end-to-end learning took off with object and speech recognition around 2010 and the subsequent practice of “penultimate layer” fine-tuning of pre-trained models. But it is far from concluded and may push neural networks toward significantly higher levels of abstraction and capability in the years to come.
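As a concrete illustration of this kind of sharing across tasks, the sketch below shows the standard recipe of penultimate-layer fine-tuning in PyTorch: freeze a pre-trained backbone and train only a freshly initialized output layer. The choice of ResNet-18 and a 10-class target task is purely illustrative; it is not a description of any specific system discussed in this post.

```python
# Minimal sketch of "penultimate layer" fine-tuning: reuse a pre-trained
# backbone as a fixed feature extractor and train only a new final layer.
import torch.nn as nn
import torchvision.models as models

# In practice the backbone would be loaded with pre-trained ImageNet
# weights; we skip the download in this sketch and only show the recipe.
backbone = models.resnet18()

# Freeze everything: the shared layers keep the representations they
# learned on the source task.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final layer with a fresh head for a hypothetical 10-class
# target task; only these weights will be trained.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

trainable = [p for p in backbone.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable):,}")  # 5,130
```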
Most visibly, weight sharing is currently fueling a (likely irreversible) long-term trend toward recurrent networks, the harbingers of which are popular large auto-regressive language models. The reason is that a recurrent, or auto-regressive, network can absorb a much wider variety of concepts and capabilities than any feedforward classification or regression model ever could. One way to see this is to note that an auto-regressive model is trained to output a sequence incrementally, not a single class label, and there are combinatorially many instantiations of that sequence, or “labels,” to draw from for training. The incredible breadth of tasks that an auto-regressive model can be trained on can also be illustrated by viewing each element in the output sequence as an “action.” This has expanded the source of possible supervision signals to include text, sensory inputs, and even reinforcement learning signals. Viewed from the perspective of conceptual metaphors, this means that such models can learn to leverage connections not just between static concepts or features, but also between dynamic “routines,” strategies, affordances, or “skills” in the widest sense.
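The following minimal sketch (again an added illustration, not any specific production model) shows why the auto-regressive objective is so rich in supervision: a tiny recurrent language model receives a training signal at every position of every sequence, with the recurrent weights shared across all timesteps.

```python
# Minimal sketch of auto-regressive training: every position in the
# sequence acts as a training "label", so a single sequence provides many
# supervision signals instead of one class label.
import torch
import torch.nn as nn

vocab_size, hidden_size = 1000, 128

class TinyAutoregressiveLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # A recurrent core: one set of connections shared across timesteps.
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)  # next-token logits at every position

model = TinyAutoregressiveLM()
tokens = torch.randint(0, vocab_size, (4, 32))   # a batch of 4 sequences of 32 tokens
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each token from its prefix
logits = model(inputs)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # gradients flow from 31 positions per sequence, not from a single label
```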
We can think of metaphors as a way to exploit “high-level invariances”: things that are constant and do not change. Whereas a convolutional network exploits low-level, spatial invariances by applying one filter to multiple locations in an image, metaphors exploit high-level invariances by applying one “thought process” to multiple different concepts or situations. Cognitive capabilities at a high level of abstraction are referred to as “System-2” capabilities in psychology, to contrast them with lower-level perception (or “System-1”). This distinction has been studied in great depth by Nobel laureate Daniel Kahneman, who argues that although System-2 takes on the deliberate, controlling roles, it is really System-1 that is in the driver's seat most of the time. Similarly, one could argue that while System-2 thought processes can appear syntactic and mechanical on the surface, it is the use of metaphors and analogies that breathes life into them, by adding meaning, insight, and sometimes creativity.
In our research group at Qualcomm AI Research, we believe that there is a huge opportunity in studying the types of insight and the ability to “think” metaphorically that a neural network can acquire. This amounts to carefully choosing the data, tasks, and modalities that can elicit potential synergies and connections, and to slowly but steadily raising the level of abstraction at which weight sharing can exert an influence. For example, we are studying how pre-training on language can provide a model with concepts that improve its decision-making ability, or how text-based reasoning can be combined with, and help a model better understand, a video stream.
Neural networks lack the kind of body and grounding that human concepts rely on. A neural network’s representation of concepts like “pain,” “embarrassment,” or “joy” will not bear even the slightest resemblance to our human representations of those concepts. Its representation of concepts like “and,” “seven,” or “up” will be more closely aligned, albeit still vastly different in many ways. Nevertheless, one crucial aspect of human cognition, which neural networks seem to master increasingly well, is the ability to uncover deep and hidden connections between seemingly unrelated concepts and to leverage these in creative and original ways. As the level of abstraction at which we train our networks rises, so does the level of capability with which they surprise and amaze us.
Roland Memisevic
Senior Director of Engineering, Qualcomm Canada ULC