The "Wake-Sleep" Algorithm for Unsupervised Neural Networks

Everyone dreams. No one knows why. The 26 May 1995 issue of Science has an article titled The "Wake-Sleep" Algorithm for Unsupervised Neural Networks, which suggests both a purpose and a mechanism for dreams.

Unsupervised learning

One of the major outstanding questions about biological neural networks is how they accomplish unsupervised learning. In a supervised learning scheme, an experimenter compares the current output of the network with the desired output, and adjusts the connection weights in the network to minimize the difference.

Of course, there is no experimenter to provide a desired output when a biological neural network learns. Somehow, it does it on its own. The wake-sleep algorithm suggests that the brain dreams in order to generate desired outputs. These outputs are then used to drive the learning process.

The popular account

NPR ran a story on the article when it was published. They described the algorithm like this:

Suppose a child sees a cup. The child doesn't know anything about cups, but its brain stores away the image. Later, the child dreams. The brain takes the image of the cup, and spins off fantasies about it: what it could look like, how it might feel, what it might be used for. These fantasies are also stored. The next time the child sees a cup, its brain compares its current experience to the stored fantasies, and preferentially keeps the fantasies that agree with experience. In this way, the brain builds up categories that abstract the essential elements of the world around it.

The real scoop

I was interested enough in this account to read the article. The article turns out to be highly technical, and to be motivated by considerations other than dreams. I can present only a sketch here; if you want to understand it further, you'll need to read it yourself.

Helmholtz machines

The premise of the article is that a neural net should represent its stimulus with a minimal-length description. To do this, the network maintains two sets of connections: recognition connections, which run from inputs to outputs, and generative connections, which run from outputs back to inputs. The authors call such networks Helmholtz machines.

When a stimulus is applied to the network, the recognition connections map it to the output category that best describes it. Then, the generative connections that run from that particular output back to the inputs create—generate—an image of that category at the input neurons. The image of the category is subtracted from the actual stimulus. Thus, the network represents any stimulus by specifying an output category, together with the differences between the actual stimulus and the nominal stimulus for that category. This jibes with our experience of the world: we see what is really out there, even if it doesn't exactly match one of our existing categories.

Optimization

To minimize the expected length of the description, the output categories should lie close to frequently encountered stimuli. That way, the number of bits required to represent the differences between the actual stimulus and the nominal stimulus will—on average—be small. But there are tradeoffs. Increasing the number of output categories reduces the number of bits necessary to represent the differences, but increases the number of bits necessary to specify a particular output category. The actual optimum depends, of course, on the statistics of the input stimuli.
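
To make the representation and the tradeoff concrete, here is a toy sketch in Python. It is not from the article: the binary stimuli, the hand-picked category images, and the crude cost model (log2 of the number of categories, plus log2 of the number of inputs for every input that differs) are all assumptions chosen for illustration.

```python
import math

def hamming(a, b):
    """Number of positions at which two binary patterns differ."""
    return sum(x != y for x, y in zip(a, b))

def encode(stimulus, prototypes):
    """Describe a stimulus as (nearest category, positions that differ from it)."""
    category = min(range(len(prototypes)),
                   key=lambda k: hamming(stimulus, prototypes[k]))
    nominal = prototypes[category]
    differences = [i for i, (s, p) in enumerate(zip(stimulus, nominal)) if s != p]
    return category, differences

def decode(category, differences, prototypes):
    """Rebuild the stimulus from the category image plus the differences."""
    image = list(prototypes[category])
    for i in differences:
        image[i] = 1 - image[i]
    return image

def description_bits(stimulus, prototypes):
    """Assumed cost model: name the category, then name each differing input."""
    _, differences = encode(stimulus, prototypes)
    category_bits = math.log2(len(prototypes))
    difference_bits = len(differences) * math.log2(len(stimulus))
    return category_bits + difference_bits

# Hypothetical 8-input example.
A = [1, 1, 1, 1, 0, 0, 0, 0]
B = [0, 0, 0, 0, 1, 1, 1, 1]
stimulus = [1, 1, 1, 0, 0, 0, 0, 0]

print(description_bits(stimulus, [B]))     # 21.0: no category bits, 7 differences
print(description_bits(stimulus, [A, B]))  # 4.0: 1 category bit, 1 difference
```

In this example the second category pays for itself. Whether it would do so on average depends on how often stimuli near A actually turn up, which is the point about input statistics.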

Stochastic neurons

One thing to understand is that all the neurons being discussed here are stochastic neurons. This means that the inputs to a neuron do not determine its output. Rather, they determine the probability distribution of its output. In a network of stochastic neurons, a meaningful output can be obtained only by averaging neural activity over many neurons or many firing cycles.
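
A minimal sketch of a stochastic neuron, in Python. The logistic function used to turn the summed input into a firing probability is an assumption made here for illustration, and the function names are hypothetical.

```python
import math
import random

def firing_probability(inputs, weights, bias=0.0):
    """The inputs fix only the probability that the neuron fires."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-total))

def fire(inputs, weights, bias=0.0, rng=random):
    """One firing cycle: output 1 with the probability above, else 0."""
    return 1 if rng.random() < firing_probability(inputs, weights, bias) else 0

def average_activity(inputs, weights, bias=0.0, cycles=10000, seed=0):
    """Average the output over many firing cycles to get a usable signal."""
    rng = random.Random(seed)
    return sum(fire(inputs, weights, bias, rng) for _ in range(cycles)) / cycles
```

A single call to fire tells you very little. The average over many cycles settles near the firing probability, and that average is what carries the meaning.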

Free energy

The article asserts that the statistics involved in minimizing description lengths are formally the same as the statistics involved in minimizing the free energy of a thermodynamic system. This isn't completely astonishing, considering the deep connection between the entropy of thermodynamics and the entropy of information.
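
One way to write the correspondence down is sketched below. The notation is chosen here for illustration and may not match the article's symbol for symbol: d is a stimulus, alpha is one possible internal representation of it, Q is the probability that the recognition connections pick that representation, and costs are measured with natural logarithms.

```latex
% Cost of describing d by way of representation \alpha:
% name the representation, then describe the stimulus given it.
C(\alpha, d) = C(\alpha) + C(d \mid \alpha)

% Expected description length when \alpha is picked stochastically.
% With C playing the role of energy, this is energy minus entropy:
% a free energy at temperature 1.
F(d) = \sum_{\alpha} Q(\alpha \mid d)\, C(\alpha, d)
     + \sum_{\alpha} Q(\alpha \mid d) \log Q(\alpha \mid d)

% F(d) is smallest when Q is the Boltzmann distribution over the costs.
Q(\alpha \mid d) = \frac{e^{-C(\alpha, d)}}{\sum_{\beta} e^{-C(\beta, d)}}
```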

Training

In order to train a Helmholtz machine, you have to adjust both sets of connection weights: the recognition weights and the generative weights.

The generative weights can be adjusted while the network is in use (awake), using a simple gradient-descent technique. In essence, you adjust the generative weights to minimize the difference between the actual input and the image of the input that they create. In this case, the actual input serves as the desired output for adjusting the weights.

The recognition weights cannot be adjusted while the network is awake, because there isn't any desired output to drive the adjustment. However, while the network is not in use (asleep), desired outputs can be generated by randomly activating output neurons, and then using the generative weights to create simulated inputs (dreams). The recognition weights then map the simulated input to an output category.

Since these are stochastic neurons, activating a given output leads to a distribution of simulated inputs. This distribution has the statistics required to train the recognition weights. For each simulated input, the recognition weights are adjusted to maximize the probability of mapping that input back to the output neuron that was activated in order to generate it.

In summary, while the network is awake, the generative weights are adjusted to better model real inputs. While the network is asleep, the recognition weights are adjusted to better recognize the dreams created by the generative weights.
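
The two phases can be sketched in a few dozen lines of Python. This is a toy under stated assumptions, not the authors' implementation: a single layer of inputs and a single layer of outputs, logistic stochastic neurons, a simple delta rule for both sets of weights, and an extra set of biases (out_b) that sets how often each output fires during a dream. The class name, method names, learning rate, and data are all hypothetical.

```python
import math
import random

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sample(probs, rng):
    """Stochastic neurons: each fires with its own probability."""
    return [1 if rng.random() < p else 0 for p in probs]

class TinyHelmholtz:
    def __init__(self, n_in, n_out, lr=0.1, seed=0):
        self.rng = random.Random(seed)
        self.lr, self.n_in, self.n_out = lr, n_in, n_out
        self.rec_w = [[0.0] * n_in for _ in range(n_out)]  # inputs -> outputs
        self.rec_b = [0.0] * n_out
        self.gen_w = [[0.0] * n_out for _ in range(n_in)]  # outputs -> inputs
        self.gen_b = [0.0] * n_in
        self.out_b = [0.0] * n_out  # assumed: how often outputs fire in dreams

    def recognize(self, x):
        return [sigmoid(self.rec_b[j] + sum(w * v for w, v in zip(self.rec_w[j], x)))
                for j in range(self.n_out)]

    def generate(self, y):
        return [sigmoid(self.gen_b[i] + sum(w * v for w, v in zip(self.gen_w[i], y)))
                for i in range(self.n_in)]

    def wake(self, x):
        """Awake: the real input is the target for the generative weights."""
        y = sample(self.recognize(x), self.rng)
        image = self.generate(y)
        for i in range(self.n_in):
            err = x[i] - image[i]
            self.gen_b[i] += self.lr * err
            for j in range(self.n_out):
                self.gen_w[i][j] += self.lr * err * y[j]
        for j in range(self.n_out):
            self.out_b[j] += self.lr * (y[j] - sigmoid(self.out_b[j]))

    def dream(self):
        """Activate outputs at random, then generate a simulated input."""
        y = sample([sigmoid(b) for b in self.out_b], self.rng)
        x = sample(self.generate(y), self.rng)
        return y, x

    def sleep(self):
        """Asleep: the outputs that caused the dream are the target
        for the recognition weights."""
        y, x = self.dream()
        guess = self.recognize(x)
        for j in range(self.n_out):
            err = y[j] - guess[j]
            self.rec_b[j] += self.lr * err
            for i in range(self.n_in):
                self.rec_w[j][i] += self.lr * err * x[i]

# Hypothetical usage: alternate wake and sleep steps on two toy patterns.
net = TinyHelmholtz(n_in=4, n_out=2)
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for _ in range(200):
    net.wake(net.rng.choice(data))
    net.sleep()
```

Notice that neither phase needs an experimenter. In wake, the error is measured against the real input. In sleep, it is measured against the outputs that the network itself chose to activate.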

Sound bite

Dreams are random images that the brain creates so that it can practice recognizing them.

Experiment

So you run the network in wake-sleep cycles, and it learns. The authors got a big collection of handwritten digits from the post office. They fed the digits to the network, and it learned to recognize them—without anyone telling it what to look for.

They also printed out the network's dreams—the simulated input patterns. Images of the 10 digits appeared in the dreams. What's more, the images were not idealized representations of the digits. Rather, they were all variations on the ideal, and the variations in the dreams very much resembled the variations found in the actual input data. A sample of the input and a sample of the dreams looked pretty much the same.

One last point of concordance between this theory and observation is the bidirectional neural connections. It is well established that many parts of the brain have connections running both from inputs to outputs and from outputs to inputs. The purpose of connections from inputs to outputs seems obvious, but there has not been any good account of connections from outputs to inputs. The wake-sleep algorithm would provide one.

Wild speculation

On this theory, you could start to understand why people hallucinate when they aren't allowed to dream. If you don't dream, then your brain can't keep its perceptual categories optimized for your current reality. And a divergence between perception and reality is essentially the definition of a hallucination.

You could also start to understand the experience of dreaming: a dream would appear to be the subjective experience of your brain running backwards.

Notes

everyone: All mammals, actually
article: G. E. Hinton, P. Dayan, B. J. Frey, R. M. Neal, Science 268, 1158 (1995)
Helmholtz machines: Helmholtz was an early advocate of the idea that the perceptual system uses generative models.
jibes: Students of philosophy, psychology, politics and sociology spend much time demonstrating that our perceptual categories do affect our perceptions. This may well be true, especially for the very abstract categories which they study. However, for the vast amount of ordinary, second-to-second stimuli with which we must contend, our brains do an admirable job of presenting us with both the raw data and the categories into which it falls. For example, no matter what font you render this document in, you see the actual shape of each glyph, while still apprehending the letter for which the glyph stands.
averaging: If nothing else, you can imagine that this would lead to a very robust system. If the whole network is designed to run by averaging over neurons, and one of them dies, or there's some noise somewhere, well, who's going to notice?
post office: The post office is very interested in handwriting recognition, because they want to read zip codes by machine.
hallucinate: Someone once told me that they become "psychotic" when they haven't had enough sleep.

Translations

Azerbaijani translation courtesy of Amir Abbasov
Bulgarian translation courtesy of Zlatan Dimitrov
Spanish translation courtesy of Laura Mancini

Steven W. McDougall / resume / swmcd@theworld.com / 1997 November 1