The "Wake-Sleep" Algorithm for Unsupervised Neural Networks
Everyone dreams. No one knows why. The 26 May 1995 issue of Science
has an article titled The "Wake-Sleep" Algorithm for Unsupervised
Neural Networks, which suggests both the purpose and the mechanism of
dreams.
Unsupervised learning
One of the major outstanding questions about biological neural
networks is how they accomplish unsupervised learning. In a
supervised learning scheme, an experimenter compares the
current output of the network with the desired output, and adjusts the
connection weights in the network to minimize the difference.
Of course, there is no experimenter to provide a desired output
when a biological neural network learns. Somehow, it does it on its
own. The wake-sleep algorithm suggests that the brain dreams in order
to generate desired outputs. These outputs are then used to drive
the learning process.
The popular account
NPR ran a story on the article when
it was published. They described the algorithm like this:
Suppose a child sees a cup. The child doesn't know anything about
cups, but its brain stores away the image. Later, the child
dreams. The brain takes the image of the cup, and spins off fantasies
about it: what it could look like, how it might feel, what it might be
used for. These fantasies are also stored. The next time the child
sees a cup, its brain compares its current experience to the stored
fantasies, and preferentially keeps the fantasies that agree with
experience. In this way, the brain builds up categories that abstract
the essential elements of the world around it.
The real scoop
I was interested enough by this account to read the article. The
article turns out to be highly technical, and to be motivated by
considerations besides dreams. I can present here only a sketch; if
you want to understand it further, you'll need to read it yourself.
Helmholtz machines
The premise of the article is that a neural net should represent its
stimulus with a minimal-length description. To do this, the network
maintains two sets of connections: recognition connections,
which run from inputs to outputs, and generative connections,
which run from outputs back to inputs. The authors call such networks
Helmholtz machines.
When a stimulus is applied to the network, the recognition connections
map it to the output category that best describes it. Then, the
generative connections that run from that particular output back to
the inputs create—generate—an image of that category at the input
neurons. The image of the category is subtracted from the actual
stimulus. Thus, the network represents any stimulus by specifying an
output category, together with the differences between the actual
stimulus and the nominal stimulus for that category.
This jibes with our experience of the world: we see what is really out
there, even if it doesn't exactly match one of our existing
categories.
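To make the category-plus-differences idea concrete, here is a minimal Python sketch. It is only an illustration: the two templates, the nearest-template rule standing in for the recognition connections, and all of the function names are assumptions of mine, not anything taken from the article.

```python
# Hypothetical toy data: each "image" is a short list of 0/1 pixels.
TEMPLATES = {
    "bar_left":  [1, 1, 0, 0, 0, 0],
    "bar_right": [0, 0, 0, 0, 1, 1],
}

def recognize(stimulus):
    """Stand-in for the recognition connections: pick the closest category."""
    def mismatch(name):
        return sum(s != t for s, t in zip(stimulus, TEMPLATES[name]))
    return min(TEMPLATES, key=mismatch)

def generate(category):
    """Stand-in for the generative connections: the nominal image of a category."""
    return TEMPLATES[category]

def describe(stimulus):
    """Represent a stimulus as (category, differences from the nominal image)."""
    category = recognize(stimulus)
    nominal = generate(category)
    differences = [s - n for s, n in zip(stimulus, nominal)]
    return category, differences

def reconstruct(category, differences):
    """Recover the exact stimulus from its description."""
    return [n + d for n, d in zip(generate(category), differences)]

stimulus = [1, 1, 1, 0, 0, 0]           # a left bar with one extra pixel
category, differences = describe(stimulus)
print(category, differences)            # bar_left [0, 0, 1, 0, 0, 0]
assert reconstruct(category, differences) == stimulus
```

Nothing is lost in the description: the category carries the gist, and the differences carry whatever the category missed.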
Optimization
To minimize the expected length of the description, the output
categories should lie close to frequently encountered stimuli. That
way, the number of bits required to represent the differences between
the actual stimulus and the nominal stimulus will—on average—be
small.
But there are tradeoffs. Increasing the number of output categories
reduces the number of bits necessary to represent the differences, but
increases the number of bits necessary to specify a particular output
category.
The actual optimum depends, of course, on the statistics of the input
stimuli.
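The tradeoff can be seen in a toy calculation. The sketch below assumes a deliberately crude code that is not from the article: naming one of K categories costs log2(K) bits, and each pixel that differs from the nominal image costs log2(n_pixels) bits to give its position. It also ignores the cost of communicating the categories themselves. The stimuli and the function name are hypothetical.

```python
import math

def description_bits(stimuli, templates):
    """Total bits to describe all stimuli with a crude two-part code.

    Assumed code (not from the article): log2(K) bits name one of K
    categories, and each pixel that differs from the category's nominal
    image costs log2(n_pixels) bits to give its position.
    """
    n_pixels = len(stimuli[0])
    category_bits = math.log2(len(templates))
    position_bits = math.log2(n_pixels)
    total = 0.0
    for s in stimuli:
        fewest_diffs = min(sum(a != b for a, b in zip(s, t)) for t in templates)
        total += category_bits + fewest_diffs * position_bits
    return total

# Hypothetical stimuli: two common patterns, each with one rarer variant.
A, A_VARIANT = [1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 0]
B, B_VARIANT = [0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 0, 1, 1, 1]
stimuli = [A, A, A, A_VARIANT, B, B, B, B_VARIANT]

print(description_bits(stimuli, [[0] * 8]))                  # 90.0 (one blank category)
print(description_bits(stimuli, [A, B]))                     # 14.0 (two categories)
print(description_bits(stimuli, [A, A_VARIANT, B, B_VARIANT]))  # 16.0 (four categories)
```

With these particular statistics, two categories beat both one and four: the rare variants are cheaper to describe as differences than as categories of their own.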
Stochastic neurons
All the neurons discussed here are stochastic neurons. This means that
the inputs to a neuron do not determine its output. Rather, they
determine the probability distribution of its output. In a network of
stochastic neurons, a meaningful output can be obtained only by
averaging neural activity over many neurons or many firing cycles.
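A stochastic neuron is easy to simulate. The sketch below assumes a binary neuron whose firing probability is the logistic function of its weighted input; that choice, the weights, and the function names are illustrative assumptions, since this page does not specify them.

```python
import math
import random

def firing_probability(inputs, weights, bias):
    """The inputs fix only the probability that the neuron fires."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-total))

def fire(inputs, weights, bias, rng):
    """One firing cycle: output 1 with the probability above, else 0."""
    return 1 if rng.random() < firing_probability(inputs, weights, bias) else 0

rng = random.Random(0)                      # fixed seed so runs are repeatable
inputs, weights, bias = [1, 0, 1], [0.5, -1.0, 0.3], 0.0

p = firing_probability(inputs, weights, bias)
single_cycles = [fire(inputs, weights, bias, rng) for _ in range(10)]
average = sum(fire(inputs, weights, bias, rng) for _ in range(10000)) / 10000

print(single_cycles)    # ten individual outputs: just a run of 0s and 1s
print(p, average)       # the long-run average should land close to p
```

Any single output says little about the inputs; only the average over many cycles recovers the underlying probability.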
Free energy
The article asserts that the statistics involved in minimizing
description lengths are formally the same as the statistics involved
in minimizing the free energy of a thermodynamic system. This isn't
completely astonishing, considering the deep connection between the
entropy of thermodynamics and the entropy of information.
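The correspondence can be sketched in a few lines of math. The notation here is an assumption chosen for illustration: d is a stimulus, \alpha is one way of describing it (a choice of output category), C is a cost measured in nats, and Q(\alpha | d) is the probability that the recognition connections choose \alpha.

```latex
% Cost of describing stimulus d by way of \alpha:
% name \alpha, then give the differences.
C(\alpha, d) = C(\alpha) + C(d \mid \alpha)

% Expected cost when \alpha is chosen stochastically with probability
% Q(\alpha \mid d): an expected "energy" minus an entropy.
F(d; Q) = \sum_{\alpha} Q(\alpha \mid d)\, C(\alpha, d)
          - \Bigl( -\sum_{\alpha} Q(\alpha \mid d) \log Q(\alpha \mid d) \Bigr)

% F is smallest when Q is the Boltzmann distribution at temperature 1:
P(\alpha \mid d) = \frac{e^{-C(\alpha, d)}}{\sum_{\beta} e^{-C(\beta, d)}}
```

If C(\alpha, d) is read as the energy of state \alpha, then F has the form of a Helmholtz free energy, energy minus temperature times entropy, with the temperature set to 1.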
Training
In order to train a Helmholtz machine, you have to adjust both sets of
connection weights: the recognition weights and the generative
weights.
The generative weights can be adjusted while the network is in use
(awake), using a simple gradient-descent technique. In essence, you
adjust the generative weights to minimize the difference between the
actual input and the image of the input that they create. In this
case, the actual input serves as the desired output for adjusting the
weights.
The recognition weights cannot be adjusted in the same way while the
network is in use. The problem isn't the use itself, but that there
isn't any desired output to drive the adjustment: nothing tells the
network which output category a real input ought to map to.
However, while the network is not in use (asleep), desired outputs can
be generated by randomly activating output neurons, and
then using the generative weights to create simulated inputs
(dreams). The recognition weights then map the simulated input to an
output category.
Since these are stochastic neurons, activating a given output leads to
a distribution of simulated inputs. This distribution has the
statistics required to train the recognition weights.
For each simulated input, the recognition weights are adjusted to
maximize the probability of mapping that input back to the output
neuron that was activated in order to generate it.
In summary, while the network is awake, the generative weights are
adjusted to better model real inputs. While the network is asleep, the
recognition weights are adjusted to better recognize the dreams
created by the generative weights.
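The two phases can be sketched in plain Python. This is a toy under stated assumptions, not the authors' implementation: it has a single layer of input units and a single layer of output units, logistic stochastic neurons, a simple delta rule for both sets of weights, and a made-up two-pattern data set. The class name, method names, and learning rate are all hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    """Stochastic binary units: each fires with its own probability."""
    return [1 if rng.random() < p else 0 for p in probs]

class TinyHelmholtzMachine:
    """One layer of input units and one layer of output (category) units."""

    def __init__(self, n_in, n_out, rng, lr=0.1):
        self.n_in, self.n_out, self.rng, self.lr = n_in, n_out, rng, lr
        self.rec_w = [[0.0] * n_out for _ in range(n_in)]   # input -> output
        self.rec_b = [0.0] * n_out
        self.gen_w = [[0.0] * n_in for _ in range(n_out)]   # output -> input
        self.gen_b = [0.0] * n_in
        self.prior_b = [0.0] * n_out    # generative bias on the output units

    def recognize(self, x):
        """Probability that each output unit fires, given an input."""
        return [sigmoid(self.rec_b[j] + sum(x[i] * self.rec_w[i][j]
                for i in range(self.n_in))) for j in range(self.n_out)]

    def generate(self, y):
        """Probability that each input unit fires, given an output pattern."""
        return [sigmoid(self.gen_b[i] + sum(y[j] * self.gen_w[j][i]
                for j in range(self.n_out))) for i in range(self.n_in)]

    def wake(self, x):
        """Real input drives recognition; adjust the generative weights."""
        y = sample(self.recognize(x), self.rng)
        x_pred = self.generate(y)
        prior = [sigmoid(b) for b in self.prior_b]
        for j in range(self.n_out):
            self.prior_b[j] += self.lr * (y[j] - prior[j])
            for i in range(self.n_in):
                self.gen_w[j][i] += self.lr * y[j] * (x[i] - x_pred[i])
        for i in range(self.n_in):
            self.gen_b[i] += self.lr * (x[i] - x_pred[i])

    def dream(self):
        """Randomly activate outputs, then generate a simulated input."""
        y = sample([sigmoid(b) for b in self.prior_b], self.rng)
        x = sample(self.generate(y), self.rng)
        return y, x

    def sleep(self):
        """A dream drives generation; adjust the recognition weights."""
        y, x = self.dream()
        y_pred = self.recognize(x)
        for j in range(self.n_out):
            self.rec_b[j] += self.lr * (y[j] - y_pred[j])
            for i in range(self.n_in):
                self.rec_w[i][j] += self.lr * x[i] * (y[j] - y_pred[j])

rng = random.Random(0)
data = [[1, 1, 0, 0], [0, 0, 1, 1]]             # hypothetical toy stimuli
net = TinyHelmholtzMachine(n_in=4, n_out=2, rng=rng)
for _ in range(2000):
    net.wake(rng.choice(data))                  # awake: learn to model real inputs
    net.sleep()                                 # asleep: learn to recognize dreams
print([net.dream()[1] for _ in range(5)])       # five simulated inputs
```

Note the symmetry: in wake(), the real input is the target for the generative weights, and in sleep(), the output pattern that produced the dream is the target for the recognition weights. Neither phase needs an outside teacher.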
Sound bite
Dreams are random images that the brain creates so that it can
practice recognizing them.
Experiment
So you run the network in wake-sleep cycles, and it learns. The
authors got a big collection of handwritten digits from the post
office. They fed the digits to the network, and it learned to
recognize them, without anyone telling it what to look for.
They also printed out the network's dreams—the simulated input
patterns. Images of the 10 digits appeared in the dreams. What's more,
the images were not idealized representations of the digits. Rather,
they were all variations on the ideal, and the variations in the
dreams very much resembled the variations found in the actual input
data. A sample of the input and a sample of the dreams looked pretty
much the same.
One last point of concordance between this theory and observation
concerns bidirectional neural connections. It is well established that
many parts of the brain have connections running both from inputs to
outputs and from outputs to inputs. The purpose of connections from
inputs to outputs seems obvious, but there has not been any good
account of connections from outputs to inputs. The wake-sleep
algorithm would provide one.
Wild speculation
On this theory, you could start to understand why people hallucinate when they aren't allowed to
dream. If you don't dream, then your brain can't keep its perceptual
categories optimized for your current reality. And a divergence
between perception and reality is essentially the definition of a
hallucination.
You could also start to understand the experience of dreaming: a dream
would appear to be the subjective experience of your brain running
backwards.
Notes
- everyone: All mammals, actually.
- article: G. E. Hinton, P. Dayan, B. J. Frey, R. M. Neal,
  Science 268, 1158 (1995)
- Helmholtz machines: Helmholtz was an early advocate of the idea that
  the perceptual system uses generative models.
- jibes: Students of philosophy, psychology, politics and sociology
  spend much time demonstrating that our perceptual categories do
  affect our perceptions. This may well be true, especially for the
  very abstract categories which they study. However, for the vast
  amount of ordinary, second-to-second stimuli with which we must
  contend, our brains do an admirable job of presenting us with both
  the raw data and the categories into which it falls. For example, no
  matter what font you render this document in, you see the actual
  shape of each glyph, while still apprehending the letter for which
  the glyph stands.
- averaging: If nothing else, you can imagine that this would lead to
  a very robust system. If the whole network is designed to run by
  averaging over neurons, and one of them dies, or there's some noise
  somewhere, well, who's going to notice?
- post office: The post office is very interested in handwriting
  recognition, because they want to read zip codes by machine.
- hallucinate: Someone once told me that they become "psychotic" when
  they haven't had enough sleep.
Translations
Azerbaijanian translation courtesy of Amir Abbasov
Bulgarian translation courtesy of Zlatan Dimitrov
Spanish translation courtesy of Laura Mancini
Steven W. McDougall / swmcd@theworld.com / 1997 November 1