of Ray Solomonoff

Grace Solomonoff; Box 400404;

Cambridge, Mass. 02140; U.S.A.

trovaxo@yahoo.com

Theory and Applications

We first define Algorithmic Probability, an extremely powerful method of inductive inference. We discuss its completeness, incomputability, diversity and subjectivity and show that its incomputability in no way inhibits its use for practical prediction. Applications to Bernoulli sequence prediction and grammar discovery are described. We conclude with a note on its employment in a very strong AI system for very general problem solving.

in Generating the Universal Probability Distribution

In order to generate a universal probability distribution to extrapolate a binary string x of length i, we feed random bits into a universal device, M. When we find an input string that gives an output matching x, we continue the successful input with random bits until M produces a zero or one as output. The relative probabilities of these two continuations can give a normalized prediction for the probability of the symbol following x. There is, however, a probability, P_{i+1}(u), that the continued random input string will not generate any output for the (i+1)th symbol. We will show that the expected value of the sum of these P_i(u)'s is <= k ln 2, k being the Kolmogorov complexity, with respect to M, of the distribution that generated x.
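The sampling procedure described above can be sketched in Python, with a toy stand-in for the universal device M. Everything here is illustrative, not Solomonoff's: `toy_machine` is a tiny period-plus-pattern machine, not a universal one, and the step-free "no output" case only loosely models the event u.

```python
import random

def toy_machine(program, n_out):
    """Toy stand-in for the universal device M: the first 3 program bits
    encode a period p in 1..8; the next p bits are a pattern that is
    repeated to produce output. Returns the first n_out output bits, or
    None if the program is too short to produce output (a crude stand-in
    for the 'no output' event u)."""
    if len(program) < 3:
        return None
    p = 1 + int("".join(map(str, program[:3])), 2) % 8
    if len(program) < 3 + p:
        return None
    pattern = program[3:3 + p]
    return (pattern * ((n_out // p) + 1))[:n_out]

def predict_next(x, trials=100_000, prog_len=16, seed=0):
    """Monte Carlo estimate of P(next bit = 1 | x): feed random bit
    strings into the machine, keep those whose output extends x, and
    normalize over the two possible continuations."""
    rng = random.Random(seed)
    counts = {0: 0, 1: 0}
    for _ in range(trials):
        prog = [rng.randint(0, 1) for _ in range(prog_len)]
        out = toy_machine(prog, len(x) + 1)
        if out is None:
            continue                  # no output produced at all
        if out[:len(x)] != x:
            continue                  # output does not match x
        counts[out[len(x)]] += 1      # tally the (i+1)th symbol
    total = counts[0] + counts[1]
    return counts[1] / total if total else None
```

For example, `predict_next([1, 0, 1, 0, 1, 0])` strongly favors 1, since short programs that reproduce the alternating prefix overwhelmingly continue it.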

I will first discuss current work in machine learning -- in particular, feedforward Artificial Neural Nets (ANN), Boolean Belief Nets (BBN), Support Vector Machines (SVM), Radial Basis Functions (RBF) and Prediction by Partial Matching (PPM). While they work quite well for the types of problems for which they have been designed, they do not use recursion at all, and this severely limits their power.

Among techniques employing recursion, Recurrent Neural Nets, Context Free Grammar Discovery, Genetic Algorithms, and Genetic Programming have been prominent.

I will describe the Universal Distribution, a method of induction that is guaranteed to discover any describable regularities in a body of data, using a relatively small sample of the data. While the incomputability of this distribution has sharply limited its adoption by the machine learning community, I will show that paradoxically, this incomputability imposes no limitation at all on its application to practical prediction.

My recent work has centered mainly on two systems for machine learning. The first might be called "The Baby Machine." We start out with the machine having little problem-specific knowledge, but a very good learning algorithm. At first we give it very simple problems. It uses its solutions to these problems to devise a probability distribution over function space to help search for solutions to harder problems. We give it harder problems and it updates its probability distribution on their solutions. This continues recursively, solving more and more difficult problems.

The task of writing a suitable training sequence has been made much easier by Moore's Law, which gives us enough computer speed to enable large conceptual jumps between problems in the sequence.

A more difficult task is that of updating the probability distribution when new problems are solved. I will be using some of the current techniques for machine learning in this area: PPM and Grammar Discovery are particularly promising. It is my impression that the probability models need not be recursive for the initial learning. Later the system can, with suitable training, work on the problem of updating its own probability distribution, using fully recursive models.

Genetic Programming is my second area of recent research. Koza's Genetic Programming system has been able to solve many difficult problems of very varied kinds. The system itself is extremely slow, however -- it has taken a cluster of 1000 Pentiums about a month to solve some of the more difficult problems. I have found a large number of modifications of the system that I expect will dramatically improve its speed and broaden the scope of problems it can solve.

The Future --- The cost-halving time constant in Moore's Law is now about 18 months. When A.I. plays a significant role in the reduction of this time constant, we will begin to move toward the singularity. At the present time I believe we have a good enough understanding of machine learning for this to take place. While I hesitate to guess just when the singularity will occur, I would be much surprised if it took as long as 20 years. As for the next 50 years of A.I., I feel that predicting what will happen after this singularity is much like guessing what the universe was like before the Big Bang -- it's a real discontinuity!

Preliminary Report for NIPS 2002 Workshop

We will describe recent developments in a system for machine learning that we've been working on for some time (Sol 86, Sol 89). It is meant to be a ``Scientist's Assistant'' of great power and versatility in many areas of science and mathematics. It differs from other ambitious work in this area in that we are not so much interested in knowledge itself as we are in how it is acquired - how machines may learn. To start off, the system will learn to solve two very general kinds of problems. Most, but perhaps not all, problems in science and engineering are of these two kinds.

The first kind is Function Inversion. These are the P and NP problems of computational complexity theory. They include theorem proving, solution of equations, symbolic integration, etc.

The second kind of problem is Time Limited Optimization. Inductive inference of all kinds, surface reconstruction, and image restoration are a few examples of this kind of problem. Designing an automobile in 6 months that satisfies certain specifications and has minimal cost is another.

In the following discussion, we will be using the term ``Probability'' in a special sense: i.e. the estimate given by the best probabilistic model for the available data that we can find in the available time.

Our system starts out with a small set of Problem Solving Techniques (PST's) and a simple General Conditional Probability Distribution (GCPD). When the system is given a problem, the description of this problem is the ``Condition'' for the GCPD. Its output is a probability distribution on PST's - the likelihood that each of them will be the best way to solve the problem. It uses these PST's and their associated probabilities to solve the first problem.
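A minimal count-based sketch of the GCPD idea - a conditional distribution over PST's that sharpens as problems are solved - might look like the following. The domain tags and PST labels are invented for illustration; this is not the system's actual update algorithm, just a smoothed frequency table.

```python
from collections import defaultdict

def make_gcpd():
    """Minimal stand-in for the GCPD: given a problem's 'condition'
    (here just a domain tag), return a probability distribution over
    problem-solving techniques (PST's), using Laplace-smoothed counts
    of which PST solved past problems in that domain."""
    counts = defaultdict(lambda: defaultdict(int))
    psts = ["lsearch", "ppm", "grammar_discovery"]   # illustrative labels

    def predict(domain):
        total = sum(counts[domain][t] for t in psts)
        # add-one smoothing: unseen domains get a uniform distribution
        return {t: (counts[domain][t] + 1) / (total + len(psts))
                for t in psts}

    def update(domain, winning_pst):
        # record which PST turned out to be the best way to solve it
        counts[domain][winning_pst] += 1

    return predict, update
```

After two problems in a hypothetical "algebra" domain are solved by `lsearch`, `predict("algebra")` shifts its mass toward `lsearch` while still assigning nonzero probability to the other PST's.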

Next, it executes its Update Algorithm: the PST's are modified, new ones may be added, some may be deleted. The GCPD is modified. These changes attempt to incorporate into the system information obtained in solving recent problems, and to prepare it to solve harder problems.

The next problem presentation and the system's solution are followed by the corresponding update, and so on. After the nth update, the system is usually able to work problems more difficult than the nth. Giving the system a suitable training sequence of problems of progressive difficulty makes it possible for the system to eventually solve very hard problems.

One critical component in the system is the initial set of PST's. These include Levin's Universal Search Algorithm (Lsearch). Simple Lsearch is only practical for very small problems, since its solution time is exponential in problem size. In the update phase, we modify the probability distribution that is used to guide Lsearch. This effectively reduces the sizes of more difficult problems, so that Lsearch becomes feasible for them.
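Lsearch itself can be sketched briefly: in phase k, every candidate program of length at most k is run for 2^(k - length) steps, so total work roughly doubles per phase and short, fast programs are found first. The sketch below assumes a toy interpreter (`toy_run`) invented for illustration; a real implementation would interleave genuine program executions.

```python
import itertools

def lsearch(is_solution, run, max_phase=20):
    """Minimal sketch of Levin's Universal Search over binary 'programs'.
    In phase k, each program p with len(p) <= k gets 2**(k - len(p))
    steps, so a program of length l needing t steps is found by phase
    about l + log2(t). `run(p, steps)` returns an output, or None if it
    needs more steps; `is_solution` tests the output."""
    for k in range(1, max_phase + 1):
        for length in range(1, k + 1):
            budget = 2 ** (k - length)
            for bits in itertools.product("01", repeat=length):
                p = "".join(bits)
                out = run(p, budget)
                if out is not None and is_solution(out):
                    return p, out
    return None    # nothing found within max_phase

def toy_run(p, steps):
    """Toy interpreter (illustrative only): program p encodes an integer
    n in binary, and 'computing' n*n is modeled as costing n steps."""
    n = int(p, 2)
    return n * n if steps >= n else None
```

For instance, `lsearch(lambda out: out == 49, toy_run)` finds the program `"111"` (n = 7): it first appears with a sufficient step budget in phase 6, where programs of length 3 receive 8 steps.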

The update algorithm is another critical component. It has been designed to facilitate ``transfer learning'', so that the system can utilize any similarities between problems in the same or disparate domains. It has been possible to express this algorithm as an inductive inference problem of a type the system normally solves - so we can simply ask the system to update itself as one of its regular problems.

Perhaps the most critical component of all is the training sequence: devising a sequence of problems so that each of them challenges the system to devise new concepts to solve it, yet keeps the challenges within the capacity of the machine. For PST's that use simple Lsearch, this is not hard to do. It is always possible to get a good estimate of the time needed for the system to find a known solution to a problem. For certain modifications of Lsearch this is not so easy, and designing training sequences for them can become as difficult as teaching people!

The system's generality invites comparison with Genetic Programming. We will discuss differences in the two approaches to optimization problems, and suggest a way in which Genetic Programming might be made a part of the present system.

Another way to look at the system is to think of it as a ``Wrapper'' system - a system that modifies its parameters in view of its experience with earlier problem solving. The present system is an ``Extreme Wrapper System'' in that - through the magic of Levin's Universal Search - it is possible for the entire system to effectively replace itself by one that is more efficient.

Problems in probabilistic induction are of two general kinds. In the first, we have a linearly ordered sequence of symbols that must be extrapolated. In the second we want to extrapolate an unordered set of finite strings.

A very general formal solution to the first kind of problem is well known and much work has been done in obtaining good approximations to it [Sol 64a, Sol 64b, Wal 68, Sol 78, Ris 78, Ris 89, Li 93].

Though the second kind of problem is of much practical importance, no general solution has been published. We present two general solutions for unordered data.

We also show how machines can be constructed to summarize sequential and unordered data in optimum ways.

We will begin with a definition of Algorithmic Probability (ALP), and discuss some of its properties. From these remarks it will become clear that it is extremely effective for computing probabilities of future events - the best technique we have. As such, it gives us an ideal theoretical solution to the problem of inductive inference. I say ``theoretical'' because any device as accurate as ALP must necessarily be incomputable.

For practical induction we use a set of approximations of increasing power to approach ALP. This set is called Resource Bounded Probability (RBP), and it constitutes a general solution to the problem of *practical* induction. Some of its properties are quite different from those of ALP.

The rest of the paper will discuss philosophical and practical implications of the properties of ALP and RBP.

It should be noted that the only argument that need be considered for the use of these techniques in induction is their effectiveness in getting good probability values for future events. Whether their properties are in accord with our intuitions about induction is a peripheral issue. The main question is ``Do they work?'' and we show that they *do work*.

Algorithmic Probability

We have employed Algorithmic Probability Theory to construct a system for machine learning of great power and generality. The principal thrust of present research is the design of sequences of problems to train this system.

Current programs for machine learning are limited in the kinds of concepts accessible to them, the kinds of problems they can learn to solve, and in the efficiency with which they learn - both in computation time needed and/or in amount of data needed for learning.

Algorithmic Probability Theory provides a general model of the learning process that enables us to understand and surpass many of these limitations.

Starting with a machine containing a small set of concepts, we use a carefully designed sequence of problems of increasing difficulty to bring the machine to a high level of problem solving skill.

The use of training sequences of problems for machine knowledge acquisition promises to yield Expert Systems that will be easier to train and free of the brittleness that characterizes the narrow specialization of present day systems of this sort.

It is also expected that the present research will give needed insight in the design of training sequences for human learning.

to Problems in Pattern Recognition and Artificial Intelligence

Recent results in induction theory are reviewed that demonstrate the general adequacy of the induction system of Solomonoff and Willis. Several problems in pattern recognition and A.I. are investigated through these methods. The theory is used to obtain the a priori probabilities that are necessary in the application of stochastic languages to pattern recognition. A simple, quantitative solution is presented for part of Winston's problem of learning structural descriptions from examples. In contrast to work in non-probabilistic prediction, the present methods give probability values that can be used with decision theory to make critical decisions.

A General Theory of Inductive Inference

Some preliminary work is presented on a very general new theory of inductive inference. The extrapolation of an ordered sequence of symbols is implemented by computing the a priori probabilities of various sequences of symbols. The a priori probability of a sequence is obtained by considering a universal Turing machine whose output is the sequence in question. An approximation to the a priori probability is given by the shortest input to the machine that will give the desired output. A more exact formulation is given, and it is made somewhat plausible that the extrapolation probabilities obtained will be largely independent of just which universal Turing machine was used, provided that the sequence to be extrapolated has an adequate amount of information in it.
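The ``more exact formulation'' alluded to here became the universal a priori probability. In modern notation (a standard later rendering, not text from this abstract), the probability a universal Turing machine M assigns to a string x is

```latex
P_M(x) \;=\; \sum_{p \,:\, M(p) = x\ast} 2^{-\ell(p)}
```

where the sum runs over programs p whose output begins with x, and \ell(p) is the length of p in bits. The ``shortest input'' approximation keeps only the largest term, P_M(x) \approx 2^{-\ell(p_{\min})}. The claimed machine-independence is the invariance property: for two universal machines M and M', the probabilities P_M(x) and P_{M'}(x) differ by at most a multiplicative constant that depends on the machines but not on x.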

Some examples are worked out to show the application of the method to specific problems. Applications of the method to curve fitting and other continuous problems are discussed to some extent. Some alternative theories of inductive inference are presented whose validities appear to be corollaries of the validity of the first method described.