Fake Guido: AGI

Having given the topic of artificial general intelligence (or AGI) a fair amount of thought over the last decade, a plausible path forward, based on something like an encyclopedia for algorithms, has slowly congealed in my head. I need to write it down before I start coding. Preferably in the form of a blog post, to get feedback. This is that post.

Background

When I write about strong AI or AGI, I mean an algorithm with general problem solving skills. Not necessarily a mind inhabiting a robot, running around talking to people and passing the Turing test (though eventually a successful AGI could be taken in that direction), but rather something that can be applied to a wide variety of problems. For example: playing chess and poker, picking stocks, solving puzzles, proving math theorems, analyzing and writing computer code, speech and image recognition, improve itself by learning and self modification, etc.

There is an alluring approach to AGI, where you begin with a "simple" seed program, which will learn and self improve and eventually evolve to human intelligence and beyond. I think that in practice, the problem is that it has taken lots of people lots of time to invent all those algorithms that can be useful in general problem solving. Humans are pretty good at general problem solving - certainly much better than the best software/hardware combination we have today. Constructing algorithms to solve specific problems is in itself a kind of problem solving, and we humans have certainly invented many different algorithms for a wide variety of purposes. One might suspect that the computational depth of inventing/discovering algorithms is very large. So unless the seed program is actually very advanced, more like a full grown Sequoia than a seed, it might take too much time for it to invent all those algorithms and heuristics that we, as a civilization, already have.

Topics that are potentially useful to understand include:

Statistics (Bayes rule, distributions, Markov chains, running an experiment, etc).
Algorithms for optimizing parameters (genetic algorithms, simulated annealing, steepest descent, linear programming, random testing and purely analytical methods). In some situations it could take days or more to try a single parameter configuration. In other situations evaluating the fitness of a parameter configuration is just a couple of CPU instructions. Approaching these different tasks require a variety of methods
Logic
Basic mathematics (calculus, algebra, geometry, etc)
Code analysis (lambda calculus, etc)
Formal proof methods (knowledge of the methods listed here: http://en.wikipedia.org/wiki/Mathematical_proof) and formal reasoning
Tree and graph searching (depth-first, breadth-first, A*, beam, minimax, alpha-beta, Dijkstra)
Bayesian belief networks
Pattern recognition
Compression
Monte Carlo method
Clustering and classification
Fourier transforms, wavelets
Function approximation (analytical or with neural networks or genetic programming)
Inverting functions (in other words, given a program function and its output tell me what the input was - this turn out to be a very general way of posing questions)

...and of course many more.

I consider statistics and parameter optimization to be the most important areas for intelligence, since you need them to learn. Pattern recognition (perhaps implemented with statistics and optimization) and various forms of tree searching are also vital.

An encyclopedia of algorithms

My approach is based on implementing an encyclopedia of useful algorithms that:

Know to which tasks they can be applied
Can give a rough, initially often ridiculously rough, estimate of what the probability is that they solve the task after a certain time or, in the case of an open-ended task such as optimization, can give a rough estimate of how well the task is solved after a certain time.
Can continuously update the estimate as the task is solved

It is important to stress that it is not enough to just implement a library of algorithms that can work on the same datastructures. The important thing is that you need metadata, describing when an algorithm can be used and the algorithmic complexity in time and memory. With time, you want to automatically build up more knowledge of the algorithms, gradually improving the time and success estimations as well as improving your knowledge of which algorithms are suitable in which situations.

The algorithms should be broken up into as many natural subtasks as possible, so that when new algorithms are added to the system, they can try to solve these subtasks as well, thus creating new hybrid algorithms.

A task is basically a function call together with its arguments. An algorithm that can solve a task implements the corresponding function and a time/success estimator. Similarly to function overloading in C++, the function header might state specializations, additional properties, of the function arguments that must be true for the algorithm to be a contender to solve it. It is important that the Scheduler (see below) immediately knows which algorithms are suitable for a certain task, so the mentioned "additional argument properties" must be immediately available. If a certain property requires work to find out - "is the list in the first argument sorted?" - and an algorithm still needs it, a new algorithm can be constructed that first checks if the list is sorted and then either fails or asks the task again with the new property set. This new algorithm would have higher estimates of running time and lower estimates of success than the original algorithm.

The Scheduler

When a task is added to the task pool it always has a price attached to it. The Scheduler runs those algorithms that currently promise best expected price per time unit. Algorithms that need subtasks solved has to assign a price to those too, before adding them to the task pool. That price should reasonably reflect how much of the overall time the subtask is expected to take. If it turns out that a subtask consistently take a smaller or larger fraction of the estimated total time, there should be algorithms that modify the price for these subtasks and correspondingly the total time estimate (also, see Self Improvement below).

Open-ended tasks where something should be optimized cannot have just one value attached to them. Instead they need to have a function from achieved performance to price, or at least a rough mapping from some performance values to price. This mapping stops the system from optimizing for too long on a relatively unimportant subsubsubtask somewhere.

Algorithms that can either fail or succeed on a task need a similar mapping, where they give probability of success as a function of time.

One can also imagine that the algorithms could give a confidence interval or standard deviation on their estimates to tell the Scheduler how sure they are of their estimates, but I am not quite sure how this should be used, so for now they won't.

For my first try, the Scheduler will use a simple heuristic. The algorithm that claims to have the best price / time ratio for any task currently in the task pool will get to run it. For one thread this will be optimal in some sense. It gets more complex when you have many algorithms running in parallel on multiple cores or even clusters. For example, you want to slightly punish two algorithms trying to complete the same task in parallel, since the first one to succeed will always make the other algorithms work moot. On the other hand sometimes it makes sense to attack an important problem from several angles, so you don't want to forbid it entirely either.

In a later design, the Scheduler should be able to use the task pool to think about how it should Schedule. Obviously this must not end in an infinite Scheduling loop or general inefficiency, since normally the Scheduler must work very quickly.

The Scheduler's work and indeed that of the whole system will not be especially interesting when there are only a few algorithms implemented. The first interesting moment will be when new hybrid algorithms emerge, where subtasks are sometimes handled by unanticipated algorithms. I am not sure how many algorithms needs to be implemented for the system to show interesting emergent behaviour. Probably more than ten, but less than a hundred, depending, of course, on which algorithms and what you count as an individual algorithm.

Self improvement

From the above, you can see that the system will not be self improving at first. However, by adding self improvement tasks, it will start doing things like improving the time estimates of the algorithms, learn to what degree one algorithm's failure to solve a task should also reflect on the estimates of other algorithms, learn which situations are suitable for which algorithms; for example which algorithms perform well on the subtasks posted from a certain algorithm. It can also have an algorithm that constructs new lower-priced training tasks from real tasks, for example generalizations or specializations of a problem, just out of "curiosity".

Producing new/improved code and algorithms, either for self improvement or as the solution for a puzzle or some other task, is among the most advanced tasks the system can try. It will not be able to do much of interest in this area until it is really strong, but it could start out by trying simple modifications of existing algorithms or trying them on similar tasks, a bit like in genetic programming.

The system is also inherently self improving from a sort of network effect, since for each algorithm added, the existing algorithms get potentially better.

What now?

When I have implemented the base system, I will start by applying the AGI to function inversion. Trivial stuff at first, of course, but I hope to eventually make it solve real puzzles like Towers of Hanoi by a combination of searching and deduction. Also, it would be fun to try some games and an NP-complete problem like 3-SAT.

It would be beautiful if the algorithms were written in the same simplified, purely functional (thus easier to analyze), LISP that I plan to write the problem definitions in. Alas, good AI needs to be fast and a 100x slower system just because the algorithms run in my own immature poorly interpreted language instead of C is not so fun. However, a good JIT compiler is a very good test for an AGI. You continuously have to weigh optimization time against running what you have. If the AGI in some distant future JITs its own code, effectively running and optimizing itself, I will consider the entire project a grand success :).

I forget why I called the project Guerilla. It was probably terribly clever. Nevertheless, here is the link to the Github repository: https://github.com/gurgeh/Guerilla. It does not contain much yet.

Compression is one of the most powerful concepts in computing. For the normal computer user, compression is associated with making files smaller, so you can store more of them or send them faster over the Internet, but there is much more to it.

An optimal compressor can (or could, if it existed) be used for prediction of the future of a sequence of events (weather, sports, stocks, political events, etc), for example by trying all possible continuations and examine how well it compresses given the history. Conversely, an optimal predictor that gives the correct probability of each possible next symbol can be used for optimal compression by using arithmetic coding.

Compression and prediction

This section on background theory contains possibly scary math and dense prose, but should be understandable for most programmers. Maybe re-read the sentences a couple of times.

Ray Solomonoff has shown [PDF] that if we let Sk be the infinite set of all programs for a machine M, such that M(Sk) gives an output with X as prefix (i.e the first bits of the output is X), then the probability of X becomes the sum of the probabilities of all of its programs, where the probability of a program is 2 ** (-|Sk|) if |Sk| is the length of the program in bits and "**" means "to the power of". As X gets longer, the error of the predictions approach zero, if the error is calculated as the total squared probability difference.

A technicality is that only those programs count, that does not still produce X, when the last bit of the program is removed.

To give a slightly more concrete example, say that you have a sequence of events - a history - and encode those as a sequence of symbols, X. Let us further say that you have a machine, M, that can read a program S and output a sequence of symbols. If you have no further information on your sequence of events, then the best estimate for the probability of a symbol Z to occur next (i.e the best prediction) is given by the set of all programs that output your history X followed by Z. Programs which output X+Z and are short are weighted higher (the 2 ** (-|Sk|) part).

Even more concretely, given the binary sequence 101010101, you wonder what the probability is that the next bit will be 0 given that you know nothing else of this sequence. Sum 2 ** (-program length) for all programs that output 1010101010 vs those that output 1010101011 as their first bits (they are allowed to continue outputting stuff). If we call these sums sum0 and sum1 respectively, then the probability of 0 coming next is sum0 / (sum0 + sum1) and the probability of 1 coming next is sum1 / (sum0 + sum1).

Obviously you cannot find all these programs by just trying every possible program, because 1) they are infinitely many and 2) given the halting problem you cannot in general know if a running program is in an infinite loop or if it will eventually output X.

There is an area of probability theory called Minimum description length, where the language is chosen to be so simple (not Turing complete) so that you can actually find the shortest program or "description". Calculating probabilities this way is very similar to Bayesian probability, but more general.

Solomonoff in (almost) practice

Although the point of the theorem is not to apply it directly in practice, for short sequences X we can actually try. We can avoid problem 1 above, there are infinitely many programs, by generating random programs and see if they produce X. If they do, we count them. This way we can produce an approximation of what the true sum0 and sum1 are. If we set up our random generation such that shorter programs are more likely, then we don't have to bother with the 2 ** (-program length) part and may just have a running count for each sum. If the sequence is too long, this method will be impractical since almost no randomly generated programs will actually output X.

Problem 2 above, when we test programs they may not halt, is harder, but Levin has proposed a way around it. If we in addition to program length use running time (number of instructions executed) as a measure of the probability of our program, we can start by generating all the programs that we intend to test and then run them all in parallel. As our execution moves forward, we will get an increasingly accurate approximation of the true sum0 and sum1, without getting stuck on infinite loops.

If we want to get even more practical, it can be shown that the shortest program that produces X will generally dominate the others and thus it will predict the most likely next symbol. That way you can just search for programs that output X and the currently shortest program will be your best guess. Since we no longer care about the relative probability of the next symbol, but only which is most likely, the search does not have to be random. Thus we can use any method we like for finding a short program. If you search for programs that produce X and find one that almost does, you can construct a "true" solution from that one, by constructing a prefix part that hard codes the places where your original program is wrong. This will produce a longer program, where the length, and thus the "score", is the length of the prefix + the length of your faulty solution. The size of the shortest program that outputs X is called the Kolmogorov complexity of X. The size of the shortest program that outputs X, measured in program size + log(running time), is called the Levin complexity of X.

One way to find these programs is to use genetic programming, just take care that you don't think that you can count the number of programs that produce X and get relative probabilities, because your search will now be skewed towards the solution (and thus its prediction) that you find first.

A small problem is that depending on what machine you choose, i.e which instructions your programs can use and the length of these instructions, you will get different results. The method has a built in bias, since there is no one correct Turing complete language. This difference will however be smaller as X gets longer. One way to understand that is to note that any Turing complete language can emulate any other Turing complete language and that the size of such an emulator is finite. This is called the compiler theorem.

Compression is understanding and the Hutter Prize

When we understand something, we can describe it succinctly. If I have an image of a red perfect circle, the size will be much larger if I describe the individual pixels rather than just say "a red circle of diameter d and thickness t". When I understand what the image is depicting, I can describe it shorter. Sometimes a lossy compression of observed data will actually express the truth better than the exact data. If I take a photo of a red circle, the photo will probably not be perfect, but if I notice what the photo is showing, I can compress it as "a red circle" and some noise which I throw away, and suddenly my lossy compression is a better depiction of the truth.

This equivalence between compression and general intelligence led Marcus Hutter to announce the Hutter Prize, where money is awarded for the best compression of 100 megabyte of English Wikipedia articles. So far the compression algorithms have been impressive (compressing the text to about 15%), but not shown much intelligence or understanding of the articles. When they do start to exhibit some understanding, I think that if they are allowed to compress the data in a slightly lossy way, the first thing that will happen is that some spelling and layout mistakes will be corrected, because these will be surprising to the compressor and thus demand an unusually long representation.

Matt Mahoney has written a good rationale on the Hutter Prize here.

Compression in practice - the juicy stuff

Compression is a powerful tool to measure success and avoid overfitting in a variety of common AI problems. The methods I laid out here are interesting mostly from a theoretical perspective, because of their prohibitively long running times. In my next post, I will expand on my thoughts on how you can use these results to get actual, practical algorithms for common AI problems.

Fake Guido

Thursday, November 10, 2011

Guerilla - my attempt to build a strong AI

Tuesday, December 28, 2010

Compression, prediction and artificial intelligence

About Me

Blog Archive

I read