# Information Geometry for MOSES Learning

This page contains some rough thoughts (from Ben Goertzel, initially) on how to make MOSES program learning work better within a deme, by using information geometry.

The natural gradient provides a nice way to do local or semi-local search within any space of probability distributions: one performs gradient descent using the natural gradient, which is defined in terms of the Fisher information metric ...
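To make the natural-gradient idea concrete, here is a minimal sketch (not part of the original notes) for the simplest distribution space: product-of-Bernoullis distributions over bit strings, as used in EDA-style optimizers. For independent Bernoulli parameters the Fisher matrix is diagonal, so multiplying by its inverse is just a per-coordinate rescaling. The `fitness` function, learning rate, and sample count below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):
    # toy objective: count of 1-bits (OneMax) on an 8-bit string
    return x.sum()

def natural_gradient_step(p, n_samples=200, lr=0.05):
    """One natural-gradient ascent step on E[fitness] for a product-Bernoulli model.

    For independent Bernoulli(p_i), the Fisher matrix is diagonal with entries
    1/(p_i(1-p_i)), so the natural gradient is the vanilla gradient rescaled by
    p_i(1-p_i).  The vanilla gradient is estimated via the log-likelihood trick:
    grad_i E[f] = E[f(x) * (x_i - p_i) / (p_i(1-p_i))].
    """
    xs = (rng.random((n_samples, p.size)) < p).astype(float)
    fs = np.array([fitness(x) for x in xs])
    fs = fs - fs.mean()                      # baseline to reduce variance
    vanilla = (fs[:, None] * (xs - p)).mean(axis=0) / (p * (1 - p))
    natural = vanilla * p * (1 - p)          # F^{-1} times vanilla gradient
    return np.clip(p + lr * natural, 0.05, 0.95)

p = np.full(8, 0.5)
for _ in range(100):
    p = natural_gradient_step(p)
# the Bernoulli parameters should drift toward 1 for OneMax
```

The point of the rescaling is that the step size is measured in the Fisher metric rather than in raw parameter space, which is exactly the property one would want to carry over to distributions induced by programs.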

However, programs are not probability distributions; so to use this approach one needs a good way of mapping programs into probability distributions.

For programs that map Boolean or real vectors into Boolean or real-number outputs, the method proposed in Yann Ollivier's paper on information-geometric optimization ( http://www.yann-ollivier.org/rech/index.php#Artificial%20intelligence,%20learning,%20optimization ) may be a good one...

For programs implementing more general programmatic constructs (e.g. list, fold) something else may be necessary...

I have thought for a while about the idea of using a program X to induce a probability distribution as follows. Given a program X and another program Y, one can define the probability P_X(Y) as the probability of producing a program behaviorally equivalent to Y, by choosing a random program to run on a machine that is supplied with X as low-cost data. The odds of choosing program Z are lower if Z is longer, and higher if Z accesses more of X, since the cost of accessing bits from X is low compared to the cost of actually using new bits for Z. This induces a distribution on program space, with the basic property that compact programs that are more algorithmically related to X have a higher probability.
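A drastically simplified illustration of this construction (my own toy sketch, not from the original notes): treat programs as bit strings, charge a low per-bit cost for material copied out of X and a high cost for fresh literal bits, and set the probability of each string proportional to 2^(-cost). The greedy copy heuristic and the cost constants are purely hypothetical.

```python
from itertools import product

def description_cost(z, x, copy_cost=0.2, literal_cost=1.0):
    """Toy cost of writing program z (a bit string) when program x is
    available as low-cost data: greedily copy runs of z that occur as
    substrings of x (cheap), otherwise pay full price per literal bit.
    A crude stand-in for 'Z accesses bits of X cheaply'."""
    i, cost = 0, 0.0
    while i < len(z):
        # find the longest prefix of z[i:] occurring somewhere in x
        k = 0
        for j in range(len(z) - i, 0, -1):
            if z[i:i+j] in x:
                k = j
                break
        if k > 1:               # copying a run costs copy_cost per bit
            cost += copy_cost * k
            i += k
        else:                   # emit one literal bit at full price
            cost += literal_cost
            i += 1
    return cost

def induced_distribution(x, length=6):
    """Normalised distribution P_X over all bit strings of a fixed length:
    P_X(z) proportional to 2^(-cost), so strings that are cheap to describe
    given X (i.e. share material with X) get more mass."""
    progs = [''.join(bits) for bits in product('01', repeat=length)]
    weights = {z: 2.0 ** -description_cost(z, x) for z in progs}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}

x = '101010'
p = induced_distribution(x)
# strings that reuse X's pattern get more mass than unrelated ones,
# e.g. p['101010'] > p['111000']
```

Of course the real construction quantifies over all programs behaviorally equivalent to Y on a universal machine, which is what makes it uncomputable; the sketch only shows the shape of the induced bias.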

[Being able to represent a program as a distribution has various nice consequences, e.g. the ability to calculate interaction information among sets of programs.]

If it were easy to compute the probability distribution over program space corresponding to a given program, then one could do search over program space by searching over the space of associated {probability distributions over program space}. Unfortunately, calculating these probability distributions is uncomputable in general, and highly expensive in practice...

On the other hand, approximate methods may be more feasible. It's relatively tractable to calculate the cost of syntactically transforming program X into program Y, using e.g. dynamic programming to find a min-cost transformation path between the corresponding program trees (or dags). The question is whether this cost bears any resemblance to the probability P_Y(X) described above. If the programs are represented syntactically in a "good enough" way, then maybe it will.
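As a minimal sketch of the dynamic-programming part (my illustration, with hypothetical operation costs): the classic edit-distance recurrence over flattened program trees, a stand-in for a real tree or dag edit distance.

```python
def edit_cost(a, b, ins=1.0, dele=1.0, sub=1.5):
    """Min-cost transformation between two token sequences (e.g. prefix
    serialisations of program trees) via the standard edit-distance DP.
    The hope is that low transformation cost roughly corresponds to high
    induced probability."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i-1][j] + dele,                                  # delete a[i-1]
                d[i][j-1] + ins,                                   # insert b[j-1]
                d[i-1][j-1] + (0.0 if a[i-1] == b[j-1] else sub),  # match/substitute
            )
    return d[m][n]

# prefix serialisations of two small Boolean program trees
x = ['and', 'or', 'a', 'b', 'not', 'c']
y = ['and', 'or', 'a', 'b', 'c']
cost = edit_cost(x, y)   # one deletion ('not'): 1.0
```

A real implementation would work on trees directly (so that deleting a subtree is one operation), but the DP structure is the same.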

Suppose one has a number of programs within a MOSES deme. These programs have been produced by iteratively varying the deme's exemplar program, so transformation paths between the programs are already known (though not necessarily min-cost paths). Each program X within the deme may then be approximately associated with a probability distribution, known via the probabilities P_X(Y) for the other programs Y in the deme. One can then calculate the natural gradient direction from the current best guess in the deme, and create a new program in that direction. [Details on this latter point to be filled in later; at very worst it could be done via some sort of stochastic local search.]
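One crude way this could be realised (a hypothetical sketch of mine, not a worked-out part of the proposal): turn each deme member's row of transformation costs into a categorical distribution over the deme, and then pick the move from the current best guess that gains the most fitness per unit of Fisher-Rao distance, a discrete stand-in for following the natural gradient. The cost matrix, temperature, and fitness values are all made up.

```python
import math

def p_from_costs(costs, temperature=1.0):
    """Turn a row of transformation costs (from program X to each deme member)
    into a categorical distribution P_X: low-cost neighbours get more mass."""
    ws = [math.exp(-c / temperature) for c in costs]
    z = sum(ws)
    return [w / z for w in ws]

def fisher_rao(p, q):
    """Fisher-Rao distance between two categorical distributions:
    2 * arccos of the Bhattacharyya coefficient sum_i sqrt(p_i q_i)."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return 2.0 * math.acos(min(1.0, bc))

def best_natural_direction(cost_matrix, fitness, current):
    """Crude stand-in for a natural-gradient step inside a deme: among the
    other deme members, pick the one with the largest fitness gain per unit
    Fisher-Rao distance from the current best guess."""
    dists = [p_from_costs(row) for row in cost_matrix]
    best, best_slope = None, -math.inf
    for j, f in enumerate(fitness):
        if j == current:
            continue
        d = fisher_rao(dists[current], dists[j])
        slope = (f - fitness[current]) / max(d, 1e-9)
        if slope > best_slope:
            best, best_slope = j, slope
    return best

# hypothetical deme: 3 programs, pairwise transformation costs and fitnesses
costs = [[0, 1, 4], [1, 0, 3], [4, 3, 0]]
fit = [0.5, 0.6, 0.9]
step_to = best_natural_direction(costs, fit, 0)
```

Creating a genuinely *new* program in the chosen direction, rather than selecting an existing deme member, is the part left for stochastic local search.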

In order for this to make sense, there has to be some level of "syntax-semantics correlation" in program space, so that programs that are minor syntactic variants generally tend to be minor semantic variants; and so that the syntactic neighborhood is a reasonable approximation of the semantic neighborhood. It's not clear how well this can be achieved via a well-crafted normal form. If not, then it would have to be achieved dynamically for each program, via reformulating the program into a form providing increased syntax-semantics correlation.
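Syntax-semantics correlation can at least be probed empirically. A toy sketch (my own, with entirely hypothetical choices of program space and distance measures): sample random Boolean expressions, and compare syntactic distance (token-level edit distance) against semantic distance (Hamming distance between truth tables).

```python
import itertools, random

random.seed(1)
OPS = ['and', 'or']
VARS = ['a', 'b', 'c']

def random_expr(depth=2):
    """Random Boolean expression as a nested tuple (a toy program space)."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(VARS)
    return (random.choice(OPS), random_expr(depth - 1), random_expr(depth - 1))

def evaluate(e, env):
    if isinstance(e, str):
        return env[e]
    op, l, r = e
    return (evaluate(l, env) and evaluate(r, env)) if op == 'and' else \
           (evaluate(l, env) or evaluate(r, env))

def truth_table(e):
    # semantics: the expression's output on all 8 input combinations
    return tuple(evaluate(e, dict(zip(VARS, bits)))
                 for bits in itertools.product([False, True], repeat=3))

def tokens(e):
    # syntax: prefix serialisation of the expression tree
    return [e] if isinstance(e, str) else [e[0]] + tokens(e[1]) + tokens(e[2])

def edit_distance(a, b):
    # standard one-row Levenshtein DP over token sequences
    d = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, tb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j-1] + 1, prev + (ta != tb))
    return d[-1]

exprs = [random_expr() for _ in range(40)]
pairs = [(exprs[i], exprs[j])
         for i in range(len(exprs)) for j in range(len(exprs)) if i != j]
syn = [edit_distance(tokens(x), tokens(y)) for x, y in pairs]
sem = [sum(u != v for u, v in zip(truth_table(x), truth_table(y)))
       for x, y in pairs]

# compare mean semantic distance for syntactically close vs far pairs
close = [s for d, s in zip(syn, sem) if d <= 2]
far = [s for d, s in zip(syn, sem) if d > 2]
```

If the mean of `sem` over syntactically close pairs is markedly lower than over far pairs, the representation has usable syntax-semantics correlation; if not, that is evidence that per-program reformulation would be needed.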

Also, the interestingness of this approach depends on how much you value information geometry. The whole point here is to use the natural gradient by casting everything into the language of probability distributions. If you don't care about that, you can just use any old gradient. But you have to admit there's some real elegance in being able to cast any kind of program into a probability distribution, as opposed to handling things on more of a case-by-case basis...