Why is MOSES a good approach to automated program learning? The conceptual argument in favor of MOSES may be broken down into a series of propositions, which are given here both in informal "slogan" form and in semi-formalized "proposition" form.

Note that the arguments given here appear essentially applicable to other MOSES-related algorithms such as Pleasure as well. The page however was originally written in regard to MOSES and hasn't been revised in the light of the creation of Pleasure.

Slogan 1 refers to "ENF", Elegant Normal Form, which is used by MOSES as a standard format for program trees. This is a way that MOSES differs from GP for example: GP does not typically normalize program trees into a standard syntactic format, but leaves trees heterogeneous as to format.

### Proposition: ENF Helps to Guide Syntax-Based Program Space Search

Slogan 1

Iterative optimization is guided based on syntactic distance ==> ENF is good

Proposition 1

On average, over a class C of fitness functions, it is better to do optimization based on a representation in which the (average over all functions in C of the) correlation between syntactic and semantic distance is larger. This should hold for any optimization algorithm which makes a series of guesses, in which the new guesses are chosen from the old ones in a way that is biased to choose new guesses that have small syntactic distance to the old one.

Note that GA, GP, BOA, BOAP and MOSES all fall into the specified category of optimization algorithms

It is not clear what average smoothness condition is useful here. For instance, one could look at the average of d(f(x),f(y))/d(x,y) for d(x,y)<A, where d is syntactic distance and A is chosen so that the optimization algorithm is biased to choose new guesses that have syntactic distance less than A from the old ones.

### Proposition 2: Demes are Useful if Syntax/Semantics Correlations in Program Space Have a Small Scale

This proposition refers to the strategy of using "demes" in MOSES: instead of just evolving one population of program trees, a collection of "demes" are evolved, each one a population of program trees that are all somewhat similar to each other.

Slogan 2 Small-scale syntactic/semantic correlation ==> demes are good [If the maximal syntactic/semantic correlation occurs on a small scale, then multiple demes are useful]

Proposition 2 Let d denote syntactic distance, and d1 denote semantic distance. Suppose that the correlation between d(x,y) and d1(x,y) is much larger for d(x,y)<A than for A<d(x,y)<2A or A<d(x,y), as an average across all fitness functions in class C. Suppose the number of spheres of radius R required to cover the space of all genotypes is n(R). Then using n(R) demes will provide significantly faster optimization than using n(2R) demes or 1 deme. Assume here the same conditions on the optimization algorithm as in Proposition 1.

Proposition 2.1 Consider the class of fitness functions defined by

```Correlation( d(x,y), d1(x,y) || d(x,y) = a ) = b
```

Then, across this class, there is a certain number D of demes that will be optimal on average.... I.e. the optimal number of demes depends on the scale-dependence of the correlation between syntactic & semantic distance....

### Proposition 3: Probabilistic Program Tree Modeling Helps in the Presence of Cross-Modular Dependencies

This proposition refers to the use of BOA-type program tree modeling within MOSES. What it states is that this sort of modeling is useful if the programs in question have significant cross-modular dependences that are not extremely difficult to detect.

Slogan 3 Cross-modular dependencies ==> BOA is good [If the genotypes possess significant internal dependencies that are not concordant with the genotypes' internal modular structure, then BOA-type optimization will significantly outperform GA/GP-type optimization for deme-exemplar extension.]

Proposition 3 Consider the classification problem of distinguishing fit genotypes from less fit genotypes, within a deme. If significantly greater classification accuracy can be obtained by classification rules containing "cross-terms" combining genotype elements that are distant from each other within the genotypes — and these cross-terms are not too large relative to the increase in accuracy they provide — then BOA-type modeling will significantly outperform GA/GP-type optimization.

The catch in Proposition 3 is that the BOA-type modeling must be sophisticated enough to recognize the specific cross-terms involved, of course.

### Proposition 4: Relating ENF to BOA

Now, how does BOA learning relate to ENF?

Proposition 4 ENF decreases, on average, the number and size of cross-terms in the classification rules mentioned in Proposition 3.

### Conclusion Regarding Speculative MOSES Theory

What we see from the above is that:

• ENF is needed in order to make the fitness landscape smoother, but can almost never work perfectly so there will nearly always be some long-distance dependencies left after ENF-ization
• The smoother fitness landscape enabled by ENF, enables optimization using demes and incremental exemplar-expansion to work, assuming the number of demes is chosen intelligently
• Within a deme, optimization via incremental exemplar growth is more efficient using BOA than straight evolutionary methods, due to the ability of BOA to exploit the long-distance dependencies not removed by ENF-ization

These propositions appear to capture the basic conceptual justification for the current MOSES methodology. Of course, proving them will be another story, and will likely involve making the proposition statements significantly more technical and complex.

Another interesting angle on these propositions is to view them as constraints on the problem type to which MOSES may be fruitfully applied. Obviously, no program learning algorithm can outperform random search on random program learning problems. MOSES, like any other algorithm, needs to be applied to problems that match its particular biases. What sorts of problems match MOSES's biases?

In particular, the right question to ask is: Given a particular choice regarding syntactic program representation, what sorts of problems match MOSES's biases as induced by this choice?

If the above propositions are correct, the answer is, basically: Problems for which semantic distance (distance in fitness) is moderately well-correlated with syntactic distance (in the chosen representation) over a short scale but not necessarily over a long scale, and for which a significant percentage of successful programs have a moderate but not huge degree of internal complexity (as measured by internal cross-module dependencies).

Implicit in this is an explanation of why MOSES, on its own, is likely not a good approach to solving extremely large and complex problems. This is because for an extremely large and complex problem, the degree of internal complexity of successful programs will likely be too high for BOA modeling to cope with. So then, in these cases MOSES will effectively operate as a multi-start local search on normalized program trees, which is not a stupid thing, but unlikely to be adequately effective for most large, complex problems.

We see from the above that even in the case of MOSES, which is much simpler than OCP, formulating the appropriate theory adequately is not a simple thing, and proving the relevant propositions may be fairly difficult. However, we can also see from the MOSES example that the creation of a theoretical treatment does have some potential for clarifying the nature of the algorithm and its likely range of applicability.