OpenCogPrime:OccamsRazor

Occam's Razor

This brief note discusses an important cognitive process that fits loosely into the category of "OCP Procedure learning" — it's not actually a procedure learning process, but rather a process that utilizes the fruits of procedure learning.

The well-known "Occam's razor" heuristic says that all else being equal, simpler is better. This notion is embodied mathematically in the Solomonoff-Levin "universal prior," according to which the a priori probability of a computational entity X is defined as a normalized version of:

${\displaystyle m(X)=\sum _{p}2^{-p}}$
m(X) = Sum_p 2^( -1 (p))


where:

• the sum is taken over all programs p that compute X
• l(p) denotes the length of the program p

Normalization is necessary because these values will not automatically sum to 1 over the space of all X.

Without normalization, m is a semimeasure rather than a measure; with normalization it becomes the "Solomonoff-Levin measure" (Zvonkin and Levin, 1970).

Roughly speaking, Solomonoff's induction theorem (Solomonoff, 1964, 1978) shows that, if one is trying to learn the computer program underlying a given set of observed data, and one does Bayesian inference over the set of all programs to try and obtain the answer, then if one uses the universal prior distribution one will arrive at the correct answer.

OCP is not a Solomonoff induction engine. The computational cost of actually applying Solomonoff induction is unrealistically large. However, as we have seen in this chapter, there are aspects of OCP that are reminiscent of Solomonoff induction. In concept-directed schema and predicate learning, in pattern-based predicate learning — and in causal schema learning, we are searching for schemata and predicates that minimize complexity while maximizing some other quality. These processes all implement the Occam's Razor heuristic in a Solomonoffian style.

Now we will introduce one more method of imposing the heuristic of algorithmic simplicity on OCP Atoms (and hence, indirectly, on OCP maps as well). This is simply to give a higher a priori probability to entities that are more simply computable.

For starters, we may increase the node probability of ProcedureNodes proportionately to their simplicity. A reasonable formula here is simply:

2 ^( -r c(P) )


where P is the ProcedureNode and r>0 is a parameter. This means that infinitely complex P have a priori probability zero, whereas an infinitely simple P has an a priori probability 1.

This is not an exact implementation of the Solomonoff-Levin measure, but it's a decent heuristic approximation. It is not pragmatically realistic to sum over the lengths of all programs that do the same thing as a given predicate P. Generally the first term of the Solomonoff-Levin summation is going to dominate the sum anyway, so if the ProcedureNode P is maximally compact, then our simplified formula will be a good approximation of the Solomonoff-Levin summation.

These apriori probabilities may be merged with node probability estimates from other sources, using the revision rule. The question arises how much weight of evidence to assign to the apriori probability as opposed to other sources. This again becomes a system parameter.[BNG1]

A similar strategy may be taken with ConceptNodes. We want to reward a ConceptNode C with a higher apriori probability if C » SatisfyingSet(P) for a simple PredicateNode P. To achieve this formulaically, let sim(X,Y) denote the strength of the SimilarityLink between X and Y, and let:

sim*(C,P) = sim( C, SatisfyingSet(P) )


We may then define the apriori probability of a ConceptNode as:

pr(C) = Sum_P sim*(C,P) 2^(-r c(P))


where the sum goes over all P in the system. In practice of course it's only necessary to compute the terms of the sum corresponding to P so that sim*(C,P) is large.

As with the a priori PredicateNode probabilities discussed above, these apriori ConceptNode probabilities may be merged with other node probability information, using the revision rule, and using a default parameter value for the weight of evidence. There is one pragmatic difference here from the PredicateNode case, though. As the system learns new PredicateNodes, its best estimate of pr(C) may change. Thus it makes sense for the system to store the apriori probabilities of ConceptNodes separately from the node probabilities, so that when the apriori probability is changed, a two step operation can be carried out:

• First, remove the old apriori probability from the node probability estimate, using the reverse of the revision rule
• Then, add in the new apriori probability

Finally, we can take a similar approach to any Atom Y produced by a SchemaNode. We can construct:

pr(Y) = Sum_{S,X} [ s(S,X,Y) 2^(-r (c(S)+c(X))) ]


where the sum goes over all pairs (S,X) so that:

ExecutionLink S X Y


and s(S,X,Y) is the strength of this ExecutionLink. Here, we are rewarding Atoms that are produced by simple schemata based on simple inputs.

The combined result of these heuristics is to cause the system to prefer simpler explanations, analysis, procedures and ideas. But of course this is only an apriori preference, and if more complex entities prove more useful, these will quickly gain greater strength and importance in the system.

Implementationally, these various processes are carried out by the OccamsRazor CIMDynamic. This dynamic selects ConceptNodes based on a combination of:

• importance
• time since the apriori probability was last updated (a long time is preferred)

It selects ExecutionLinks based on importance and based on the amount of time since they were last visited by the OccamsRazor CIMDynamic. And it selects PredicateNodes based on importance, filtering out PredicateNodes it has visited before.