Using Kernel Methods within MOSES

From OpenCog

I've been thinking a bit about how to take a totally different approach to the problem of "modeling the differences between fit and unfit population elements" in algorithms like BOA, MOSES and PLEASURE.

The direction I'm thinking about is based on using kernel methods for supervised classification.

If any eager volunteer has the time, inclination, knowledge and guts to flesh these ideas out and implement them, let me know. I don't have time to play with it myself, but I'd be happy to work closely with anyone who does want to implement it; I'm pretty sure there are some interesting research papers here, and possibly a way to dramatically improve MOSES for automated program learning....

This is a fairly "encapsulated" project, which has meaning on its own as "machine learning" narrow-AI work, and also potentially a lot of value within an AGI context. But this is rather tricky stuff; it certainly won't be an easy task and is not a job for an AI novice.... It's hard-core machine learning R&D to be sure.

At any stage in the evolution of a population according to one of these optimization methods, one has a class of FIT genotypes (say, programs in the case of MOSES), and a class of UNFIT genotypes.

Then, one has the problem of building a model of what distinguishes the FIT class from the UNFIT class.

Then, after that, one has the problem of generating new programs according to the model.

What I'm suggesting is a radically different approach to the model-building phase. I suggest using an SVM or related algorithm to build a model distinguishing the two categories.

I have found some literature that is relevant, basically describing which kernel functions can work in the context of spaces of entities endowed with an edit distance. The first of these papers deals specifically with graph classification using an efficient approximate edit distance.

Edit distance-based kernel functions for structural pattern classification

A Generalized Kernel Approach to Dissimilarity-based Classification

(For SVM geeks: these are funky applications because the kernel functions are neither positive definite nor even conditionally positive definite, but they seem to work in practice anyway, because in practice the positive eigenvalues are the ones that dominate the kernel's behavior... there is some nice math backing this up...)
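
To make the idea concrete, here is a minimal sketch (my own toy example, not from the papers above) of turning an edit distance into a kernel k(x, y) = exp(-gamma * d(x, y)) and training an SVM on the resulting precomputed Gram matrix. Plain string edit distance stands in for a program-tree distance; as noted above, this kernel is generally indefinite but tends to work anyway:

```python
import numpy as np
from sklearn.svm import SVC

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def gram_matrix(xs, ys, gamma=0.5):
    """Kernel matrix k(x, y) = exp(-gamma * edit_distance(x, y))."""
    return np.array([[np.exp(-gamma * edit_distance(x, y)) for y in ys]
                     for x in xs])

# Toy "genotypes": the FIT class uses 'and', the UNFIT class uses 'or'.
fit   = ["and(x y)", "and(x z)", "and(y z)"]
unfit = ["or(x y)", "or(x z)", "or(y z)"]
X = fit + unfit
y = [1] * len(fit) + [0] * len(unfit)

svm = SVC(kernel="precomputed", C=10.0)
svm.fit(gram_matrix(X, X), y)

# Classify a held-out program against the training set's kernel columns.
pred = svm.predict(gram_matrix(["and(z w)"], X))
print(pred)
```

Note that at prediction time the kernel matrix is rectangular: one row per query, one column per training example, which is how scikit-learn's precomputed-kernel interface expects it.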

The edit distance approach is promising because we can estimate edit distance between program trees represented in Combo.
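
As a sketch of what estimating edit distance between program trees might look like (my own crude approximation, not MOSES/Combo code; a proper tree edit distance such as Zhang–Shasha also handles insertions and deletions that re-parent subtrees):

```python
# Toy program trees as nested tuples, e.g. ('and', ('or', 'x', 'y'), 'z').
# approx_tree_edit charges 1 per relabeled node, aligns children positionally,
# and pays the full subtree size for unmatched children -- an upper bound on
# the true tree edit distance, but cheap enough to use inside a kernel.

def tree_size(t):
    return 1 + sum(tree_size(c) for c in t[1:]) if isinstance(t, tuple) else 1

def approx_tree_edit(a, b):
    label = lambda t: t[0] if isinstance(t, tuple) else t
    children = lambda t: list(t[1:]) if isinstance(t, tuple) else []
    ca, cb = children(a), children(b)
    cost = int(label(a) != label(b))           # relabel the root if needed
    cost += sum(approx_tree_edit(x, y) for x, y in zip(ca, cb))
    for extra in ca[len(cb):] + cb[len(ca):]:  # unmatched child: whole subtree
        cost += tree_size(extra)
    return cost

print(approx_tree_edit(('and', 'x', 'y'), ('and', 'x', 'z')))       # 1
print(approx_tree_edit(('and', ('or', 'x', 'y'), 'z'), ('or', 'x')))  # 5
```

The positional child alignment is the main simplification; a real implementation would search over child alignments, at higher cost.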

What I'm not quite sure of is how you'd generate new population elements from these SVM models. Maybe there is some funky math way to do this.

Or, on the other hand, one could generate new population elements via

  • random sampling from the neighborhood of fit programs
  • hillclimbing
  • hBOA

and then use the SVM model for **fitness estimation** and subsequent filtering. This actually seems quite a promising approach to me. I think it's where I would start.

The downside, of course, is that learning SVM classification models is not cheap. You probably wouldn't want to re-run the SVM training after each new fitness evaluation. So what is really called for is incremental SVM learning, as described

which, so far as I know, isn't supported by libsvm or SVMlight, though it's supported in this MATLAB code

and perhaps more usefully in svmheavy (though svmheavy is sparsely documented, and it's not clear how easy it would be to extend it with the odd kernel functions this application would need).

With incremental SVM learning, one would update the SVM model incrementally as new fitness evaluations come in, which would probably make the approach computationally tractable in an evolutionary learning context.
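
Lacking true incremental SVM learning in the common libraries, one cheap approximation is to retrain on only the current support vectors plus each newly evaluated batch. A sketch with scikit-learn (the class, the RBF stand-in kernel, and the toy labels are my choices, not anything from the libraries above):

```python
import numpy as np
from sklearn.svm import SVC

class ApproxIncrementalSVM:
    """Approximate incremental SVM: retain only support vectors between
    batches, so each retraining works on a condensed training set."""

    def __init__(self, n_features=2, **svc_kwargs):
        self.kwargs = svc_kwargs
        self.X = np.empty((0, n_features))
        self.y = np.empty(0, dtype=int)
        self.model = None

    def update(self, X_new, y_new):
        if self.model is not None:
            sv = self.model.support_          # indices of support vectors
            self.X, self.y = self.X[sv], self.y[sv]
        self.X = np.vstack([self.X, X_new])
        self.y = np.concatenate([self.y, y_new])
        if len(set(self.y)) > 1:              # need both classes to fit
            self.model = SVC(**self.kwargs).fit(self.X, self.y)
        return self

rng = np.random.default_rng(0)
inc = ApproxIncrementalSVM(kernel="rbf", C=1.0)
for _ in range(5):                            # batches of fresh evaluations
    X = rng.normal(size=(20, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy "fit vs. unfit" label
    inc.update(X, y)
print(inc.model.score(X, y))
```

This discards non-support-vector history, so it is only an approximation to the exact incremental updates the post calls for, but it keeps per-batch retraining cost roughly proportional to the support-vector count rather than to the total number of fitness evaluations.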