# GSoC 2009 - Extending MOSES to evolve Recurrent Neural Networks

### Abstract

MOSES has outperformed GP on several tasks. Because RNNs are difficult to evolve for many of the reasons that program trees are difficult to evolve, extending MOSES for RNNs may result in an algorithm that is more effective than current GA + NN techniques. Doing so will require extensions to Combo and Reduct that will be applicable to continuous domains in general even if MOSES does not extend well to RNNs as anticipated.

## Current Progress

All dates are for 2009, month/day.

### 8/19

The lru cache is working, and the hardest variant of pole balancing has been ported. I fooled around with reduction some more, trying to make sure everything was functioning correctly, which it seems to be. I then ran 10 trials of MOSES+RNN on 2 pole-balancing without velocity with reduction and without reduction (100000 evaluations). The results were underwhelming: without reduction the average number of time steps before the poles fell was 131, and with reduction the average number of time was 78. So, reducing the search space actually appears to have had a negative impact of performance. Although coding for GSoC has ended, I still want to explore using an alternative optimization routine such as hillclimbing and trying to figure out why the reduction does not seem to be helping search. My final official tasks for GSoC will be to polish documentation and format my source code.

### 8/12

After playing around with population size a bit at Ben's suggestion, it does seem like having a larger population size may help. There is a parameter that modulates the population size as a ratio of the bits of the problem size. After increasing this by a factor of 10, MOSES+RNN did seem to more consistently solve the double pole balancing task, but still needed many more evaluations (on the order of 100,000) than most other methods.

I also have mostly implemented the LRU CACHE...but it was trickier than I first thought. I made a wrapper for Scoring that can be cached, then put the cache in count_based_scorer. However, originally, count_based_scorer is recreated whenever a deme is optimized or an exemplar is expanded. So, I had to make it a member of the metapopulation class. There is still some error I am getting that I have not been able to track down, but I will fix that before going further.

For the last week before I start to get everything polished etc, I will implement the hardest benchmark (2 pole balancing w/o velocity information) and see if MOSES+RNN can solve it and if the reduction measures that have been implemented have helped.

### 8/5

I've added more to ANN reduction. First, all 0'd connections are removed, next all disconnected hidden nodes are removed (and their connections are also removed, which means that potentially yet more nodes can be removed). Finally, the mapping from ANN -> ComboTree utilizes a sorting of the nodes based on summed magnitudes of incoming and outgoing weights.

The exact nature of the way the sorting works is not *exactly* as it had been discussed on the mailing list: Due to the tree nature of the representation, it is not simple to peel off layers and sort in that way. What is done is that the incoming connections of each neuron are sorted by the "summed magnitude" of the source neuron. When an ANN is converted into a Combo Tree, it recurses back from the output neurons, placing each node into the tree in an order specified by the order of the connections. In this way, many different unsorted structures will map to the same combo tree.

Frustratingly, this has not really improved performance with respect to double pole balancing. The next step is to try a different optimization routine, although I am not sure on the status of any others that use contin knobs. There is also talk of creating a cache to avoid ineffeciencies in re-evaluating many individuals. That will be my next step if there are no means to use other optimization routines.

### 7/28

I finished the parameter sweep, with slight success, and started work on ANN reduction rules. The one I have right now is a simple one that removes all 0 weights. Strangely, it seems to hurt performance slightly at the moment, I need to examine why that is (there may be a bug, although it looked like it was working.)

For the parameter sweep, I investigated the following settings with MOSES contin knob building on three tasks, xor, single pole balancing, and double pole balancing. All combinations of the following settings were tried, each combination averaged over 5 runs of 5000 evaluations for a total of 1500 runs: Depth: 2,4,8,16,32 Step Size: 0.25,0.5,0.75,1.0,1.25 Expansion: 0.5,1.0,1.5,2.0

Interestingly, out of the 500 runs for double pole balancing, it was solved exactly once, while performance on xor and single pole balancing was generally fairly good for all of the settings.

The settings I've arrived at are 4 for depth, 1.25 for step size and 1.5 for expansion, based on the following plots:

My next step will be to work further on reduction, make sure it is working properly and to extend it to remove hidden nodes that are no longer connected.

The updated code is available at: https://code.launchpad.net/~lehman-154-gmail/opencog/moses-rnn

### 7/22

I've been doing a few things this week. Firstly, I tried to further tweak to increase performance of MOSES+RNN. This was kind of a wash, by changing the contin knob settings, I was able to get better performance, but still am unable to get the double pole balancing (which is not a super hard problem) to work. I compared by hand the expected output of a small ANN with a recurrent-elmann node in it to the computed value, which matched. This is now implemented as a test case in /tests/comboreduct/combo. So it seems that the ANN is working as planned. I've also implemented most of the framework to do a parameter sweep on the contin_settings just to make sure that in my hand optimization I haven't missed anything.

I did try double pole balancing with hillclimbing as described last week, but found out that continuous variables are not yet implemented (although may be in the near future). So, I will wait on that front. Perhaps the time has come for reduction rules, since I am nearing a completion point on most of the other aspects of the project.

In addition, I've documented most of my code (especially the ANN class), but there is more to be done on that front as well (the ann_scoring functions and the knob_building code).

The updated code is available at: https://code.launchpad.net/~lehman-154-gmail/opencog/moses-rnn

### 7/14

Things are starting to come together. After tracking down a few tricky bugs dealing with the memory nodes, I believe I have evolution of elman-style RNNs pretty functional. In addition, I added a graphviz write-out of a network that helped during debugging.

I've ported two pole-balancing tasks: 1-pole and 2-pole. There is one more difficult pole balancing task (2-pole with reduced number of inputs) but I won't port that until I am able to solve the easier tasks. Moses-RNN can solve the easier 1-pole task, but struggles currently with the 2-pole task.

Next steps are to begin looking more seriously at reduction rules, as well as continuing to polish the contin-knob settings and making sure that the RNNs are functioning as we would like. Also a task I didn't get to from last week was to see if hill-climbing did any better at optimizing than the univariate model. There are clearly many non-linear interactions and dependencies among the ANN weights, so I would imagine a univariate independence assumption might obfuscate good weight settings.

The updated code is available at: https://code.launchpad.net/~lehman-154-gmail/opencog/moses-rnn (the combo files you need to run the ann tasks are in the moses/main directory)

### 7/8

Coding is going well, I am nearing having MOSES evolving RNNs (using the generalized Elman structure) on a simple benchmark task (pole-balancing).

Combo and the ANN class have both been extended to handle "memory nodes" (i.e. in Elman networks, the special input nodes that get a time-delayed signal from a hidden node).

The way that this is represented in combo is that if an input node has a child (ordinarily it wouldn't, an input node does not have connections from other nodes), it is interpreted as a memory node that gets its time-delayed signal from its only child node. For example:

(Ann (Ann_output#1 (Ann_node#2 (Ann_input#3 w1) (Ann_input#4 (Ann_node#2) w2) w3))

This would create the following ANN:

Input#3 -> Node#2 (w1) Input#4 -> Node#2 (w2) Node#2 -> Output (w3)

Where Input#4 would be a memory node getting its input from ann_node#2.

Expansion now works differently. It was hard to do expansion directly on the combo_tree (because it represented a NN in a distributed fashion). So, I created a translator that can change an Ann back into a combo tree. Now, I can expand by mapping the exemplar to a NN, expanding the NN itself (which is intuitive), and then mapping the expanded NN back to combo. With the translator, I can play around with expansions much easier.

Right now, MOSES evolving a simple feedforward network to solve the XOR task does not work too well. This is likely for two reasons: I need to fix the contin knobs so that weights are generated between a more constricted range and with greater precision. Also, the univariate optimization may not be powerful enough to deal with weights which interact in non-linear interdependent ways. I may see if hillclimbing does any better.

6/30 - It was decided that the easiest way to solve the DAG-like issue described in last week's update was to simply give each neuron an id # and have it possibly appear in the combo tree multiple times instead of having hard links for all connections to unique neurons.

Coding is going well, there is finally a working prototype of feed-forward neural network evolution operating on an XOR classification task. This required retooling the ANN vocab from last time, as well as creating all the glue necessary to apply MOSES to a new domain. The last tricky bit that still needs polish is the expansion code.

The current expansion operates sort of like a cascade architecture, in which a single neuron is added that has connections to each of the outputs, and connections from each of the hidden and input nodes. This restricts the topologies of the nets and may not be the most effective way to go about this. However, it is functional at the moment I believe.

So the next step is to polish expansion, and then work on implementing the generalized elman network (which is basically evolving two feed-forward NNs).

Also, it appears the range of continuous parameters needs to be tweaked for NNs, sometimes weights of high magnitude >15 appear, which will completely saturate a node.

The updated code is available at: https://code.launchpad.net/~lehman-154-gmail/opencog/moses-rnn

### 6/23

Updated the code as per Nil's suggestions. Added an XOR-scoring function (typical neuroevolution benchmark), finished the combo -> NN translating function. Added an ann type so that ANN expansion can plug into knob building. The updated code is available at: https://code.launchpad.net/~lehman-154-gmail/opencog/moses-rnn

The current sticky situation is that in order to fully represent ANNs in combo, it was thought that DAG-like trees would have to be expressible in the combo_tree class. Unfortunately, this would require modifying the class as a single node does not have access to a unique list of its children, but infers it through the siblings of its first child until it hits its last child.

One solution is to add id #'s to neuron nodes (kind of like how variables are expressed in combo, #1, #2..etc) so that it is easy to tell which nodes are the same without having many physical links to the same node.

An example would look something like:

(Ann (Ann_output#1 (Ann_node#2 (Ann_input#4 w1) (Ann_node#3 (Ann_input#4 w2) w3 w4))

Which would be translated into an ANN with the following links:

Input -> Node#2 (w1) Input -> Node#3 (w2) Node#2 -> Output (w3) Node#3 -> Output (w4)

The other solution is to implement lists and local variables so that feed-forward NNs are expressible natively in combo:

(this snippet is stolen from Moshe Loooks) Consider a network with inputs A, B, C, an intermediate layer with nodes X, Y, and outputs O, P, where A->X, B->X, B->Y, C->Y, X->O, Y->O, X->P, Y->P. This would be encoded as :

function (a, b, c) let x = sigmoid (w1* a, w2 * b), y = sigmoid (w3 * b, w4 * c) return list(sigmoid(w5 * x, w6 * y), sigmoid(w7 * x, w8 * y))

### 6/16

Coding is progressing on feed-forward ANNs in MOSES. The additional built-ins (ann, ann-node, ann-input) have been added to combo and a test combo tree has been created ann(ann-node(ann-input rand)). A simple ANN class has been created, and a function that translates a codified combo tree into an ANN is coming along.

One problem I was having, that I think is now resolved, was how to identify ann-node or ann-input nodes in the combo tree that were the same. That is, unlike in normal combo trees, there may be many parent nodes that point to the *same* child node. My solution is to tag every node created in the ANN class with the corresponding tree_node pointer from the ann-node vertex in the combo tree. Then we can look up existing nodes by their tags while we traverse the combo tree.

The next step is to finish the combo tree -> ann decoder, which I should have in the next day or so. After that, I'll start to address knob-building and expansion for codified ANNs. A decent expansion I think is to take an existing connection, create a new hidden node that is connected from the source node of the existing connection, and is possibly connected to all downstream neurons.

After that, a simple test-domain sanity check will be constructed before going onto the generalized elman topology.

### 6/9

Ben found a suitable canonical form for RNNs and a way of using GP-structures to represent arbitrary feedforward ANNs (called graph codifying). The details for the canonical form can be found here:http://portal.acm.org/citation.cfm?id=285454 (if you don't have an ACM account, I can email it to you). The details for the graph codifying algorithm can be found here: http://www.waset.org/pwaset/v15/v15-39.pdf Together, it appears these methods suggest a way to evolve an arbitrary RNN using MOSES.

The basic idea behind the canonical form of RNNs is that it is possible to express any RNN using two feed-forward functions: z(n+1) = f1( z(n), u(n) ) y(n+1) = f2( z(n+1) ) Where z is a vector of state variables (i.e. the activation levels of the neurons, u is the input vector, and y is the output vector).

Furthermore, given enough hidden neurons, a feedforward ANN with one hidden layer can approximate any function. So, given two such feedforward ANNs, we can approximate any z(n+1), y(n+1) function, and thus any RNN.

One drawback to this approximation method is that we do not know the structure of the RNN we are approximating -- that is, there is not a clear mapping from an arbitrary z(n+1), y(n+1) function pair to an RNN. All that we can say is that we have equal expressive power to an RNN. This makes comparisons to existing neuroevolution methods less meaningful. Correction: We can interpret the evolved function pair as a generalized elman network, so a comparison may still be possible. (I'll expand this section later)

The basic idea behind the codifying algorithm is that we can express a feed-forward ANN as a DAG that is amenable to the tree-structure of MOSES. This is done by introducing a few new terminals into the MOSES vocabulary. First is the "ANN" node, which is the top-level root node, and it has as many children as there are outputs of the ANN. Each of these children will be "ANN-Node" nodes. Each "ANN-Node" node will have 2*n children, n of which will be to hidden nodes "ANN-Node"s, or input nodes "ANN-Input" nodes. The other n children will be continuous-value constants that indicate the weight of the connections. This process continues until at the end all connections terminate at "ANN-Input" nodes. In this codified graph format, all feedforward ANNs can be represented.

One complication to this method is that as described in the paper it would require DAG-type structures, while MOSES is designed on tree-structures. The structures that graph-codifying describes are "nearly" trees, in that all links progress from the root downwards, although some nodes may share children. This may not prove to be too hard to accomodate. Another approach (that could also be used to evolve arbitrary RNNs), is to replace the ANN-Node and ANN-Input terminals with arguments (#1 - ... #n) where n is the number of neurons in the model. The upside is that it would require little drastic modification to MOSES, although it would complicate reduction. My current thoughts on reduction however, are that it is probably best handled outside of the current reduct engine: symbolic derivatives of an ANN's outputs with respect to the weights would indicate which weights are extraneous and can be removed. This would be done after constructing the ANN from the graph codified representation.

I've been getting more comfortable with the MOSES code, and think I understand most of it now. So, I have sketched out the code changes I will need to make (to evolve arbitrary-topology feed forward networks by the graph codification method), and will start on coding this week.

Proposed Code Changes for Feed-forward ANN MOSES

### 6/2

After talking with Ben & Nil on the dev list, it was decided that the best course of action might be to investigate a subclass of RNNs called Sequential Cascade Networks (SCN), detailed in this paper: http://demo.cs.brandeis.edu/papers/dyrec.pdf

The motivation is that it is not immediately clear that there are any RNN structure preserving rewrite rules that can reduce an RNN to a normal form. This is because basically the structure of an RNN is rigid -- the only real parameters to be tweaked are the connections and their weights.

SCNs have a hierarchical structure that hopefully the hBOA optimization algorithm can take advantage of, and may give insight into the more general RNN case. Basically a SCN consists of two seperate feed-forward networks, a master and a slave network. The slave network calculates a function, while the master network determines the next set of weights for the slave network given as input the last output of the slave network. The two networks are coupled together as a dynamical system inspired by discrete FSMs.

So, the first step along this path is the implementation of a simple feed-forward network in MOSES. Despite reading Moshe's thesis, the documentation included with OpenCog, watching nil's webcast, and reading through the code, I am still overwhelmed by the progam: the learning curve is dramatic, especially since it seems I am trying to do something that the system was not designed for. That is, it is designed for manipulating tree-like structures, not lists or matrices of parameters (weights).

It seems as if I have three options:

1) Simply use the hBOA to evolve the parameters

This would basically follow the contin-max example and would ignore the MOSES algorithm. It might be a decent first step and would be fairly straight-forward, but probably would not extend to anything interesting.

2) Create a new tree type for neural networks, including the corresponding additions to knob-building (only modify the weights), expanding (disable entirely for fixed-topology nets for now), and reduction (also disable for now).

The benefits of this option is that it preserves much of the MOSES algorithm (although much of it is actually disabled), and it is definitely possible to express a single-output perceptron (no hidden nodes) in combo:

(sigmoid (+ (* weight1 #1) (* weight2 #2) (* weight3 #3)))

The drawbacks are that it is not immediately clear how to extend this model to express networks with hidden nodes or multiple outputs (maybe the way I described in the last update). Also, much of what differentiates MOSES from other algorithms is disabled -- it seems like what in effect we are doing is simply iterating hBOA on the weight matrix.

3) Extend moses to work with an alternate representation besides tree-based combo, either a weight matrix or a graph representation (most natural).

This is an interesting idea, and would mean basically specializing MOSES for neural network evolution by operating on a more natural representation for NNs. A graph is the natural way of describing an arbitrary topology NN. However, making these changes might be overly ambitious as a tree type is assumed in all of the MOSES architecture.

Right now I am working on option #2, although I am not convinced it is necessarily the best for working with RNNs.

5/25 - Have finished Moshe's thesis and worked through the MOSES and ComboReduct screencasts. The current emphasis is on extending MOSES to use lists and functions of lists, so that RNNs can be easily expressed.

My current idea of how to express an RNN in combo is as follows: A placeholder progn node is at the top of the program tree, with a branch for each node in the neural network. Each node's branch will look like this:

(Progn (Sigmoid (Fold + (Map * Weights Activations)) (Sigmoid (Fold + (Map * Weights Activations))

So each branch will update one neuron's activation level. Weights will be a vector of continuous numbers, while Activations will be a vector of inputs. There will be one input for each neuron in the neural network which corresponds to that neuron's current activation level.

Basically what we are doing is forcing the more natural weight-matrix representation of an RNN into a program tree. The hope is that we can better use this format to apply reduct rules. However, I am starting to doubt that (non-approximate) reduct rules are going to be much use for this representation.

What possible reductions can there be that maintain the structure of the RNN? Basically the parameters that we are interested in are the weights, and whether two neurons are connected or not. If there was a neuron that had basically no effect on the neural network, we would like to reduce it out, but it seems like for a neuron to have exactly *no* effect, it would have to either have no connections to neurons that influence the output or connections with weight of exactly zero (which is unlikely).

However, it may be the case that approximate-rewrite rules could be somewhat effective. That is, weights near zero might be clipped. However, because of the chaotic mapping between genotype to phenotype in many domains (robot control etc.) it is going to be difficult to tell whether the reduction truly has little effect without evaluating the reduced network. One can get a ballpark idea by using a symbolic (or numeric approximation) of the gradient of the ANN with respect to a certain weight .

So, I am tempted to say that it may be much simpler to bypass the tree representation and just use the weight matrix to represent a RNN. However for now, I will work on integrating lists and map/fold into ComboReduct.