OpenCogPrime:KnowledgeGuidedHierarchicalProcedureDecomposition

Knowledge-Guided Hierarchical Procedure Decomposition

This page describes a specific algorithm we have implemented and tested, in a specialized domain somewhat different from AGI yet not wholly unrelated, which embodies some of the ideas from that section. We then discuss the conceptual implications of these experiments for hierarchical evolutionary learning in OCP.

What we describe here is an approach for using domain knowledge to break the procedure-learning process down into smaller procedure-learning problems that can be composed hierarchically to form overall procedures. Specifically, the idea is to identify certain intermediary variables that are likely to be involved in fit procedures, and then explicitly seek both symbolic regression models that predict the intermediary variables, and procedures that take the intermediary variables as inputs. For instance, in the case study described first below, the procedure-learning problem is a classification problem: determining whether or not an individual has a given disease, based on biological indicators such as gene expression data. The intermediate variables may reflect the extent to which the individual displays various symptoms of that disease, along with other relevant clinical data about the individual. Similar ideas may be used in domains more relevant to OCP AGI, and we will give some hypothetical examples in the simulated-agent-control context.

Formalization of Knowledge-Guided Hierarchical Evolutionary Computing

In order to formulate the relevant technique mathematically we will consider the function learning problem in a reasonably general setting. One has a set of input-output pairs

{(x_i, y_i), i=1,…,n}

where the x_i are drawn from an input space X and the y_i are drawn from an output space Y. One wishes to find a compact and simple function F so that the average of the errors ||y_i - F(x_i)|| is small.

In addition to this standard setup, however, assume that one also has some additional intermediate values

{z_i, i=1,…,n}

In general these values may be taken to lie in some intermediate space Z, which may be different from X or Y. One interesting case is where the z_i are simpler in some sense than the x_i. For instance, the x_i and z_i might both be numerical vectors, but the dimension of x_i might be much higher than that of z_i.

Given the additional intermediate values z_i, one can try to learn a function F that computes the y_i from the x_i via the z_i. That is, one may try to solve the function learning problem of predicting z_i from x_i, and then the problem of predicting y_i from z_i; and then one may chain these predictors together to predict y_i from x_i.

For instance, if each z_i is a numerical vector of length k, then we may say

z_i = (z_i1,…,z_ik)

and we may seek k separate predictors, each one predicting some z_ij from x_i. These k predictors together give a prediction of z_i from x_i, which may be composed with a predictor of y_i from z_i to solve the original problem.
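
To make this composition concrete, here is a minimal Python sketch of the two-stage scheme, using scikit-learn-style estimators as stand-ins for whatever base learners (e.g. genetic-programming-based symbolic regression) one actually prefers; the class and parameter names here are ours, purely for illustration.

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC


class HierarchicalDecomposition:
    """Two-stage learner: k regressors predict the intermediate vector z
    from x, and a single classifier predicts y from z; the two layers
    are composed at prediction time."""

    def __init__(self, lower=None, upper=None):
        # `lower` is a template regressor, cloned once per intermediate
        # variable; `upper` is the top-level classifier.
        self.lower_template = lower if lower is not None else LinearRegression()
        self.upper = upper if upper is not None else SVC()

    def fit(self, X, Z, y):
        # Stage 1: one predictor per intermediate variable z_ij (j = 1..k).
        self.lower_models = []
        for j in range(Z.shape[1]):
            model = clone(self.lower_template)
            model.fit(X, Z[:, j])
            self.lower_models.append(model)
        # Stage 2: the upper layer is trained on the true intermediate values.
        self.upper.fit(Z, y)
        return self

    def predict(self, X):
        # Compose the layers: predict z from x, then y from the predicted z.
        Z_hat = np.column_stack([m.predict(X) for m in self.lower_models])
        return self.upper.predict(Z_hat)

Nothing in this sketch commits one to a particular algorithm on either layer; as discussed below, the two layers may use quite different learners.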

An important special case of this approach (the one used in our current implementation of the technique described in this section) is where the original problem is formulated as classification rather than function learning. In this case the y_i are either 1 or 0, representing positive or negative category membership. But it's important to note that, in this case, the intermediate problems (predicting z_i, or its components z_ij, from x_i) may still be general function learning problems rather than just classification problems.

Obviously, this hierarchical-decomposition trick can only work if the intermediate values are judiciously chosen. Choosing the right hierarchical decomposition for any given problem requires significant domain knowledge. In general this problem may be approached via automated reasoning, but in the application to microarray data classification, the hierarchical decomposition emerged immediately and naturally from the dataset.

Finally, we note that in applications to supervised categorization, this approach can actually use any supervised categorization algorithm on the upper level, taking in the intermediary variables as inputs and making categorical decisions based on them. It's most natural conceptually to use evolutionary computing on this layer as well as on the lower, symbolic regression layer, but in practical classification problems we have found that this pure approach doesn't always give the best results.

Application to Clinical-Data-Guided Microarray Data Classification

As noted above, we have extensively tested the above approach in the context of microarray data classification. While this application is not specifically germane to AGI, we briefly discuss it here as it provides a very concrete illustration of the power of the methodology for hierarchical breakdown of complex procedure learning problems.

The microarray data classification problem fits into the framework of the previous subsection as follows. Consider for instance the case where we're classifying a set of people according to whether they have some disease or not. Then each index i corresponds to a particular person, and the y_i for that person is either 1 or 0 depending on whether or not they have the disease. The x_i for that person is the numerical feature vector derived from their microarray gene expression profile (which, in the most common case, may simply be a list of the normalized gene expression values).

And what are the intermediate values? In some microarray data classification problems there are no good candidates available, so the present approach isn't applicable. In some cases, however, there is a very strong type of intermediary information: clinical data regarding the degree to which the individuals under study display various symptoms associated with the disease in question, as well as additional more basic clinical data regarding age, sex and so forth. In this case, z_ik may represent the degree to which individual i displays symptom k, or more generally the value of clinical feature k for individual i. Note that there are likely to be only a handful to a few dozen symptoms or other clinical features, but thousands to tens of thousands of entries in the microarray-derived feature vector. So in this case the intermediary layer provides a significant amount of compression.

The application, then, is as follows. Suppose that one tries to apply evolutionary learning to find a rule predicting disease diagnosis based on gene expression data, but the classification accuracy doesn't live up to one's desires. Then one can try to learn functions predicting, for each person, the extent to which they'll display each clinical feature based on their gene expression values. And one can try (using evolutionary computing or other means) to learn a separate function predicting, for each person, whether they'll have the disease or not based on their symptoms and other clinical features. One can then compose the disease-based-on-clinical-features predictor with the clinical-features-based-on-expression predictors to obtain an overall predictor of disease based on gene expression. One has thus simplified a difficult learning problem by breaking it down into smaller component learning problems in a hierarchical way, guided by the intuition that symptoms (and, to a lesser extent, other clinical features) are a useful intermediary between the gene expression values and the disease.

We have applied this methodology to analyze a recently obtained dataset regarding a number of individuals with Chronic Fatigue Syndrome (CFS), and a number of corresponding controls. This is a particularly appropriate example, since CFS is a complex disorder that is classically defined as a combination of symptoms. Our first approach to analyzing this dataset involved applying genetic programming and support vector machines to learn classification models, using either direct or extended feature vectors. This yielded reasonable results, but not ideal ones (accuracies in the high 70% range). Applying the hierarchical approach described here to the dataset then yielded clearly superior results, including a 100% accurate classifier. As it turns out, the best results were obtained by applying SVM for the upper-layer classification model learning process, and genetic-programming-based symbolic regression for learning the intermediary clinical features from the microarray data.
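
Purely for illustration, the winning configuration just described (SVM on the upper layer, symbolic regression on the lower layer) could be instantiated under the HierarchicalDecomposition sketch given earlier roughly as follows. LinearRegression stands in for the genetic-programming symbolic regression actually used, and the data are synthetic stand-ins, not the real CFS dataset.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

# Synthetic stand-in data: 100 subjects, 1000 gene-expression features,
# 20 clinical features (symptom severities etc.), binary CFS/control label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))                    # expression profiles
Z = X[:, :20] + 0.1 * rng.normal(size=(100, 20))    # clinical features
y = (Z.mean(axis=1) > 0).astype(int)                # diagnosis labels

# SVM on the upper (clinical-features -> diagnosis) layer, a regressor
# per clinical feature on the lower (expression -> clinical) layer.
model = HierarchicalDecomposition(lower=LinearRegression(),
                                  upper=SVC(kernel='linear'))
model.fit(X, Z, y)
print(model.predict(X[:5]))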

Constructing Hierarchical Programs for Controlling Actions in a Simulated World

Microarray data classification is a nice example of the power of hierarchical decomposition, but it's substantially simpler than most of the procedure learning problems faced by OCP. In this section we'll discuss a more difficult case: learning programs for controlling actions in a simulated world. As we'll see, the same basic approach as described above can be used in this situation. But things are more complicated because the intermediate elements (the analogues of the symptoms in the biomedical application) themselves contain variables.

Suppose that, instead of trying to learn a model to predict CFS, one is trying to learn a schema such that, if it is executed, some real-world occurrence is predicted to come to pass. For instance, the occurrence might be that the red block on the table in the agent's current room is moved into the box in the following room. In this case, the role of the microarray data in the CFS problem is played by the set of near-term perception and action descriptions available to the system (things like "I see the table within 5 units of my body" or "I am moving my arm down"). One wants to find a sequential combination of near-term perception and action descriptions so that, if this combination is true, the goal is predicted to be achieved; just as, in the CFS example, one wants to find a combination of gene expression values so that, if this combination is present, the person is predicted to have CFS. The main difference is the temporal aspect of the simulated-world problem; but it's worth noting that this aspect can be present in the biology domain as well, for instance if one is dealing with microarray time series data rather than static microarray data.

Here the analogue of the symptoms is the library of mid-level perception and action descriptions that the system has available, and that the system judges to be potentially useful in the context of achieving the overall goal.

The judgment of usefulness can be made in terms of probabilities calculated from the system's history. For instance, since in this case the goal explicitly involves some object beginning in one place and winding up in a different place, simple attention allocation dynamics indicates that schemata for Go to X should be relevant. This attention allocation works via simple probability calculation: in the past, when achieving goals involving moving things from one place to another (including moving oneself from one place to another), subroutines involving Go to X have been commonly used. Therefore, when looking at another goal involving moving, the probability that subroutines of the form Go to X will be useful is high. Similarly, since the goal involves objects, there is a decent probability that schemata involving picking up objects will be useful.
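
A minimal sketch of this kind of history-based probability calculation, assuming the system keeps counts of which subroutine families appeared in previously successful plans for each category of goal (all of the names and numbers here are hypothetical):

from collections import defaultdict

class SubroutineUsefulness:
    """History-based estimate of P(subroutine family is useful | goal category)."""

    def __init__(self):
        self.used = defaultdict(int)    # (goal_category, family) -> successful plans using it
        self.total = defaultdict(int)   # goal_category -> successful plans recorded

    def record_success(self, goal_category, families_used):
        self.total[goal_category] += 1
        for family in set(families_used):
            self.used[(goal_category, family)] += 1

    def probability(self, goal_category, family):
        n = self.total[goal_category]
        # Laplace smoothing, so unseen families still receive a small prior.
        return (self.used[(goal_category, family)] + 1) / (n + 2)

history = SubroutineUsefulness()
history.record_success("moving", ["GoTo", "PickUp", "PutDown"])   # moved an object
history.record_success("moving", ["GoTo", "PutDown"])             # moved an object
history.record_success("moving", ["GoTo"])                        # moved itself

# Facing a new goal that involves moving something, GoTo scores high
# and an unrelated family scores low:
print(history.probability("moving", "GoTo"))    # 0.8
print(history.probability("moving", "Dance"))   # 0.2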

Once a vocabulary of potentially useful subroutines has been assembled, then the learning process can begin. Note that the learning process is not restricted to the subroutines assembled in advance as potentially promising. It is simply biased in their direction.

The analogue of the CFS strategy, in this context, is to follow a two-stage learning process.

The first stage is to find a schema that, according to reasoning based on experience and logic, will achieve the goal if executed. But this schema is expressed, not in terms of detailed motor actions and perceptions, but in terms of pseudocode whose entries are schemata such as "go to the door" and "pick up the box", i.e. subroutines that look roughly like

atTime
    ExecutionLink GoTo door
    T

and

atTime
    ExecutionLink PickUp box
    T1

Here T and T1 are internal VariableNodes referring to time points within the execution of the overall schema; and door and box are internal VariableNodes referring to the door to the other room involved in the goal, and the box involved in the goal.

The trick is that, at the time this higher-level pseudocode schema is learned, the system doesn't really know how it's going to execute GoTo as applied to door. GoTo and PickUp are, here, assumed to be ungrounded SchemaNodes, each one linked to a number of grounded SchemaNodes. If there are no obstacles between the agent and the door at the time the subroutine GoTo door is encountered, then the system will be able to use an existing grounded SchemaNode linked to the ungrounded GoTo SchemaNode. On the other hand, if there are complex obstacles in the way, and the system has never had to deal with navigating through a field of complex objects before, then the system will have to learn a new grounding for GoTo in the context of this particular goal-achieving schema in which GoTo appears. Note, however, that this learning problem (the second stage of the original learning problem) is considerably smaller than the problem of learning a schema to carry out the overall goal G, just as the problem of learning the overall pseudocode program is smaller than that original problem.
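
The following rough Python sketch indicates the kind of dispatch involved; it is not the actual SchemaNode machinery, and learn_new_grounding is a placeholder for the smaller, context-specific learning run described above:

class UngroundedSchema:
    """An abstract schema (e.g. GoTo) linked to a set of concrete groundings,
    each paired with a test of when it is applicable."""

    def __init__(self, name):
        self.name = name
        self.groundings = []   # list of (applicability_test, procedure) pairs

    def resolve(self, context, learn_new_grounding):
        # If an existing grounding applies in the current context (e.g. no
        # obstacles between the agent and the door), just use it.
        for applicable, procedure in self.groundings:
            if applicable(context):
                return procedure
        # Otherwise, launch the smaller second-stage learning problem for
        # this context, and remember the result for future use.
        applicable, procedure = learn_new_grounding(self.name, context)
        self.groundings.append((applicable, procedure))
        return procedure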

The schema carrying out this particular goal, in informal rather than mathematical form, would look something like:

PredictiveImplication
    AND_sequential
        Until see door straight ahead
        Rotate left
        Until distance to door < 3 units
        Move forward
    Achieve goal of finding door

On the other hand, for a more complex task such as [G = move the red block from the table in its current room to the box in the next room], the pseudocode would look more like

PredictiveImplication
    AND_sequential
        If possible, locate red block on table $T
            (using memory of where table $T is located if available, otherwise by scanning the room)
        If not possible, locate table $T
        Go to table $T
        If red block was not located on table $T before, locate red block on table $T
        Move as close as possible to red block
        Pick up red block
        Locate door to next room (again, using memory if available)
        Go to door
        Go through door slightly
        Look for box
        Go to box
        Put down block in box
    Achieve G

Note that this simple pseudocode only works if there's only one table in the room, only one door to another room, and if the box has a big opening that covers the whole top of the box. The knowledge representation is supposed to be flexible enough that generalizing the schema to deal with situations that break these assumptions can be done by localized changes.

This pseudocode schema, fully expanded, will involve about 150-200 links, if implemented with appropriate abstractions but without obsessive attention to compactness. As an example of abstraction: Go to door and Go to box shouldn't be implemented as totally separate subprograms; rather there should be a subprogram Go to X and both of the special cases should be implemented in terms of this.
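
In ordinary programming terms, the abstraction amounts to something like the following (a hypothetical sketch; the agent interface shown here is invented for illustration):

def go_to(agent, target, stop_distance=3):
    """Shared navigation subprogram: rotate until the target is straight
    ahead, then move forward until within stop_distance of it (mirroring
    the 'Until ... Rotate left ... Move forward' pseudocode above)."""
    while not agent.sees_straight_ahead(target):
        agent.rotate_left()
    while agent.distance_to(target) >= stop_distance:
        agent.move_forward()

# "Go to door" and "Go to box" then become two calls to the same schema:
#   go_to(agent, door)
#   go_to(agent, box)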

The higher-level learning process — the one that learns the pseudocode program — can be carried out by MOSES learning on Combo trees. However, in this case (unlike the CFS case) the fitness function for this evolutionary learning process must be inferential, since at this stage the system doesn't know exactly how the subcomponents will be carried out. The lower-level learning processes may be carried out by evolutionary learning as well, and the fitness evaluation will be inferential sometimes and direct-evaluational other times, depending on the context.
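
Schematically, the difference between the two levels' fitness evaluation might be expressed as follows (an illustrative sketch only; the world_model methods are placeholders for OCP's inference and simulation machinery):

def schema_fitness(schema, level, world_model):
    """Fitness for the two-level learning process described above."""
    if level == "upper":
        # The subcomponents are not yet grounded, so fitness can only be
        # estimated by inference over past experience and background knowledge.
        return world_model.infer_success_probability(schema)
    else:
        # A lower-level candidate can often be scored directly, e.g. by
        # executing it in the simulated world and measuring the outcome.
        return world_model.execute_and_score(schema)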

In some cases, of course, one may have multiple levels of pseudocode, with several levels of abstraction before one gets down to the level of particular perceptions and actions.

Although the discussion has been somewhat abstract, we hope the basic concept of the approach is clear. One learns hierarchical schemata by design, using inferential fitness evaluation to learn the higher levels before the lower levels have been filled in. This is a fairly simple cognitive schema, but we believe it's an essential one.