Learning Schema Maps
(aka, Reinforcement Learning via HebbianLinks and PredictiveImplicationLinks)
Finally it is time to plunge into the issue of procedure maps — schema maps in particular. A schema map is a simple yet subtle thing — a subnetwork of the AtomSpace consisting of SchemaNodes, computing some useful quantity or carrying out some useful process in a cooperative way. The general purpose of schema maps is to allow schema execution to interact with other mental processes in a more flexible way than is allowed by compact Combo trees with internal hooks into the AtomSpace. Sometimes this isn't necessary — sometimes a procedure can just be executed with no feedback from other aspect of the mind — and in these cases a grounded SchemaNode, fully encapsulating the procedure, is the most efficient solution. And sometimes interaction with the rest of the mind is necessary, but can be captured by an appropriate ProcedureEvaluator object. But sometimes execution needs to be even more highly interactive, mediated by attention allocation and other OCP dynamics in a flexible way, and in those cases, one needs a network of ProcedureNodes rather than a single encapsulated ProcedureNode.
But how can schema maps be learned? The basic idea is very simple, and unoriginal: reinforcement learning. Going back at least to Pavlov, the concept couldn't be more natural. In a goal-directed system consisting of interconnected, cooperative elements, you reinforce those connections and/or those elements that have been helpful for achieving goals, and weaken those connections that haven't. Thus, over time, you obtain a network of elements that achieves goals effectively.
In spite of the intuitive appeal of the idea, the long history of reinforcement learning is mostly not a good one. Reinforcement learning has never yielded really good performance on realistic-scale problems, in spite of significant efforts. Of course, few AI methods have ever fulfilled their creators' dreams (yet!), but reinforcement learning has fallen further.
The central difficulty in all reinforcement learning approaches is the 'assignment of credit' problem. If a component of a system has been directly useful for achieving a goal, then rewarding it is easy. But if the relevance of a component to a goal is indirect, then things aren't so simple. Measuring indirect usefulness in a large, richly connected system is difficult — inaccuracies creep into the process easily, and this, in a nutshell, is why reinforcement learning methods seem to work poorly.
Perhaps the most sophisticated approach to reinforcement learning in any existing AI system is found in John Holland's classifier systems (1992). Classifier systems contain sets of rules, where successful rules are reinforced via a mechanism called 'the bucket brigade' (actually, the first classifier systems used a more complex mechanism, but the bucket brigade approach is the standard in this field now). The bucket brigade algorithm has been shown to be formally equivalent to a variant of reinforcement learning called Q-learning (Sutton and Barto, 1998). Interesting though classifier systems are, though, they have never proved useful in practical applications (unlike Holland's earlier invention, the genetic algorithm). They are phenomenally hard to tune, and don't scale well in the face of complex situations. Baum's Hayek system — a modification of classifier systems based on principles from theoretical economics — removes the major problems with the bucket brigade in theory and in some simple examples, but operates very slowly and appears to have difficulty scaling beyond relatively small 'toy problems.'
There has also been some recent research regarding the use of Bayesian probability theory to carry out reinforcement learning (Sutton and Barto, 1998). However, the work in this vein has suffered from the lack of a probabilistic inference approach capable of dealing with complex networks of events. Complex statistical formulas are proposed that nevertheless only work in special cases far simpler than those confronting actual embodied minds.
[NOTE: the prior paragraphs constitute a brief and inadequate review and could certainly be expanded into a whole page giving an AGI-focused review of reinforcement learning approaches — Ben G]
We believe that the OCP framework provides the 'support mechanisms and structures' that reinforcement learning needs in order to manifest its potential. We believe that the use of HebbianLinks and PredictiveImplicationLinks for reinforcement learning is much more powerful than approaches that have been taken in the past, precisely because explicit reinforcement learning is not the only process acting on these Links.
Earlier we reviewed the semantics of HebbianLinks, and discussed two methods for forming HebbianLinks:
- Updating HebbianLink strengths via mining of the System Activity Table
- Logical inference on HebbianLinks, which may also incorporate the use of inference to combine HebbianLinks with other logical links (for instance, in the reinforcement learning context, PredictiveImplicationLinks)
We will now describe how HebbianLinks, formed and manipulated in this manner, may play a key role in goal-driven reinforcement learning. In effect, what we will describe is an implicit integration of the bucket brigade with PLN inference. The addition of robust probabilistic inference adds a new kind of depth and precision to the reinforcement learning process.
Goal Nodes have an important ability to stimulate a lot of SchemaNode execution activity. If a goal needs to be fulfilled, it stimulates schemata that are known to make this happen. But how is it known which schemata tend to fulfill a given goal? A link:
PredictiveImplicationLink S G
means that after schema S has been executed, goal G tends to be fulfilled. If these links between goals and goal-valuable schemata exist, then activation spreading from goals can serve the purpose of causing goal-useful schemata to become active.
The trick, then, is to use HebbianLinks and inference thereon to implicitly guess PredictiveImplicationLinks. A HebbianLink between S1 and S says that when thinking about S1 was useful in the past, thinking about S was also often useful. This suggests that if doing S achieves goal G, maybe doing S1 is also a good idea. The system may then try to find (by direct lookup or reasoning) whether, in the current context, there is a PredictiveImplication joining S1 to S. In this way Hebbian reinforcement learning is being used as an inference control mechanism to aid in the construction of a goal-directed chain of PredictiveImplicationLinks, which may then be schematized into a contextually useful procedure.
Note finally that this process feeds back into itself in an interesting way, via contributing to ongoing HebbianLink formation. Along the way, while leading to the on-the-fly construction of context-appropriate procedures that achieve goals, it also reinforces the HebbianLinks that hold together schema maps, sculpting new schema maps out of the existing field of interlinked SchemaNodes.
Goal-Directed Compound Schema Evolution
Finally, as a complement to goal-driven reinforcement learning, there is also a process of goal-directed compound SchemaNode evolution. This combines features of the goal-driven reinforcement learning and concept-driven schema evolution methods discussed above. Here we use a Goal Node to provide the fitness function for schema evolution.
The basic idea is that the fitness of a schema is defined by the degree to which enactment of that schema causes fulfillment of the goal. This requires the introduction of CausalImplicationLinks, as defined in Probabilistic Logic Networks. In the simplest case, a CausalImplicationLink is simply a PredictiveImplicationLink.
One relatively simple implementation of the idea is as follows. Suppose we have a Goal Node G, whose satisfaction we desire to have achieved by time T1. Suppose we want to find a SchemaNode S whose execution at time T2 will cause G to be achieved. We may define a fitness function for evaluating candidate S by:
f(S,G,T1,T2) = w * r(S,P,T1,T2) + (1-w) size(S) 0 < w < 1 a weight, e.g. w=.5 r(S,P,T1,T2) = GetStrength CausalImplicationLink EvaluationLink AtTime T1 ExecutionLink S * * EvaluationLink AtTime (T2, G)
Another variant specifies only a relative time lag, not two absolute times.
f(S,G,T) = w * v(S,P,T) + (1-w) size(S) 0 < w < 1 a weight, e.g. w=.5 v(S,P,T) = AND NonEmpty SatisfyingSet r(S,P,T1,T2) T1 > T2 - T
Using evolutionary learning to find schemata fulfilling these fitness functions, results in compound SchemaNodes whose execution is expected to cause the achievement of given goals. This is a complementary approach to reinforcement-learning based schema learning, and to schema learning based on PredicateNode concept creation. The strengths and weaknesses of these different approaches need to be extensively experimentally explored. However, prior experience with the learning algorithms involved gives us some guidance.
We know that when absolutely nothing is known about an objective function, evolutionary programming is often the best way to proceed. Even when there is knowledge about an objective function, the evolution process can take it into account, because the fitness functions involve logical links, and the evaluation of these logical links may involve inference operations.
On the other hand, when there's a lot of relevant knowledge embodied in previously executed procedures, using logical reasoning to guide new procedure creation can be cumbersome. The Hebbian mechanisms used in reinforcement learning may be understood as inferential in their conceptual foundations (since a HebbianLink is equivalent to an ImplicationLink between two propositions about importance levels). But in practice they provide a much-streamlined approach to bringing knowledge implicit in existing procedures to bear on the creation of new procedures. Reinforcement learning, we believe, will excel at combining existing procedures to form new ones, and modifying existing procedures to work well in new contexts. Logical inference can also help here, acting in cooperation with reinforcement learning. But when the system has no clue how a certain goal might be fulfilled, evolutionary schema learning provides a relatively time-efficient way for it to find something minimally workable.
Pragmatically, the GoalDrivenSchemaLearning CIM-Dynamic handles this aspect of the system's operations. It selects Goal Nodes with probability proportional to importance, and then spawns problems for the Evolutionary Optimization Unit Group accordingly. For a given Goal Node, PLN control mechanisms are used to study its properties and select between the above objective functions to use, on an heuristic basis.