FISHGRAM (Frequent Interesting SubHyperGRAph Mining)
Initial version by Jade O'Neill, July 16 2012
Fishgram is an efficient algorithm for finding patterns in OpenCog knowledge. It represents patterns as a conjunction (AndLink) of OpenCog Links, which usually contain variables. It does a greedy search, so it can quickly find many patterns. In contrast algorithms like MOSES are designed to find a small number of the best patterns. Fishgram works by finding a set of objects that have links in common, so it will be most effective if the AtomSpace has a lot of raw data, with simple patterns. For example, it can be used on the perceptions from the virtual world. There are predicates for basic perceptions (e.g. what kind of object something is, objects being near each other, types of blocks, and actions being performed by the user or the AI).
Here is some example output from Fishgram, when run on the virtual agent's memories.
(AndLink (EvaluationLink is_edible:PredicateNode (ListLink $1000041)) (InheritanceLink $1000041 Battery:ConceptNode) )
This means a battery which can be “eaten” by the virtual robot. The variable $1000041 refers to the object (battery).
Fishgram can also find patterns containing a sequence of events. In this case, there is a list of EvaluationLinks or InheritanceLinks which describe the objects involved, followed by the sequence of events.
(AndLink (InheritanceLink $1007703 Battery:ConceptNode) (SequentialAndLink (EvaluationLink isHolding:PredicateNode (ListLink $1008725 $1007703))) ) )
This means the agent was holding a battery. $1007703 is the battery, and there is also a variable for the agent itself ($1008725). Many interesting patterns involve more than one object. This pattern would also include the user (or another AI) holding a battery, because the pattern does not refer to the AI character specifically.
It can find patterns where it performs an action and achieves a goal. There is code to create implications based on these conjunctions (in Fishgram.implications(). Fishgram.output_causal_implications_for_last_layer() outputs causal patterns using a postprocessing system, which uses a conjunction to create a nested structure of ImplicationLinks, PredictiveImplicationLinks and SequentialAndLinks). After finding many conjunctions, it can produce ImplicationLinks based on some of them. Here is an example where the AI-controlled virtual robot discovers how to get energy.
(ImplicationLink (AndLink (EvaluationLink is_edible:PredicateNode (ListLink $1011619)) (InheritanceLink $1011619 Battery:ConceptNode) ) (PredictiveImplicationLink (EvaluationLink actionDone:PredicateNode (ListLink (ExecutionLink eat:GroundedSchemaNode (ListLink $1011619)))) (EvaluationLink increased:PredicateNode (ListLink (EvaluationLink EnergyDemandGoal:PredicateNode))) ) )
initial layer = every pair (relation, binding) while previous layer is not empty: foreach (conjunction, binding) in previous layer: let incoming = all (relation, binding) pairs containing an "distinguished entity" in the conjunction let possible_next_events = all (event, binding) pairs where the event happens during or shortly after the last event in conjunction foreach (relation, relation_binding) in incoming and possible_next_events: (new_relation, new_conjunction_binding) = map_to_existing_variables(conjunction, binding, relation, relation_binding) if new_relation is already in conjunction, skip it new_conjunction = conjunction + new_relation if new_conjunction has been found already, skip it otherwise, add (new_conjunction, new_conjunction_binding) to the current layer map_to_existing_variables(conjunction, conjunction_binding, relation, relation_binding) r', s' = a copy of the relation and binding using new variables foreach variable v, object o in relation_binding: foreach variable v2, object o2 in conjunction_binding: if o == o2: change r' and s' to use v2 instead of v return r',s
There are several preprocessing steps to make it easier for the main Fishgram search to find patterns. (They are performed by the ForestExtractor class.) There is a list of things that have to be variables (ForestExtractor.is_object). For example, any predicate that refers to object (including agents) will be given a variable so it can refer to any object. Other predicates or InheritanceLinks can be added to a pattern, to restrict it to specific kinds of objects, as shown above. So there is a step which goes through all of the links in the AtomSpace, and records a list of predicates with variables. Such as “X is red” or “X eats Y”. This makes the search part simpler, because it never has to decide whether something should be a variable or a specific object.
There is also a filter system, so that things which seem irrelevant can be excluded from the search (ForestExtractor.unwanted_atoms and others). There is a combinatorial explosion as patterns become larger. Some predicates may be redundant with each other, or known not to be very useful. It can also try to find only patterns in the AI's “attentional focus,” which is much smaller than the whole AtomSpace.
The Fishgram algorithm cannot currently handle patterns involving numbers, although it could be extended to do so. The two options would be to either have a separate discretization step, creating predicates for different ranges of a value. Or alternatively, have predicates for mathematical operators. It would be possible to search for a “splitpoint” like in decision trees. So a number would be chosen, and only things above that value (or only things below that value) would count for a pattern. It would also be possible to have multiple numbers in a pattern, and compare them in various ways. It is uncertain how practical this would be in Fishgram. MOSES is good for finding numeric patterns, so it may be better to simply use those patterns inside Fishgram.
The “increased” predicate is added by a preprocessing step (notice_changes()). The goals have a fuzzy TruthValue representing how well the goal is achieved at any point in time, so the EnergyDemandGoal represents how much energy the virtual robot has at some point in time. The predicate records times that a goal's TruthValue increased. This only happens immediately after doing something to increase it, which helps avoid finding spurious patterns.
The search process
The fishgram search is breadth-first. (The main method is Fishgram.run() It starts with all predicates (or InheritanceLinks) found by the preprocessing step. Then it finds pairs of predicates involving the same variable. Then they are extended to conjunctions of three predicates, and so on. Many relations apply at a specific time, for example the agent being near an object, or an action being performed. These are included in a sequence, and are added in the order they occurred.
Fishgram remembers the examples for each pattern. If there is only one variable in the pattern, an example is a single object; otherwise each example is a vector of objects for each variable in the pattern. Each time a relation is added to a pattern, if it has no new variables, some of the examples may be removed, because they don't satisfy the new predicate. It needs to have at least one variable in common with the previous relations. Otherwise the patterns would combine many unrelated things.
In frequent itemset mining (for example the APRIORI algorithm), there is effectively one variable, and adding a new predicate will often decrease the number of items that match. It can never increase it. The number of possible conjunctions increases with the length, up to some point, after which it decreases. But when mining for patterns with multiple objects there is a much larger combinatorial explosion of patterns. Various criteria can be used to prune the search.
The most basic criterion is the frequency. Only patterns with at least N examples will be included, where N is an arbitrary constant. You can also set a maximum number of patterns allowed for each length (number of relations), and only include the best ones. The next level of the breadth-first search will only search for extensions of those patterns.
It would be more interesting to use a measure of statistical interestingness, to make sure the relations in a pattern are correlated with each other. There are many spurious frequent patterns, because anything which is frequent will occur together with other things, whether they are relevant or not. For example “breathing while typing” is a frequent pattern, because people breathe at all times. But “moving your hands while typing” is a much more interesting pattern. As people only move their hands some of the time, a measure of correlation would prefer the second pattern. The best measure may be interaction information, which is a generalisation of mutual information that applies to patterns with more than two predicates. An early-stage AI would not have much knowledge of cause and effect, so it would rely on statistical measures to find useful patterns.
Comparison to other algorithms
Fishgram is more suitable for OpenCog's purposes than existing graph mining algorithms, such as GSpan. Most other graph mining algorithms were designed with molecular datasets in mind. The OpenCog AtomSpace is a different graph in various ways. For one, there are many possible relations between nodes (much like in a semantic network). Many relations involve more than two objects, and there are also properties – predicates about a single object. So the relations are effectively directed links of varying arity. It also has events, and many states can change over time (e.g. an egg changes state while it's cooking). Fishgram is designed for general knowledge in an embodied agent.
There are other major differences. Fishgram uses a breadth-first search, rather than depth-first search. And it does an “embedding-based” search, searching for patterns that can be embedded multiple times in a large graph. Molecular datasets have many separate graphs for separate molecules, but the embodied perceptions are closer to a single, fairly well-connected graph. Depth-first search would be very slow on such a graph, as there are many very long paths through the graph, and the search would mostly find those. Whereas the useful patterns tend to be compact and repeated many times.
Lastly the design of Fishgram makes it easy to experiment with multiple different scoring functions, from simple ones like frequency to much more sophisticated functions such as interaction information.