# OpenCogPrime:MapMining

## Contents

### Mining the AtomSpace for Maps

Searching for general maps in a complex AtomSpace is an unrealistically difficult problem, as the search space is huge. So, the bulk of map-mining activity involves looking for the most simple and obvious sorts of maps. A certain amount of resources may also be allocated to looking for subtler maps using more resource-intensive methods.

The following categories of maps can be searched for at relatively low cost:

- Static maps
- Temporal motif maps

Conceptually, a static map is simply a set of Atoms that all tend to be active at the same time.

Next, by a "temporal motif map" we mean a set of pairs:

(Ai,ti)

of the type:

(Atom, int)

so that for many activation cycle indices T, Ai is highly active at some time very close to index T + ti.

The reason both static maps and temporal motif maps are easy to recognize is that they are both simply repeated patterns.

Perceptual context formation involves a special case of static and temporal motif mining. In perceptual context formation, one specifically wishes to mine maps involving perceptual nodes associated with a single interaction channel. These maps then represent real-world contexts, that may be useful in guiding real-world-oriented goal activity (via schema-context-goal triads).

We have three approaches for mining static and temporal motif maps from AtomSpaces:

- Frequent subgraph mining, frequent itemset mining, or other sorts of datamining on Activity Tables
- Clustering on the network of HebbianLinks
- Evolutionary Optimization based datamining on Activity Tables

The first two approaches are significantly more time-efficient than the latter, but also significantly more limited in the scope of patterns they can find.

Any of these approaches can be used to look for maps subject to several types of constraints:

- Unconstrained: maps may contain any kinds of Atoms
- Strictly constrained: maps may only contain Atom types contained on a certain list
- Probabilistically constrained: maps must contain Atom types contained on a certain list, as x% of their elements
- Trigger-constrained: the map must contain an Atom whose type is on a certain list, as its most active element

Different sorts of constraints will lead to different sorts of maps, of course. We don't know at this stage which sorts of constraints will yield the best results. Some special cases, however, are reasonably well understood. For instance:

- procedure encapsulation, to be discussed below, involves searching for (strictly-constrained) maps consisting solely of ProcedureInstanceNodes.
- to enhance goal achievement, it is likely useful to search for trigger-constrained maps triggered by Goal Nodes.

What the MapEncapsulation CIM-Dynamic does once it finds a map, is dependent upon the type of map it's found.

In the special case of procedure encapsulation, it creates a compound ProcedureNode (selecting SchemaNode or PredicateNode based on whether the output is a TruthValue or not).

For static maps, it creates a ConceptNode, which links to all members of the map with InheritanceLinks, the weight of which is determined by the degree of map membership.

### Map Encapsulation Using Frequent Itemset or Subgraph Mining

In principle one could use evolutionary learning to do all map encapsulation, but this isn't computationally feasible — it would limit too severely the amount of map encapsulation that could be done. Instead, evolutionary learning must be supplemented by some more rapid, less expensive technique.

### Frequent Itemset Mining for Map Mining

One class of technique that is useful here is *frequent itemset mining* (FIM), a process that looks to find all frequent combinations of items occurring in a set of data. There are a number of algorithms in this category, the classic is Apriori (Agrawal and Srikant, 1994) but we have recently been experimenting with an alternative called relim (Borgelt, 2005) which is conceptually similar but seems to give better performance.

Another useful class of algorithms is greedy or stochastic itemset mining, which does roughly the same thing as FIM but without being completely exhaustive (the advantage being greater execution speed). Here we will discuss FIM, but the basic concepts are the same if one is doing greedy or stochastic mining instead.

The basic goal of frequent itemset mining is to discover frequent subsets in a group of items. One knows that for a set of N items, there are 2N-1 possible subgroups. To avoid the exponential explosion of subsets, one may compute the frequent *itemsets* in several rounds. Round i computes all frequent i-itemsets.

A round has two steps: candidate generation and candidate counting. In the candidate generation step, the algorithm generates a set of candidate i-itemsets whose support — a minimum percentage of events in which the item must appear — has not been yet been computed. In the candidate-counting step, the algorithm scans its memory database, counting the support of the candidate itemsets. After the scan, the algorithm discards candidates with support lower than the specified minimum (an algorithm parameter) and retains only the frequent i-itemsets. The algorithm reduces the number of tested subsets by pruning apriori those candidate itemsets that cannot be frequent, based on the knowledge about infrequent itemsets obtained from previous rounds. Although the worst case of this sort of algorithm is exponential, practical executions are generally fast, depending essentially on the support limit.

To apply this kind of approach to search for static maps, one simply creates a large set of sets of Atoms — one set for each time-point. In the set S(t) corresponding to time t, we place all Atoms that were firing activation at time t. The itemset miner then searches for sets of Atoms that are subsets of many different S(t) corresponding to many different times t. These are Atom sets that are frequently co-active.

Table 1 (see end of page for attachment) presents a typical example of data prepared for frequent itemset mining, in the context of context formation via static-map recognition. Columns represent interaction-channel-important nodes and rows indicate time slices. For simplicity, we have *thresholded* the values and show only activity values; so that a 1 in a cell indicates that the Atom indicated by the column was being utilized at the time indicated by the row.

In the example, if we assume minimum support as 50 percent, the context nodes C1 = {Q, R}, and C2 = {Q, T, U} would be created.

Using frequent itemset mining to find temporal motif maps is a similar, but slightly more complex process. Here, one fixes a time-window W. Then, for each activation cycle index t, one creates a set S(t) consisting of pairs of the form:

(A, s)

where A is an Atom and 0<=s<=W is an integer temporal offset. We have:

(A,s)withinS(t)

if Atom A is firing activation at time t+s. Itemset mining is then used to search for common subsets among the S(t). These common subsets are common patterns of temporal activation, i.e. repeated temporal motifs.

The strength of this approach is its ability to rapidly search through a huge space of possibly significant subsets. Its weakness is its restriction to finding maps that can be incrementally built up from smaller maps. How significant this weakness is, depends on the particular statistics of map occurrence in OCP. Intuitively, we believe frequent itemset mining can perform rather well in this context, and our preliminary experiments have supported this intuition.

#### Frequent Subgraph Mining for Map Mining

A limitation of FIM techniques, from an OCP perspective, is that they are intended for relational databases; but the information about co-activity in an OCP instance is generally going to be more efficiently stored as graphs rather than RDB's. Indeed an ActivityTable may be effectively stored as a series of graphs corresponding to time intervals — one graph for each interval, consisting of HebbianLinks formed solely based on importance during that interval. From an ActivityTable stores like this, the way to find maps is not frequent itemset mining but rather frequent subgraph mining, a variant of FIM that is conceptually similar but algorithmically more subtle, and on which there has arisen a significant literature in recent years.

#### Evolutionary Map Detection

Complementary to the itemset mining approach, the OCP design also uses evolutionary optimization to find maps. Here the data setup is the same as in the itemset mining case, but instead of using an incremental search approach, one sets up a population of subsets of the sets S(t), and seeks to evolve the population to find an optimally fit S(t). Fitness is defined simply as high frequency - relative to the frequency one would expect based on statistical independence assumptions alone.

In this context, one probably wishes to use a variant of evolutionary optimization using some diversity preservation mechanism, so that the population contains not merely a single most optimal subset, but a number of unrelated subsets, all of which are reasonably fit. EDA's like MOSES are fairly good in this regard, having much less tendency than GP to converge to highly homogeneous final populations.

#### Map Dynamics

Assume one has a collection of Atoms, with:

- Importance values I(A), assigned via the economic attention allocation mechanism.
- HebbianLink strengths (HebbianLink A B).s, assigned as (loosely speaking) the probability of B's importance assuming A's importance.

Then, one way to search for static maps is to look for collections C of Atoms that are strong clusters according to HebbianLinks. That is, for instance, to find collections C so that:

The mean strength of (HebbianLink A B).s, where A and B are in the collection C >> The mean strength of (HebbianLink A Z).s, where A is in the collection C and Z is not

(this is just a very simple cluster quality measurement; there is a variety of other cluster quality measurements one might use instead.)

Dynamic maps may be more complex, for instance there might be two collections C1 and C2 so that:

Mean strength of (HebbianLink A B).s, where A is in C1 and B is in C2 Mean strength of (HebbianLink B A).s, where B is in C2 and A is in C1

are both very large.

A static map will tend to be an attractor for OCP's attention-allocation-based dynamics, in the sense that when a few elements of the map are acted upon, it is likely that other elements of the map will soon also come to be acted upon. The reason is that, if a few elements of the map are acted upon usefully, then their importance values will increase. Node probability inference based on the HebbianLinks will then cause the importance values of the other nodes in the map to increase, thus increasing the probability that the other nodes in the map are acted upon. Critical here is that the HebbianLinks have a higher weight of evidence than the node importance values. This is because the node importance values are assumed to be ephemeral — they reflect whether a given node is important at a given moment or not — whereas the HebbianLinks are assumed to reflect longer-lasting information.

A dynamic map will also be an attractor, but of a more complex kind. The example given above, with C1 and C2, will be a periodic attractor rather than a fixed-point attractor.