New Fishgram Design, 2013

From OpenCog
Jump to: navigation, search

This page describes a proposed fishgram design that has now been abandoned, unimplemented. Its an archive of a historical attempt at pattern mining.

New Fishgram Design, 2013

This page sketches a design for a new implementation of what's been called "Fishgram" -- Frequent (or Interesting) SubHyperGRAph Mining. (These ideas are pretty much obsoleted by the Pattern Miner, which incorporates them in a slightly different form...)

For an overview of the purpose and design of the prior Fishgram version, see http://agi-conference.org/2012/wp-content/uploads/2012/12/paper_39.pdf ... see also the page Fishgram for more details ...

The prior Fishgram version (designed and implemented by Jade O'Neill) was a valuable prototype for exploring the idea of subhypergraph mining on the Atomspace, but has significant drawbacks: it's not very scalable, and doesn't support online mining (needed to recognize patterns as they emerge in a rapidly changing Atomspace). Thus the desire for a rebuild from the ground up.

The ideas presented on this wiki page have emerged from various discussions over the last year or so, mainly with Shujing Ke and Jade O'Neill.

Background Notes

There exist many "frequent subgraph mining" algorithms in the literature, some with open source code attached, but after extensive investigation it was realized none of these is directly applicable to the problem of Atomspace subhypergraph mining. (For instance, many are focused on extracting common patterns among an ensemble of separate small graphs, whereas what we have in the Atomspace is a single large graph.)

Most frequent subhypergraph mining algorithms extend the ideas of the Apriori or FP-Growth frequent itemset mining algorithms. The Fishgram approach outlines here is also viewable as an extension of the basic idea of FP-Growth

http://en.wikipedia.org/wiki/Association_rule_learning#FP-growth_algorithm

However, the specific way it utilizes the based FP-Growth idea is different than prior subgraph miners, and customized for the Atomspace context.

Key Data Structures

The main part of any frequent sub-whatever mining algorithm is the data structure. One needs a data structure that can be efficiently manipulated and grown, even when it's very large; and that suits the nature of the data one is mining. In the case of subhypergraph mining of the sort we need to do from the Atomspace, this becomes somewhat complicated.

I will describe here a structure I call a "teshdag" (consisting of two dags superposed on the same set of nodes, each node containing a templatized or literal subhypergraph from the Atomspace), which I believe suits the purpose.

First: By a "generalized hypergraph" I mean a structure like an Atomspace, with nodes and k-ary links, and also the possibility for links to point to links.

Next, define a templatized subhypergraph ("tesh") of a generalized hypergraph G as a generalized hypergraph, which is obtainable from G via:

  • extracting a subhypergraph H from G
  • optionally, replacing one or more of the nodes in H with VariableNodes (newly created VariableNodes, not existing elsewhere in G). Call these "meta-variables." These are variables that are created in the course of the pattern mining process, not existing in the Atomspace beforehand.


Define a data structure called a "teshdag" (templatized subhypergraph dag) as follows. Each teshdag node contains within it:

  • a representation of a tesh T
  • a number indicating the importance of T (e.g. in the case that importance is frequency, this is just a count of how many times the tesh T is instantiated in G)

A teshdag has two kinds of directed links between its nodes:

  • extension links
  • instantiation links

An extension link pointing from teshdag node N to teshdag node M, means that N's interior tesh is a subhypergraph of M's interior tesh (i.e., M's tesh adds some nodes and link onto N's tesh)

An instantiation link pointing from teshdag node N to teshdag node M, means that N can be obtained by setting some of M's meta-variables to specific values

Thus, a teshdag is a superposition of two different dags, which have common nodes but different links.

Note that when one creates a new teshdag node N as a child (either an extension child or an instantiation child) of an existing teshdag node M, one must check whether N is also a child of any other existing teshdag nodes (since a teshdag is a dag and not a tree).

The Basic Mining Algorithm

Now I outline an FP-Growth type mining algorithm utilizing a teshdag.

First define two parameters bounding the patterns to be found:

  • Let MVL = the maximum number of metavariables permissible in a pattern to be mined. For very simple patterns we can let MVL=1. MVL=2 lets us get more patterns. The bigger we let MVL get, the bigger the search space gets, obviously.
  • Let SL = the size limit, the maximum number of Links permissible in a pattern to be mined.

Also, we must assume some IMPORTANCE CRITERION, which may be assessed of any tesh. This may be simply "frequency count", i.e. the number of times the tesh is instantiated in G. Or it may be something fancier, such as an information-theoretic criterion. We can use frequency count for initial testing.

First I will describe how to apply the algorithm to an existing Atomspace.

We start with an initial teshdag C, which might consist of the set of all Links in the Atomspace (or might consist of something more restrictive or specialized).

Then we repeat the following two steps over and over, gradually growing the teshdag C:

Step 1: Make instantiation children

Iterate through all the leaf nodes of C, and for each such node N: If N doesn't contain more than MVL meta-variables, then create new instantiation children of N. Each such child is produced by: a) make a new tesh by taking one non-metavariable node of N's internal tesh, and replacing it with a metavariable; b) make an instantiation child of N containing this newly created tesh.

Step 2: Make extension children

Iterate through all the leaf nodes of C; and for each such node N, if the internal tesh is not already at the size limit SL, create a "working tesh" X, defined initially as the internal tesh T of N

Then X is expanded as follows:

  • Iterate through all the instantiations Z of T in G
  • For each such instantiation, take each link that is attached to Z in G (but is not already in X; nor could it be made to appear in X by instantiating one of X's metavariables), and add it to X

After this is done, X now contains a bunch of links in addition to those contained in T.

The subhypergraphs of X (that contain T) that pass the IMPORTANCE CRITERION, should then be used as internal teshes of new extension children of N

...

One repeats Step 1 and 2 over and over, until nothing new is happening because the MVL and SL limits are always hit.

Online Mining

The algorithm above has been described in the context of processing a whole Atomspace in one big gulp, but it can just as well be aplied incrementally.

When a new Atom is added to the Atomspace, it can be added to the MinerAgent's list of Recently Changed Atoms. Periodically, the MinerAgent can then:

  • for each such Atom A, find the teshdag nodes whose internal teshes contain links that equal, or can be instantiated by, links involving A in G
  • carry out the basic mining Steps 1 and 2 mentioned above, but restricted to these teshdag nodes

Variations of this are possible, obviously. If the importance criterion depends on the truth values of Atoms rather than just their presence vs. absence, then an Atom should be added to the Recently Changed Atoms list if its truth value changes significantly, not just if it is newly created.

What is required to enable this sort of processing, is simply the ability to mark certain teshdag nodes as Active. Then, when the algorithm sweeps through the teshdag, it will only expand the Active nodes and ignore the others. So each teshdag node must have a binary flag indicating if it's currently Active or not, or some similar mechanism.

Given this online approach, one could process a whole Atomspace incrementally as it is grown, avoiding the "batch mode" altogether.

Implementation and Deployment Issues

MinerAgent

The algorithm outlined above can be implemented in a MinerAgent that incrementally updates a very large teshdag based on Atom-change events. There should also be an API enabling a batch process for mining patterns from a whole Atomspace at once.

Deployment

This kind of mining is obviously a very costly process in terms of RAM and processing time. On the other hand, it also needs very frequent and intensive access to the Atomspace! Thus, probably an ideal implementation scenario would be to have a separate machine for pattern mining, where this machine contains its own cloned copy of the main Atomspace of the OpenCog instance, as well as the teshdag structure.

Efficiency Issues

As the teshdag will be a very large data structure, it will be important to implement it in an efficient way, with careful regard to garbage collection issues.

Also, the mining algorithm is a natural for multithreaded programming, since different branches of the teshdag can be expanded in different threads. It would be good if the implementation supported this robustly from the start, or at very least were built in such a way that supporting this would not require rearchitecture.