# Idea: Approximate Pattern Matching

# Background and Goals

This page roughly summarizes an approach to approximate pattern matching in the Atomspace, which was conceived in discussion in the OpenCog HK office in late November 2014.

We had two examples in mind while developing this algorithm was..

**Example 1**

Query: The Relex2Logic output from "Ben went to the store with whom ?"

Target: The Relex2Logic output from "Ben and Amen went to the store."

**Example 2**

Query: The Relex2Logic output from "Ben went to the store with whom ?"

Target: The Relex2Logic output from "Ben went to the restaurant with Amen."

...

The goal is to find a quick and dirty approximate matching approach (or set of approaches), to handle cases like this ... an approach that doesn't require subtle semantic transformation of one collection of Relex2Logic output Atoms into another.

An approximate matching algorithm like this is obviously going to be very useful in many cases besides NL query processing, as well.

On careful reflection/analysis, it seems to us that Example 1 and Example 2 are probably best handled by quite different approaches.

Notable in Example 1 is: Both the query and the target contain the key words Ben, went and store .. but connected via quite different sets of links. Whereas in Example 2 the difference between the query graph and the target graph is quite small.

# Approximate Pattern Matching When the Query and Target are Quite Similar

Consider Example 2 above as illustrating the general case where the query and target are very similar but not identical.

In this case, it seems one could apply the Pattern Matcher with a customized callback, using a query of the form "Ben went to the store with $qVar1" ... and then the pattern matcher would visit "Ben went to the restaurant with Amen." at some point and return it as an approximate answer.

However, it seems doubtful this could work for Example 1 above. In Example 1, the query and target are going to look quite different in the Atomspace, so it seems unlikely that, while searching for matches to "Ben went to the store with $qVar1", the Pattern Matcher would find "Ben and Amen went to the store."

# Approximate Pattern Matching When the Query and Target are Fairly Different

Here we sketch a quite different algorithm, not using the Pattern Matcher, that we think may work OK in the case where the query and target contain overlapping Atoms but are pretty differently structured.

Suppose we have a query pattern Q with K key Atoms in it (we'll assume 2<=K<=7, say, for sake of concreteness).

For starters, suppose this query pattern has a single free variable (i.e. the query wants this variable filled in with some "answer").

Step 1 of the algorithm is to find a set of candidate answers. Then Step 2 filters down the candidate set.

Step 1 works as follows.

Given an Atom A, refer to the set of Atoms separated from A by d or fewer links as the d-neighborhood of A.

For m=1 to (say) 3:

- Find the intersection of the m-neighborhoods of all the K key Atoms
- If this is empty, find Atoms that live in the intersection of as many of the K key Atoms' m-neighborhoods as possible

If good results are found for m=1, no need to proceed to m=2. Etc.

Step 1 will result in a list of Atoms that are not too far from all or most of the K key Atoms.

For each Atom B found in Step 1, Step 2 works as follows:

- Find the minimal Atom graph G that contains B and the key Atoms in whose m-neighborhoods B was found to live

- Calculate the edit distance between G and the query Q (see note below for details on this.)

...

This should work if Atoms don't tend to have that many links. If Atoms have a lot of links, then we need to get trickier, and we need to filter the m-neighborhoods before calculating their intersection. The most natural way to do this filtering, in an OpenCog context, is as follows:

- For N cognitive cycles, spread STI from the K key Atoms
- After this has been done for N cycles, stop spreading and note the STI values of the Atoms in the Atomspace
- Filter the m-neighborhoods in Step 1 above, via taking only the R Atoms in each m-neighborhood, that have the highest STI

This gives a scalable method.

...

If the query has two free variables in it, Step 1 works the same way. But then Step 2 has to look at many pairs of Atoms (pairs where each Atom is an output of Step 1), and find a graph containing each pair plus the K query Atoms. There may be too many pairs to handle; so then some heuristic method may be used to choose which pairs to use. For example:

- Spread STI from the Atoms found in Step 1, for N cycles
- Build HebbianLinks between these Atoms, based on their co-activation during these cycles of STI spreading
- Preferentially look at pairs that have acquired high-weight HebbianLinks between them

## Graph Edit Distance

Defining and implementing edit distance between graphs is nontrivial. But there are research papers explaining how other folks did it, and even some available code, so I guess not much algorithmic innovation is needed to handle this currently.

Survey paper:

Code for one method: