Pattern matching

From OpenCog
(Redirected from Cog-bind)
Jump to: navigation, search

OpenCog has a pattern matcher or query engine or variable unifier that can be used to search the atomspace for specific patterns or arrangements or 'templates' of atoms. After specifying some (arbitrarily complex) arrangement of atoms, that is, a hypergraph, the pattern matcher can find all instances of that hypergraph in the atomspace. The pattern or template can have "holes" in it -- locations that are variable -- and so the pattern matcher can act to "fill in the blanks" when presented with a pattern that has blanks in it. For example, a pattern such as "John threw a _____." can be solved or grounded if the atomspace contains the sentence "John threw a ball." In this example, the solution or grounding is "ball". Patterns may have arbitrary numbers of blanks (variables) arranged in arbitrary ways, and may have multiple solutions. Examples of patterns can be found on the BindLink page. The BindLink provides an easy-to-use, convenient API to the pattern matcher. Below follows a more technical description.

It is a unifier, in the theoretical computer-science sense of Unification, because that is what it does. It is a query engine, because the act of performing unification over a database of relations is the same thing as querying that relational database, such as with SQL. See relational algebra in wikipedia, for example. The difference here is that OpenCog relations are expressed as hypergraphs, so that the pattern matcher actually implements a hypergraph query language (HQL) with the BindLink. Its a pattern matcher because it does pattern matching. These are three different names for the same thing.

The book Term Rewriting and All That, Franz Baader, Tobias Nipkow, 1998, Cambridge University Press(GoogleAmazon) provides a good intro (see page 80 of paperback edition). What is implemented here is more sophisticated: it unifies multiple terms at once, and automatically handles unordered terms (terms that can be in any order), provides support for quotation, execution and evaluation. A "term" is any "link", in the sense of "term algebra". It is useful to think of "terms" because, in a term algebra, the notion of variables and arity & etc are -well-developed. In the pattern matcher, a "clause" is a top-level term, that is, a pattern is a single AndLink that wraps N clauses.

The pattern matcher provides support for the concept of negation as failure from autoepistemic logic. That is, a match can be accepted only if it does not contain a specific sub-pattern or set of clauses.

In many ways, the pattern matcher resembles a hygenic macro language; one might even say that it *is* a macro language. See:

The difference, however, is this: a macro expansion can be applied only once. By contrast, patterns are more like production rules, and can be applied repeatedly, whenever the data fits.

The pattern matcher can be thought of as a building block for a:

The pattern matcher is meant to be used as a building block for more sophisticated functions. It is meant to be the foundation on which to build forward and backward chainers, as well as the (user-defined) indexes needed to operate these (e.g. to implement the Rete algorithm or variants).

Pattern matching has a dual, which can be loosely termed as pattern recognition. Thus, if pattern matching is a search for all data that fit a pattern, then pattern recognition is the search for all patterns that might fit some data. In atomese, this dual function is implemented by the DualLink, and is thus an implementation of a kind-of "dynamic" Rete algorithm. The pattern recognizer is built on the same infrastructure as the pattern matcher; it simply reverses the direction of the "arrows", allowing a search for production rules, rather than their application.

The pattern matcher has NOT been influenced by miniKanren, but probably should be.

The pattern matcher has both C++ and scheme interfaces. Most users will want to employ the scheme interfaces; the C++ API is really meant for core developers who are extending the system.

See also:

Overview

The pattern matcher consists of several components that can be used separately or together to perform any of several related tasks:

  • Perform graph re-writing. That is, if a (dynamic) match is found, then a new and different graph is created as a result. This can be thought of as a single-step forward-chainer or graph re-write rule applier.

At the core is a component that is able to compare one hypergraph to another, and find a match, if one exists. If one (or more) matches are found, then the corresponding variables in the match are grounded. This is proper, core pattern matcher. The determination of what constitutes a "match" is performed by user-defined callbacks, thus making the pattern matcher highly customizable. A default callback provided; it will accept "syntactical" matches: it implements the simplest form of matching as one might expect to have ("free theory" or "empty theory" unification).

One of the callbacks is the 'initializer' or 'traverser', which specifies the set of graphs that will be examined for a match. The default callback traverses the entire AtomSpace, looking for suitable places where the pattern match can be started. Thus, the default callback, when used with the pattern matcher, can locate all matching patterns in the AtomSpace.

Another component, the 'instantiator', will create an instance of a graph, given a pattern containing variables, and a map giving the substitution of values for variables.

If desired, a capable C++ programmer can use the core matcher without having to use the default initializer/traverser. For example, one could limit the sub-hypergraph search to some subset of the AtomSpace, instead of all of it. A capable C++ programmer could also avoid using the instantiator, and instead create some completely different callback, to do something completely different. All three of these components are implemented in the libquery module, with source code in /opencog/query. However, in essentially all situations, this kind of modification is not needed. The triumverate of these three components combined is sufficient to handle many or most interesting cases. No C++ code is required. If you think that this is not enough for your purposes, please contact User:Linas to discuss.

The SatisfactionLink can be used to perform atomspace queries, and dynamic evaluations. Graph re-writing can be done with the BindLink. BindLinks specify "rules" (in the sense of a graph rewrite rule), and apply those rules to the AtomSpace. Each "rule" is an ImplicationLink, given the interpretation "if P then Q", where P and Q are hypergraphs. If the hypergraph P is matched, then the graph Q is instantiated. The graph Q may be any hypergraph; but if Q is an ExecutionLink, then a match results in some program/function being executed. This is a very powerful feature, because an general, generic computation can be performed by the ExecutionLink. This is documented, together with practical examples, in the BindLink page.

Performance and flexibility

The pattern matcher was designed to was designed to be a flexible, fast, easy-to-use, configurable unifier with a bunch of unit tests that assure correctness.  Because it's written in C++, it should have excellent performance while traversing hypergraphs. However, it has a significant setup cost, so, for small, simple unification problems, hand-written snippets of code, even in scheme or python, can outperform it.  But once a problem starts getting complex, right around the time you get to 2, 3, 4 variables, with complicated match criteria, and start hitting the various backtracking issues that arise during such searches, this is where the pattern matcher comes to its own.

Custom behaviors

The pattern matcher was designed to was designed to be a flexible. Below are some remarks about how certain tasks can be achieved.

Modifying Truth Values

The pattern matcher will not in any way alter the truth values or attention values associated with the various atoms in the query pattern, or in the groundings. The clever user can write custom procedures to compute and modify these as desired.

This may be done in one of two ways. One way is to work directly with the values returned by cog-bind. Another, and usually a more convenient way, is to specify an ExecutionOutputLink as the 'return value'.

Triggering arbitrary actions

When an ExecutionOutputLink appears as the consequent of a match, it will cause a named (python or scheme) script to run. This script can perform arbitrary actions of any sort.

Pattern rejection, negation

The PM provides a mechanism for accepting matches only if they don't contain some specific pattern. This negation can be handled in several different ways: either the indicated pattern must not occur in the Atomspace, or it can occur with (for example) a low truth or attention value.

This feature can be understood as support for the concept of negation as failure in autoepistemic logic.

It is implemented by making use of "optional" clauses. If none of the optional clauses appear in the final solution, then the solution is accepted. If some of the optional clauses do appear, then a callback is invoked to determine if the pattern should be accepted or rejected. The callback can make arbitrary decision criteria to implement the concept of "negation".

There is a default callback, called the "crisp logic" callback, which accepts a pattern only if the optional clauses are either missing, or have a truth value whose strength is zero (i.e. are false).

Limiting search

The default callbacks behave in such a way that the entire AtomSpace is "searched" for a match. It is straight-forward to write callbacks that will limit the search in some way. The easiest and most useful way of limiting search might be to check the attention value of an atom, and reject a match if it is not high enough. This way, the only part of the AtomSpace that is searched is that with a high AV.

Because the callbacks can perform arbitrary actions, there are other ways in which the search could be limited.

The Simplified API

The pattern matching engine provides the most general and powerful programming interface, but it can be complicated to use, and is (currently) available only from C++. The "easy-to-use" pattern matcher, described above, has both a C++ and a scheme API. The C++ API can be found in /opencog/query/BindLinkAPI.h . The scheme API is described below.

The scheme API is implemented with three different, related functions: cog-satisfy, cog-satsisfying-set and cog-bind.

The cog-satisfy function takes a single handle, and evaluates, executes and/or performs a grounding, depending on whether the atom contains GroundedPredicateNodes, GroundedSchemaNodes and/or VariableNodes. It returns a truth value of true, if satisfaction was attained, else it returns false. The cog-satisfying-set function does the same, except that it returns a SetLink of the instantiated patterns.

Although the argument can be any atom, there is a slight boost in performance if the argument is wrapped in a SatisfactionLink. This is because certain intermediate results are cached when the SatisfactionLink is used; this caching is particularly important if the pattern search is to be done repeatedly. If the types of the variables in the expression need to be constrained with a VariableTypeNode, then the SatisfactionLink must be used.

A SatisfactionLink consists of two parts: optional variable declarations, and a body. The body specifies the pattern to be grounded (match-pattern). Thus, the general structure is:

 SatisfactionLink
   <var-decls>       (optional)
   <match-pattern>

The BindLink is similar, except that it can also perform graph re-writing; in this case, the body should be composed of a predicate, interpreted as the pattern to search for, and an implicand specifies how the search results are to be instantiated.

 BindLink
   <var-decls>       (optional)
   <match-pattern>
   <instantiated-pattern>

The cog-satisfy function treats the SatisfactionLink as a database query; the atomspace is searched for all graphs that match the specified query. Considered as a scheme function, cog-satisfy also has a scheme-valued return value: it returns a truth value. The cog-satisfying-set function does the same, except that it returns a list of all of the graphs that were found.

The cog-bind function treats the BindLink as a graph re-write rule: it performs a search; if a match is found, then the re-write is applied and placed into the atomspace. The result of applying the re-write rule may be zero, one or more instantiated patterns that were created in the atomspace. Considered as a scheme function, cog-bind also has a scheme-valued return value: it returns a list of all of the patterns that were instantiated.

The instantiated pattern can be simple, or very complex. To be effective as a graph-re-write rule, the BindLink should typically specify some "complex" pattern, as otherwise, no re-writing occurs. However, the re-writing aspect can also be completely ignored, and cog-bind can simply be used to obtain matches, while skipping the notions of 'instantiation' and 'execution' described above. In this case, the instantiated pattern can simply be a single variable, which would result in no change at all to the atomspace. So, for example, to find and return a list of all of the animals in the atomspace, one may perform the following:

(define find-animals
  (BindLink
    ;; The variable to be bound
    (VariableNode "$var")
    ;; The pattern to be searched for
    (InheritanceLink
       (VariableNode "$var")
       (ConceptNode "animal")
    )
    ;; The value to be returned.
    (VariableNode "$var")
  )
)
 
;; Run the above pattern
(cog-bind find-animals)

The result of running the pattern matcher will be a list of all possible groundings for the variable $var. So, for example, if the atomspace contains:

(InheritanceLink (ConceptNode "fox") (ConceptNode "animal"))
(InheritanceLink (ConceptNode "skunk") (ConceptNode "animal"))

then cog-bind will return a SetLink with the two animals:

(ListLink (ConceptNode "fox") (ConceptNode "skunk"))

Because the implicand was simply a single, naked VariableNode, the instantiator had no work to do; the atomspace was not altered, and the list of results was returned to the scheme caller. The variable declaration was optional: the following would provide the same results:

(define find-animals
  (BindLink
    ;; The pattern to be searched for
    (InheritanceLink
       (VariableNode "$var")
       (ConceptNode "animal")
    )

    ;; The value to be returned.
    (VariableNode "$var")
  )
)
 
;; Run the above pattern
(cog-bind find-animals)

To get a list of the satisfied clauses, the BindLink and ImplicationLink can be dispensed with, entirely:

(define find-animal-inheritance
     ;; The pattern to be searched for
     (InheritanceLink
        (VariableNode "$var")
        (ConceptNode "animal")
    )
)
 
;; Run the above pattern
(cog-satisfying-set find-animal-inheritance)

Design

The pattern matcher is written in C++, and has multiple different components, which can be used in a variety of different ways, for many different tasks. It can be customized or adapted to different uses by implementing new and different handlers for callbacks. Thus, it is difficult to understand what the pattern matcher "is", or what it "does", without understanding the components it is built out of, and without understanding the current default settings for the callbacks. This section provides an overview of the design and how to work with it.

The Engine

The core, underlying matching is implemented in class PatternMatchEngine. The main entry point is class PatternMatchEngine::match(). It takes four inputs: a list of variables, a list of clauses, a list of optional clauses (negations), and a user-defined callback class.

The callbacks govern the "semantics" of matching, and are completely user-definable. By providing an appropriate callback, the user can impose any semantics whatsoever on the concept of "what constitutes a match". The engine merely presents candidates to the callbacks, but it is up to the callbacks to accept or reject candidates. Thus, for example, the default callback implements "negation semantics" for the optional clauses: it rejects a match if any of the optional clauses are present. These are discussed in greater detail, below.

The list of variables is the set of bound variables that will appear in the clauses and the negations. Variables can "match anything"; more specifically, if some atom is discovered where the variable would be, then the user-defined callback is called to report the potential match. The callback can accept or reject the match; if it accepted, then that match is a "grounding" of that variable.

The list of clauses determine the hypergraph to be matched. Each clause should be thought of as a directed tree, the edges of the tree corresponding to the outgoing sets of the atoms; the terminal leaves of the tree are Nodes, and the internal vertexes of the tree are Links. The list of clauses should be "connected" to one-another: that is, each clause should have at least one node that also appears in another clause. The engine walks these trees, attempting to match them up to the contents of the atomspace; a match is declared only if each of the clause trees match some tree in the atomspace, connected the same way at the nodes, and having had a successful, consistent grounding of each of the variables. If a match has been found, then a user-defined callback is called, reporting this match. The callback can "accept" or "reject" the match. If the callback "accepts" the match, then all further searching is halted; if it is rejected, then a search for other matches continues. The user-defined callback is free to do as it wishes: for example, if it wants to find all matches, it could simply record the match that was found, "reject" it, and then wait for more matches.

The optional clauses are lists of clauses that may or may not be present to satisfy a match. The default callback implements negation semantics for the optional clauses: that is, it explicitly rejects a match if any of the optional clauses are present. Other, user-defined callbacks are free to implement a different semantics for the optional clauses.

Callbacks

The callbacks are defined by a virtual base class class PatternMatchCallback. It has a concrete implementation in class DefaultPatternMatchCB, which does "something reasonable" for most users. Anyone can write a custom callback class to do something different. The callback class has many different callbacks:

  • node_match() -- this is called whenever two nodes are compared. The callback can accept or reject the match. The default is to accept a match if the two atom handles are identical. A non-default behavior might be more general, for example, allowing matches based on type, or based on truth values, or any other criteria.
  • variable_match() -- this is called when a variable appearing in a clause is to be matched to a candidate grounding. The default callback accepts a match if the candidate node is one of a given set of types (as described in the BindLink page -- that is, the default callback implements the BindLink semantics).
  • link_match() -- this is called to compare links. The default callback accepts a match if the link types and the link arities agree. The default callback does not compare the actual outgoing sets: this is the engine's job. A user-defined callback can take any action here, if desired, including modifying or computing truth values, etc.
  • clause_match() -- called when a single clause has been matched. This callback can accept or reject the match, or take the opportunity to compute truth values. The default callback always accepts matches.
  • optional_clause_match() -- called when an optional clause is matched. The default callback implements "negation semantics": it always rejects optional clauses. The central idea of negation is that a match is possible only if the negated clause is absent. Other callbacks can be used to define different kinds of optional-clause semantics. In particular, the very idea of "negation" depends on the interpretation of the strength and confidence of a truth value: "negation" might not make sense if the confidence is zero. Similarly, perhaps an optional clause should be rejected only if it is mostly "true", and accepted if it is sufficiently "false". The default callback implements crisp-logic semantics for the negation.
  • solution() -- called when all clauses have been matched (and all optional clauses have been appropriately handled). This provides an opportunity for the callback to record the solution, and to either terminate or continue the search.

In addition to the above, there are three informational callbacks:

  • push() -- called after a clauses has been fully grounded. This allows the callback class to save partial state onto a stack, if desired. The pattern matcher is a stack machine; as it advances, it maintains a state stack; when it retreats (back-tracks), it pops its own internal stack. This callback provides an opportunity for the user to deal with back-tracking as well; for example, it may want to maintain a cumulative truth value, or attention value; clause by clause. When forced to backtrack, it would then need to be able to discard partial results. This would be done with the pop() callback, below.
  • pop() -- called when back-tracking.
  • set_type_restrictions() -- called very early, before any pattern matching has commenced. It conveys the variable typemap that was specified by the BindLink.

In addition to the above, there is one very important callback that allows the user to limit or guide the search; that is, to determine the domain of the search.

  • perform_search() -- called when the engine is invoked. This callback must then turn around and invoke the main entry point to the engine: the PatternMatchEngine::do_candidate() method, as many times as it needs to, to perform the search over the desired domain. That is, the engine itself does not know where to start the search, and even when told where to start, it will not necessarily perform an exhaustive search. This callback allows the user to specify the initial start point, as well as to take the needed actions to make sure that the desired domain has been searched. The default callback attempts to search the entire atomspace, assuming "reasonable" semantics. Under certain unusual semantics, the default may fail to find all matches. In particular, it may fail if the link_match() and node_match() callbacks implement a broader matching semantics than the default.

The Pattern Matcher

The pattern matcher wraps up the engine, described above, together with the default callback class, to provide a reasonably generic, easy-to use system. To make it "easy-to-use", it implements a "variable-typer", an "implicator", an "instantiator" and an "executor". These four additional functions implement the semantics described in the BindLink page. Specifically:

  • The variable-typer is able to decode and process the type restrictions made in the BindLink, and enforce those type restrictions via the default callback. (Again: the engine itself does not set any policy: all policy, such as type-matching, is enforced via user-defined callbacks.)
  • The "implicator" extracts the pattern from an ImplicationLink, and configures it in a format suitable for the engine. It also extracts the implicand, and provides it to the instantiator.
  • The "instantiator" creates a new hypergraph in the atomspace, given a template pattern (containing variables), and a set of variable groundings. By default, the template pattern is the second half of what was found in the ImplicationLink, but this could be over-ridden.
  • The "executor" handles any ExecutionLinks discovered by the instantiator. It is able to call functions in scheme, and pass them arguments; work has been started for it to handle python scripts as well. See github issue #160 Other programming languages can be (easily!) added.

The pattern matcher has several main entry points, which are more-or-less equivalent, except for the format of their arguments:

  • bindlink() -- accepts a BindLink as its sole argument.
  • crisp_logic_bindlink() -- as above, but uses crisp-logic semantics to reject negated clauses.
  • match() --- accepts the same arguments as the engine, except in the format of ListLinks, instead of std::vector<Handle>. Note that a callback must be explicitly specified with this method; it does not use the default callback described above.

Example Walk-through

Perhaps the easiest way to explain how the PM works is with a walk-through of what actually happens.  We start with the set of terms to be unified:

(AndLink
    (EvaluationLink
        (PredicateNode "whichmarker")
        (ListLink
            (VariableNode "$C")
            (VariableNode "$D")))
    (EvaluationLink
        (VariableNode "$D")
        (ListLink
            (VariableNode "$C")
            (VariableNode "$E")))
    (EvaluationLink
        (VariableNode "$A")
        (ListLink
            (VariableNode "$B")
            (VariableNode "$C")))

The first thing that happens is that the perform_search() callback is called.  This callback can be used to limit the domain of the search.  

The default callback does things so that it seems like the whole atomspace is "searched".  Note that they are all variables, except for one constant: PredicateNode "whichmarker".  This is an excellent place to start the "search": we've already narrowed the "search" down to just ONE atom!  Bravo!  

Next, the default callback loops over the entire incoming set of PredicateNode "whichmarker", passing control back to the pattern matcher.   The incoming set is not sorted or truncated in any way, all of it is explored.  But you could change this if desired. For every atom in the incoming set, the PM compares it to "EvaluationLink". It calls a callback called link_match() which is given two links: the EvaluationLink, and the atom in the incoming set.  The default callback declares a match if and only if the atom has type EvaluationLink.    But you could change this if desired; for example, to reject the match if the AV of the atom is too low.   Or you could reject the match if the atom is not in some set. 

If you reject the match, the PM backtracks to the next atom in the incoming set.

If you accept the match, then the PM tries to compare the outgoing sets of the two EvaluationLinks.  Since we already did the PredicateNode, it doesn't have to be re-done.  Next is the ListLink.  Again the link_match() callback is called ... with two arguments: the ListLink from the pattern, and the atom in position 2 of the outgoing set of the EvaluationLink, whatever it may be.  Again, you must accept or reject the comparison.  The default is to accept if and only if the types match.

If you reject the match, the PM backtracks upwards, twice.

If you accept the match,  then the PM tries to compare the outgoing sets of the ListLink's. This time, the variable_match() callback is called, with two arguments. The first one is VariableNode "$C"  and the second one is (for example) ConceptNode "tree".   You can choose to accept or reject this candidate grounding.

The default callback simply makes sure that ConceptNode is one of the allowed types that VariableNode "$C" is allowed to have.

You can write a callback that will reject the match if the tv or av of ConceptNode "tree" is too high or low, or if ConceptNode "tree"  is not a recently generated relex2atom atom. Perhaps you want to check some HebbianLink's to see if ConceptNode "tree" is hot stuff. 

Perhaps one of the best solutions would be to write a generic callback that, whenever its given a VariableNode "$whatever" and a ConceptNode "tree", it rejects the match unless the STI is over 1000.   That way, as long as relex2logic always sets the STI above 1000, and then decays it lower values over time, the search will always be limited to the newest relex2logic atoms.   Simple, easy, straightforward.  This would require only a small amount of new code.  You just have to decide what the selection criterion is.

Anyway, that's how the PM proceeds: it matches incoming and outgoing sets, atom by atom, calling a callback for each possible match.   This sounds real easy, but actually, its not, because of backtracking.  Backtracking requires a stack.   But this is an advanced topic that you don't need to worry about, unless you accept/reject criteria is accumulating data.

Backtracking

For example, if the first three matches had a mediocre TV, maybe you accept them, but, for the 4th one you decide that too much crappy TV has happened, and you reject. Backtracking might backtrack over the first three matches, and so, now you want to start with a clean slate.  There are push() and pop() callbacks that allow you to "clean the slate".

Implementation

The implementation may be found in the /opencog/query directory; see opencog/query/README.md for details.

The core of the pattern matcher is implemented as a stack machine, pushing and popping a stack as it finds partial matches to candidate hypergraphs. It uses a set of callbacks that can be used to define what constitutes a "match", to maintain a user-defined stack, and to modify truth values, attention values, and/or even create new atoms as the traversal progresses. One set of default callbacks provide simple crisp logic (boolean logic) evaluation; another set provide variable-type scoping.

The pattern matcher has programming API's at both low and high levels. Currently, the best examples demonstrating how to use these API's are the functional-test cases in the tests/query directory. These are best studied in the order in which they are run.

This pattern matcher is fully functional. User:Linas has several plans for it, including cleaning up the code base to remove the old-style for-each calls, and to improve performance by embedding atom pointers directly. Other plans include adding user-defined indexes to the atom space. These can be understood as being more-or-less equivalent to the structures of the RETE algorithm, and thus should help with the construction of chainers.

Some Q&A on Callbacks

These questions and answers are pasted here from an email exchange, with some reformatting; the goal being to avoid the same questions being posted and needing to be answered again....

Question:

We want to start with a certain set S of Atoms, and have the PatternMatcher look only at Atoms in the set S, during its whole search process.

Answer:

The PM never actually searches "the atomspace", it has absolutely no idea of the space its exploring. The space is entirely controlled by the user-defined callbacks. The callbacks determine the space S that is searched.

Let me take a step back and quickly review what the structure of an atom is, and what the PM does with it. Atoms(Links) have an outgoing set: this is explicitly hard-coded into a Link, and is immutable. Atoms also have an incoming set, which is only implicitly defined, as the set of all Links that point to it.

Given a single Link, the transitive closure of it's outgoing set can be pictured as a tree. Likewise, given a single Atom, the transitive closure of the incoming set can be pictured as an upside-down tree (viz, the relationship is going in the opposite direction).

The PM itself never actually touches, uses, or accesses the atomspace! (with some strange misc exceptions) Understanding this is key to answering your question. Since PM doesn't know about the atomspace, it doesn't know that its searching it... or any subset of it. It is up to the user-defined callbacks to control the search.

To start the search, the PM needs to be given at least one atom that is in common between the pattern, and the space S. (If its given none, then game over, and no match). To obtain this single atom, it calls the perform_search() callback. The user must define this callback; there's a reasonable default. in place.

The default arbitrarily picks some node in the pattern, any node that is not a variable, and then tells the PM to start searching there.

PM then proceeds by comparing the incoming trees, starting at that common node, looking for a match. To obtain the incoming set of an atom, the PM performs a callback. It is the job of the callback to provide the incoming set. There is a default callback for this: it does a reasonable thing. It returns the ENTIRE incoming set for that atom!

This is the key step to answer your question. What is the "entire incoming set"? How is it defined? Well, it is the set J of all Links X that have the current atom A appearing somewhere in the outgoing set.

To limit the search to your set S, you need to write a callback, called get_incoming_set(), that returns only those Links X that are in your set S (and are also actually in J) -- return the intersection of S and J. That's all.


Note however that to use the subset S, it has to be "computable": you have to run some algorithm to determine what S actually is. In particular, you need to have a set membership function that answers yes/no to indicate membership. If that's "very inefficient", don't blame the PM. You have to define S so that working with S is fast.

If computing set membership in S is excruciatingly slow, you can use a cache. Here's an example cache that is O(1) in the pattern matcher:

class myMiniSubspaceS : PatternMatchCB
{
    std::map<Atom,std::vector<Link>>  my_incoming_set_map;

    std::vector<Link> get_incoming_set(Atom a) {
            return my_incoming_set_map(a);
    }

   void add_atom_to_S(Atom a) {
       if (not a->is_a_link()) return;  // ignore nodes
       foreach(Atom ox in a->outgoing_set()) {
                iset = my_incoming_set_map(ox);
                iset.insert(a);  // because a is in the incoming set of ox
       }
}; // the end.

That's all there's to it, nothing more. The class just holds a cache of incoming sets for the atoms in S. The PM will then only search over stuff in S. It can be very efficient: so if you use hash_map instead of map, the lookup is in O(1) time, even if S is large.

The only "gotcha" is that you have to somehow define what belongs in the set S, and spend the cpu cycles to actually put atoms into S. I don't know how efficient that might be.

Again: the class above is just a cache, which is attempting to cover up the fact that S is very inefficient to compute ...

However, if you can compute the set S efficiently and quickly, then you don't need the above cache. Just compute S on the fly, every time the get_incoming_set() callback is called.

Question:

We want to have the Pattern Matcher prefer matches that have a higher STI (ShortTermImportance)

The get_incoming_set() callback returns a VECTOR of atoms. It searches the first atom first, the second atom second, etc. If you place the vector in sorted STI order, then it will "prefer" higher STI matches first.

Also note that, of the callbacks, one of the last ones to get called is named solution(). It presents the user with one single grounding that the PM thinks it has found, and asks for approval.

If the user-callback replies "I like it", then the search terminates. Otherwise the PM tries again.

The default CB always replies "I hate it", and thus the PM explores *all* possible matches, until it can find no more, Then it dies exhausted, a heart-broken death, having searched the entire kingdom far and wide, and having utterly failed in its quest to find a match.

Now hears the clever bit: the default CB secretly recorded ever match that was returned, and put it in its treasure chest, and then sent out the PM to find yet more. Thus, as the PM scoured the kingdom, the CB built up an exhaustive collection of groundings, because the PM was so through.

However, if the CB was not so greedy, it could save a lot of effort by the PM by simply terminating the search early.

Follow-up question:

Yes, but is there a way to randomize where the PM starts its search, via specifying a random seed for the starting point? Would this also be done in a custom callback?

Answer:

Yes, as already detailed above. However, perhaps a better randomization could be done with get_incoming_set(). Maybe this would work:

When asked for the incoming set of some atom A, perform a random draw to obtain some subset of the true set. Maybe even order this subset by STI. The return that. Since PM always searches the incoming set in order, this random draw will randomize the search. I can't say how efficient it would be ...

Formal definition

Below is a formal definition of the pattern matcher, of the kind you might see in math books. Let Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle \Sigma} be a signature, or rather, a collection of both function symbols and predicate symbols. In OpenCog, these correspond to Atoms; that is a given atom type should be thought of as a signature. Let Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle X} denote a set of variables. In OpenCog, these correspond to a collection of VariableNodes with different names.

Given this definition, Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle T_\Sigma(X)} is the term algebra of all possible arrangements of Atoms. Likewise, Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle T_\Sigma} is the subset of terms that have no variables in them: these are the "grounded terms" or "closed forms".

At any fixed point in time, the AtomSpace stores some subset of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle T_\Sigma(X)} . Let Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle A\subset T_\Sigma(X)} denote the current contents of the atomspace. Let Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle A_g \cap T_\Sigma\subset A} denote the subset of the atomspace consisting only of grounded terms. The pattern matcher is then comes in three forms. The first form is the function Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle sat} , implemented by SatisfactionLink

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle sat: T_\Sigma(X) \times A_g \to \{0,1\}}

that, given some term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle t(x_1, x_2, \cdots ,x_n)\in T_\Sigma(X)} , it locates some (any) grounded term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle t(g_1, g_2, \cdots ,g_n)\in A_g} in the AtomSpace, and returns 1 if such a grounded term can be found; else it returns 0.

The above is slightly oversimplified: by employing the correct term definitions, one can search all of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle A} and not just Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle A_g} , if desired; the same is true for the link types below. However, the current simplification makes things slightly easier to discuss.

The GetLink form is the function

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle get: T_\Sigma(X) \times A_g \to A_g}

that, given some term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle t(x_1, x_2, \cdots ,x_n)\in T_\Sigma(X)} , locates all grounded terms Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle t(g_1, g_2, \cdots ,g_n)\in A_g} in the AtomSpace, and returns a set containing all such lists Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle (g_1, g_2, \cdots ,g_n)} Note that the operation of the GetLink causes the size of the AtomSpace to expand: it is almost certainly the case that the ListLink Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle (g_1, g_2, \cdots ,g_n)} was not previously in the AtomSpace, and even if it was, it probably was not in a SetLink with other similar lists.

The BindLink form is the function

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle bind: T_\Sigma(X) \times T_\Sigma(X) \times A_g \to A_g}

that, given a pair of terms Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle p(x_1, x_2, \cdots ,x_n)\in T_\Sigma(X)} and Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle q(x_1, x_2, \cdots ,x_n)\in T_\Sigma(X)} , locates all grounded terms Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle p(g_1, g_2, \cdots ,g_n)\in A_g} in the AtomSpace, creates a corresponding Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle q(g_1, g_2, \cdots ,g_n)} , and returns a set of all such terms.

The DualLink is the category oppsite to the GetLink:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle dual: T_\Sigma \times A \to T_\Sigma(X)}

that, given a specific Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle t(g_1, g_2, \cdots ,g_n)} , finds all terms Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): {\displaystyle t(x_1, x_2, \cdots ,x_n)\in A} that have one or more variables in place of the grounded atoms. The DualLink becomes interesting and non-trivial, when the variables are globby: are GlobNodes that can match multiple sequential ground terms.

Hmmm. Please realize that the above actually over-simplifies what is really happening: it glosses over a number of very interesting details of the system: most importantly, that the logical connectives AndLink, OrLink are atoms themselves, so that the AtomSpace is not just a term algebra, but a language. Next, the types of the Atoms are atoms themselves... and so terms can have type specifications inside of them. There are even type constructors. The AbsentLink can be used to specify a form of negation... the ChoiceLink can be used to specify choices... and so on.

Possibly similar projects

These projects might be similar:

  • Burlap on github. Quote from IW: Burlap lets you define the problem as a network of nodes with vectors of features or attributes attached to it. The algorithms can search through the network using a combination of brute-force searching and statistically guided exploration. The higher level of the algorithm plans the search and deploys the best algorithms. The toolkit includes dozens of the most useful algorithms for agent-based search.
The tool is useful for data-driven worlds where the data can be mapped into a large collection of nodes or objects. The code is written in Java and includes a large assortment of debugging and profiling tools that are useful for keeping the code moving toward the optimal goal.
  • Guile-Log -- the DataLog subset of ProLog for Guile.

See also