The OpenCog AtomSpace is a knowledge representation (KR) database and the associated query/reasoning engine to fetch and manipulate that data, and perform reasoning on it. Data is represented in the form of graphs, and more generally, as hypergraphs; thus the AtomSpace is a kind of graph database, the query engine is a general graph re-writing system, and the rule-engine is a generalized rule-driven inferencing system. The vertices and edges of a graph, known as Atoms, are used to represent not only "data", but also "procedures"; thus, many graphs are executable programs as well as data structures. These Atoms, which are permanent and immutable, can be assigned fleeting, changing Values to indicate the truth or likelihood of that atom, or to hold other kinds of transient data.
- 1 Overview
- 2 Current Implementation
- 3 Working with AtomSpaces
- 4 Re-imagining the AtomSpace
- 4.1 Issues
- 4.2 AtomSpaceLinks instead of AtomSpaces
- 4.3 A distributed AtomSpace
There are pre-defined Atoms for many basic knowledge-representation and computer-science concepts. These include Atoms for relations, such as similarity, inheritance and subsets; for logic, such as Boolean and, or, for-all, there-exists; for Bayesian and other probabilistic relations; for intuitionist logic, such as absence and choice; for parallel (threaded) synchronous and asynchronous execution; for expressions with variables and for lambda expressions and for beta-reduction and mapping; for uniqueness constraints, state and a messaging "blackboard"; for searching and satisfiability and graph re-writing; for the specification of types and type signatures, including type polymorphism and type construction (dependent types and type variables TBD).
Because of these many and varied Atom types, constructing graphs to represent knowledge looks kind-of-like "programming"; the programming language is informally referred to as Atomese. It vaguely resembles a strange mashup of SQL (due to queriability), prolog/datalog (due to the logic and reasoning components), lisp/scheme (due to lambda expressions), haskell/caml (due to the type system) and rule engines (due to the graph rewriting and forward/backward chaining inference systems). This "programming language" is NOT designed for use by human programmers (it is too verbose and awkward for that); it is designed for automation and machine learning. That is, like any representation_knowledge representation system, the data and procedures encoded in "Atomese" are meant to be accessed by other automated subsystems manipulating and querying and inferencing over the data/programs. Also, viewed as a programming language, it can be very slow and inefficient and not scalable; it was not designed with efficiency and programming tasks in mind, nor with scalability; but rather, it was designed to allow the generalized manipulation of networks of probabilistic data by means of rules and inferences and reasoning systems. It extends the idea of probabilistic logic networks to a generalized system for automatically manipulating and managing data.
The use of the AtomSpace, and the operation and utility of Atomese, remains a topic of ongoing research and change, as various dependent subsystems are brought online. These include machine learning, natural language processing, motion control and animation, planning and constraint solving, pattern mining and data mining, question answering and common-sense systems, and emotional and behavioral psychological systems. Each of these impose sharply conflicting requirements on the system architecture; the AtomSpace and "Atomese" is the current best-effort KR system for satisfying all these various needs in an integrated way. It is likely to change, as the various current short-comings, design flaws, performance and scalability issues are corrected.
The examples directory contains demonstrations of the various components of the AtomSpace, including the python and scheme bindings, the pattern matcher, the rule engine, and many of the various different atom types and their use for solving various different tasks.
As a symbol table
Once an atom is placed in an atomspace, it becomes unique. Thus, atoms are the same thing as symbols, and the atomspace is a symbol table. The unique ID of a Node is it's string name, and the unique ID of a Link is it's outgoing set.
Examples of symbol tables can be found in python and scheme. In Prolog, symbols are called atoms. While Prolog makes a distinction between atoms, variables and (compound) terms, OpenCog does not: all of these are considered to be atoms of different types (e.g. the VariableNode).
In some programming languages, such as scheme and LISP, symbols can be given properties. The analogous concept here is that of a valuation: each OpenCog atom can be given various different properties, or values.
As a tuple space or blackboard
The intent of the AtomSpace is to offer persistent manipulable storage. As such, it resembles a tuple space. Some tuple spaces take the form of "Object Spaces"; however, unlike an Object Space, the AtomSpace does not offer mutual exclusion of object access. Instead, both Atoms and Values are immutable, while the association between an Atom and a Value can be freely updated by anyone at any time.
As a database
Nodes and Links (and thus hypergraphs) can be created, manipulated and traversed without using the AtomSpace. The AtomSpace provides some additional utility that is not available by using "naked" atoms:
- Uniqueness of Atoms. An AtomSpace will contain only one single Node of a given name, and one single Link with a given outgoing set.
- Indexes. The AtomSpace contains several indexes to provide fast access to certain classes of atoms, such as queries by atom type, by atom name, by outgoing set, and the like.
- Persistence. The contents of an AtomSpace can be saved-to/restored-from some storage medium, such as a traditional SQL database.
- Distributed computing. Atoms can be shared across multiple network servers, by having them use a common database backend. See distributed AtomSpace for details.
- Pattern search. The AtomSpace can be searched for all subgraphs of some particular shape.
- Change notification. The AtomSpace provides signals that are delivered when an atom is added or removed, thus allowing actions to be triggered as the contents change.
Hypergraph Database Design Requirements
The AtomSpace is nothing more than a container for storing (hyper-)graphs. It is optimized so that one can implement (probabilistic/fuzzy) inference-engines/theorem-provers, etc., such as PLN, on top of it. To achieve this, the Atomspace must have some database-like features:
- It must be queriable for all occurrences of certain hypergraphs, i.e. one must be able to perform pattern-matching against arbitrary query patterns.
- It must maintain user-defined indexes of certain common pattern types, using e.g. the RETE algorithm, and/or other database-like indexing systems.
The AtomSpace implementation has some additional requirements:
- Perform queries as fast as possible.
- Be thread-safe.
- Hold hypergraphs consisting of billions or trillions (or more) nodes/links; scale to petabytes (or more).
- Save & restore hypergraphs to media, such as disk, a more traditional SQL or non-SQL database, or other system (e.g. flat files, XML, etc.).
- Exchange, query and synchronize hypergraphs with other, network-remote atomspaces or servers, in a manner as fast as possible, while maintaining as much coherency as possible.
- Allow extremely fast access to an Atom's incoming set, outgoing set, and values, such as the TruthValue and AttentionValue.
The current implementation fails on some of these requirements, especially in scalability; there is room for improvement. Nonetheless, a basic system exists. The following is currently possible:
- The AtomSpace can be accessed using C++, Scheme, Python, Haskell, a web interface (REST) or ZeroMQ. Low-level access via C++ is about 40x faster than access with scheme or python; thus scheme and python are better suited for invoking high-level functions (such as the pattern matcher, which can take a fair amount of time to search large atomspaces.) The REST and ZMQ interfaces are another order of magnitude slower than the scheme/python interfaces, and are thus usable only for casual inspection.
- Access to the AtomSpace via C++ is completely thread-safe. Access via scheme is thread-safe, but seems to somehow serialize to only one thread, presumably due to some currently-unknown design bug/feature. The python interface is not thread-safe.
- AtomSpace contents can be saved/restored as string s-expressions (i.e. scheme), and in an SQL database.
- The SQL storage backend allows a weak form of dynamic, on-demand saving and loading of hypergraphs. The contents of the AtomSpace can be bulk-saved/restored. In addition, individual atoms can be saved/restored on request. This is intentional: forcing all atoms to be automatically mirrored to the database would require a large CPU overhead, especially for workloads that rapidly create and destroy a lot of transient atoms, or modify truth values at a rapid clip. The store-back queues for the SQL database are fully parallel and asynchronous; the current design uses four store-back threads by default.
- Multiple AtomSpaces on multiple network-connected servers can communicate via the SQL backend. That is, they can share common atoms. User algorithms must explicitly push atoms to the database in order for them to become visible to other AtomSpaces. This is intentional: trying to maintain constant synchronization across multiple AtomSpaces would be extremely CPU-intensive, especially since many workloads create and destroy atoms at a rapid pace, and there is no need to share such atoms.
- A generic pattern matcher has been implemented. Queries may be specified as hypergraphs themselves, using SatisfactionLink, BindLink and GetLink. Procedures may be triggered using ExecutionOutputLink. Low-level access to the pattern matcher is possible by coding in C++, scheme or (coming soon!) python.
- A half-dozen hard-coded indexes are kept by the atomspace. User-defined indexes are not yet supported (due to lack of demand).
- Atomspace contents may be viewed with several different visualization tools, including the atomspace visualizer.
- The AtomSpace benchmark tool measures the performance of a dozen basic atom and atomspace operations. A diary in the benchmark directory maintains a historical log of measurements.
Atoms, as objects in C++, are access by means of Handles; a Handle is essentially a pointer to an atom. Knowing the Handle is enough to retrieve the corresponding atom. When all of the Handles to an atom are gone, then the Atom is deleted. Thus, if you plan to forget your handles, you should give the Atom to the Atompsace, which will not forget it. You can ask the Atomspace for it later.
The AtomSpace provides signals that are delivered when an atom is added or removed, thus allowing actions to be triggered as the contents change.
The current implementation of the AtomSpace signals is very strongly deprecated; it has a very strong, serious negative performance impact on all users. The current signals API is likely to be removed at some future date.
Multiple atom spaces within a single program can be used, either independently, or as nested (hierarchical) environments, or both. Multiple atom spaces on multiple network-connected machines can (automatically) communicate via the SQL backend (if they are all pointed to the same backend).
The AtomTable provides the indexes and the signals for the AtomSpace. Its the main class that actuall holds the references to atoms. The
AtomSpaceImpl class provides two services: the interface to the persistence backend, and maintenance of the AttentionBank.
The AtomSpace class itself does nothing at all, other than to provide a uniform API to these other services.
Atoms can be stored on disk, or even another machine, rather than always kept in RAM. Thus, an Atom is fetched into local memory only when it is actually needed. See opencog/atomspace/BackingStore.h for details, and opencog/persist/README for a persistence implementation.
The persistence backend allows multiple AtomSpaces on multiple networked servers to communicate and share atoms. In practice, this can scale to possibly several dozen servers but probably no more than that, depending on the workload.
See distributed AtomSpace for details.
Working with AtomSpaces
There are multiple ways of working with atomspaces. See the multiple AtomSpaces wikipage for more info.
If you are using the scheme bindings:
- ,apropos atomspace -- list all atomspace-related functions available to the scheme programmer. These are:
- cog-atomspace -- get the current atomspace used by scheme
- cog-set-atomspace! ATOMSPACE -- set the current atomspace
- cog-new-atomspace -- create a new atomspace
- cog-atomspace? OBJ -- return true if OBJ is an atomspace, else return false.
- cog-atomspace-uuid ATOMSPACE -- get the UUID of the ATOMSPACE
- cog-atomspace-env ATOMSPACE -- get the parent of the ATOMSPACE
- cog-atomspace-stack -- A stack of atomspaces
- cog-push-atomspace -- Push the current atomspace onto the stack
- cog-pop-atomspace -- Pop the stack
You can get additional documentation for each of the above by typing ,describe func-name. So, for example ,describe cog-new-atomspace prints the documentation for the cog-new-atomspace function.
You can share an atomspace with python by using this:
- python-call-with-as FUNC ATOMSPACE -- Call the python function FUNC, passing the ATOMSPACE as an argument.
Say ,describe python-call-with-as to show the documentation for this function.
The above is provided by the (opencog python) module. For example:
$ guile guile> (use-modules (opencog) (opencog python)) guile> (python-eval " from opencog.atomspace import AtomSpace, TruthValue from opencog.atomspace import types def foo(asp): TV = TruthValue(0.42, 0.69) asp.add_node(types.ConceptNode, 'Apple', TV) ") guile> (python-call-with-as "foo" (cog-atomspace)) guile> (cog-node 'ConceptNode "Apple")
If you are in python, you have these functions available:
- AtomSpace() -- create a new atomspace
- scheme_eval_as(sexpr) -- evaluate scheme sexpr, returning an atomspace.
- set_type_ctor_atomspace(ATOMSPACE) -- set the atomspace that the python type constructor uses.
An example usage of some of the above:
from opencog.atomspace import AtomSpace, TruthValue from opencog.atomspace import types from opencog.scheme_wrapper import scheme_eval_h, scheme_eval_as # Get the atomspace that guile is using asp = scheme_eval_as('(cog-atomspace)') # Add a node to the atomspace TV = TruthValue(0.42, 0.69) a1 = asp.add_node(types.ConceptNode, 'Apple', TV) # Get the same atom, from guile's point of view: a2 = scheme_eval_h('(cog-node \'ConceptNode "Apple")') # Print true if they are the same atom. They should be! print "Are they equal?", a1 == a2
Re-imagining the AtomSpace
Practical day-to-day use of the AtomSpace has shown that there are several practical and theoretical issues. This suggests a change in design. First a review of the issues, followed by several alternatives.
Distributed, Scalable AtomSpace
The current implementation is neither distributed, nor obviously scalable. Numerous discussions have not satisfactorily resolved this issue. The below might clarify the nature of the issues.
SetLinks are Bad
It turns out that the SetLink is a really bad idea. The reason for this is that SetLinks are essentially un-editable. And that a good thing, to a point: we want all atoms to be immutable. However, set membership can and should be fluid, over time, and SetLinks inhibit this fluidity. MemberLinks are much better.
The SetLink page discusses why they are bad in greater detail.
Currently, the GetLink and BindLink will search the entire atomspace for a certain pattern. The GetLink simply searches, while the BindLink searches and replaces. The PutLink will also search-and-replace, but it searches only a singe SetLink. Since SetLinks are bad, it would be better if a modified version of PutLink would explore a collection of MemberLinks instead, to accomplish it's task. Anyway: PutLink and BindLink are similar, except that PutLink searches a smaller set. The FilterLink is a lot like the GetLink ... but it searches a smaller set.
The Atomspace is a giant set
The Atomspace is kind of like a giant set. It knows lots of things about its members. Perhaps its possible to reformulate the AtomSpace explicitly as a kind-of membership atom? If so, then perhaps BindLink/PutlLink could be merged into just one operation, and GetLink/FilterLink could be just one operation.
The ContextLink was originally intended to allow an atom to have different values in different contexts. For example, in the context of "Earth", the statement "the sky is blue" would be true, while in the context of "Mars", it would be false. This suggest that truth values (and all values) should be stored not with the atom itself, but with the ContextLink. Notice, also, that ContextLinks start to look a little bit like MemberLinks...
Atoms are fat; Atomspace insertion is slow
Atoms are large: currently, about 1KBytes to 1.5KBytes for "typical" datasets. Atoms use up bytes to hold values, even if they never actually hold any values (and many atoms - perhaps half or more, never hold values).
AtomSpace insertion is slow, running at the rate of hundreds-of-thousands insertions per second. This seems almost pathetic.
Multiple AtomSpaces are poorly defined
Although the current implementation supports multiple AtomSpaces, the over semantics are rather ill-defined. How these mutiple atomsapces interact with each other during atom insertion, deletion and search is sometimes unintuitive and unexpected.
In practice, multiple atomspaces are never used, except for one important exception: as a scratch space during pattern search, where they are heavily used to hold temporary results. Specifically, to isolate the primary atomspace from side-effects during search.
AtomSpaceLinks instead of AtomSpaces
Given the above issues, perhaps the AtomSpace can be abolished, and replaced by an AtomSpaceLink? That is, the membership of an atom could be accomplished by stating:
AtomSpaceLink AtomSpaceNode "the primary atomspace" SomeAtom
This would indicate that SomeAtom belongs to the AtomSpace called "the primary atomspace". Several things happen here:
- An atom can belong to multiple atomspaces.
- The AtomSpaceLink could function as a ContextLink, and be the natural location for values.
- Because the AtomSpaceLink behaves as a kind-of membership, it could unify the PutLink/BindLink into one, and unify GetLink/FilterLink into one.
What does the AtomSpace actually do?
To evaluate the above proposal, we need to clarify what it is that the atomspace actually does. So, again:
- It makes atoms unique. Why is this needed? So that when one modifies the value (truthvalue) of some atom, one modifies that one single unique atom: this is a memory management issue, of always being able to find the atom that one wishes to work with. This avoids accidentally having two atoms, both of which say "the ball is round", and one has a value of true, while the other has a value of false, and truth value queries become schizoid.
- It makes certain searches very fast, such as "get me all atoms of type X".
- It provides a natural place where a backend can save/restore atoms to disk.
That's it. There's really nothing else.
Who uses these things? Well ...
- The "get me all atoms of type X" is used in only one place: pattern matching.
- The uniqueness of atoms is a virtue, except when its not: "the sky is blue" can be a unique atom, but we need to assign it two different values, based on whether its on Earth or on Mars.
A distributed AtomSpace
Making the AtomSpace less monolithic, and more like a kind-of membership link potentially makes it easier to create a distributed atomspace. It idea here is to use CRDTs - Conflict-free replicated data types, and specifically to leverage existing implementations in databases such as Redis or Riak.
The point is that Atom values are de-facto per-atom key-value datastores: for all intents and purposes, every atom has a key-value store attached to it. Different kinds of values have different merge policies: for example, values that count things need to be updated to hold the total count. Truth values have more subtle update semantics, depending on the truth value type.