AtomSpace

From OpenCog
(Redirected from Atomspace)
Jump to: navigation, search

The OpenCog AtomSpace is a knowledge representation (KR) database and the associated query/reasoning engine to fetch and manipulate that data, and perform reasoning on it. Data is represented in the form of graphs, and more generally, as hypergraphs; thus the AtomSpace is a kind of graph database, the query engine is a general graph re-writing system, and the rule-engine is a generalized rule-driven inferencing system. The vertices and edges of a graph, known as Atoms, are used to represent not only "data", but also "procedures"; thus, many graphs are executable programs as well as data structures. These Atoms, which are permanent and immutable, can be assigned fleeting, changing Values to indicate the truth or likelihood of that atom, or to hold other kinds of transient data.

Overview

The AtomSpace can be thought of in several different ways (outlined below). Primarily, it is meant to be a graph database for storing knowledge. It's design is driven by several philosophical principles; the primary principle is that "all state should be visible to all algorithms". A variant of this principle is well-known to designers of distributed computing systems (i.e. "the cloud"): to move state from one computer to another, the state has to be where you can find it and grab it and transport it. A variant of this shows up in distributed functional languages, like Scala, or functional languages in general, such as Haskell. The AtomSpace tries to extend this principle not only to data movement, and data storage, but to general AI algorithms: learning, logical inference, data mining. In order for some algorithm to perform some reasoning about some data, that algorithm needs to be able to find that data. The place to "find it" is in the atomspace.

To summarize the philosophy: all OpenCog state is in the Atomspace. There isn't any state that isn't in the AtomSpace; it can't be found under a rock, or tucked away in some object. The AtomSpace is one giant closure.

As a database

The AtomSpace is effectively a database, and has many common database-like properties. The basic data structure is that of a graph; so the AtomSpace is a graph database, and is used primarily as an in-RAM database.

The basic structures used to represent the graph are Nodes and Links. These can be created (in RAM), manipulated and traversed without using the AtomSpace. The AtomSpace provides a database structure that is not available by using "naked" atoms:

  • Uniqueness of Atoms. An AtomSpace will contain only one single Node of a given name, and one single Link with a given outgoing set. This is analogous to a key-constraint in an ordinary database, so that, for example, employee "Jane Doe", her salary and her title is stored uniquely, once, with a unique identifier. This helps ensure data reliability, e.g. by avoiding accidentally storing two different salaries for the same person.
  • Indexes. The AtomSpace contains several indexes to provide fast access to certain classes of atoms, such as queries by atom type, by atom name, by outgoing set, and the like. Just as in ordinary databases, indexes are used to speed the search: they affect the overall performance, but are not otherwise directly visible to the user.
  • Persistence. The AtomSpace is primarily used as an in-RAM database, with all of the performance advantages of having data cached in RAM. However, sometimes you need to store that data for longer periods of time; thus, the contents of an AtomSpace can be saved-to/restored-from some storage medium. One currently-used back-end is Postgres, (a traditional SQL database).
  • Decentralized computing vs. distributed computing. The design philosophy of encoding all state as graphs, and disallowing any hidden state means that atoms are easily transported from one computer to another, or shared among network servers. This is achieved by delegating this function to the back-end. There are many strong, scalable, mature distributed computing solutions; rather than re-inventing this technology, the AtomSpace uses them via the backend API. In the current implementation, this means Postgres. See Distributed AtomSpace and Decentralized AtomSpace for details.
  • Query language. A query language provides a way for searching for and finding the data contained within them. The AtomSpace provides a sophisticated and powerful pattern search language, Atomese. It differs from SQL because AtomSpace structures are graphs, not rows in a table. It differs from Gremlin, in that it works at a higher, more abstract level, offering more powerful and refined constructs (basically, its more compact, and easier-to-use). It is partly inspired by DataLog, when the graphs encode logical expressions. We sometimes call it "pattern matching", but this name is perhaps misleading, because the programming language, Atomese, is stored in the AtomSpace itself (and thus differs from the pattern matchers built into Lisp, Haskell, and other programming languages).
  • Change notification. The AtomSpace provides signals that are delivered when an atom is added or removed, thus allowing actions to be triggered as the contents change.

As a symbol table

Once an atom is placed in an atomspace, it becomes unique. Thus, atoms are the same thing as symbols, and the atomspace is a symbol table. The unique ID of a Node is it's string name, and the unique ID of a Link is it's outgoing set.

Examples of symbol tables can be found in python and scheme. In Prolog, symbols are called atoms. While Prolog makes a distinction between atoms, variables and (compound) terms, OpenCog does not: all of these are considered to be atoms of different types (e.g. the VariableNode).

In some programming languages, such as scheme and LISP, symbols can be given properties. The analogous concept here is that of a valuation: each OpenCog atom can be given various different properties, or values.

As a tuple space or blackboard

The intent of the AtomSpace is to offer persistent manipulable storage. As such, it resembles a tuple space. Some tuple spaces take the form of "Object Spaces"; however, unlike an Object Space, the AtomSpace does not offer mutual exclusion of object access. Instead, both Atoms and Values are immutable, while the association between an Atom and a Value can be freely updated by anyone at any time.

Insofar as arbitrary knowledge can be place into the AtomSpace, it roughly follows the blackboard design pattern, that is, it resembles a form of blackboard system.

As a knowledge representation system

Knowledge representation and reasoning in the AtomSpace is primarily accomplished with a type system, formalized on type theory. The goal of formalizing the system is to avoid ad hoc design and implementation decisions. The formality makes it easier to transpose modern AI research results and algorithms into Atomese.

Atomese provides pre-defined Atoms for many basic knowledge-representation and computer-science concepts. These include Atoms for relations, such as similarity, inheritance and subsets; for logic, such as Boolean and, or, for-all, there-exists; for Bayesian and other probabilistic relations; for intuitionist logic, such as absence and choice; for parallel (threaded) synchronous and asynchronous execution; for expressions with variables and for lambda expressions and for beta-reduction and mapping; for uniqueness constraints, state and a messaging "blackboard"; for searching and satisfiability and graph re-writing; for the specification of types and type signatures, including type polymorphism and type construction (dependent types and type variables TBD).

Atomese

Because of these many and varied Atom types, constructing graphs to represent knowledge looks kind-of-like "programming"; the programming language is informally referred to as Atomese. It vaguely resembles a strange mashup of SQL (due to queriability), prolog and datalog (due to the logic and reasoning components), intermediate languages (due to graph rewriting), LISP and Scheme (due to lambda expressions), Haskell and CamL (due to the type system) and rule engines (due to the forward/backward chaining inference system).

This "programming language" is NOT designed for use by human programmers (it is too verbose and awkward for that); it is designed for automation and machine learning. That is, like any knowledge representation system, the data and procedures encoded in Atomese are meant to be accessed by other automated subsystems manipulating and querying and inferencing over the data/programs. More narrowly, it looks like an intermediate language seen inside of compilers and dynamic programming languages: it represents knowledge as abstract syntax trees, which can be manipulated, altered and optimized (that is, the optimizer inside a compiler is an example of a term rewriting system, designed to transform a source language into an optimized machine language; the intermediate language is where the optimization takes place)

Atomese has aspects of declarative programming, procedural program and functional programming (as well as query programming, of course). Viewed as a programming language for procedural or functional programs, the current implementation of Atomese is actually very slow, inefficient and not scalable, mostly because it is interpreted, rather than compiled. The API itself provides a natural way to compile and cache bytecodes; its just not implemented. Until recently, the design has been driven by the need to enable generalized manipulation of declarative networks of probabilistic data by means of rules and inferences and reasoning systems. That is, it extends the idea of probabilistic logic networks to a generalized system for automatically manipulating and managing declarative data.

As a research platform

The use of the AtomSpace, and the operation and utility of Atomese, remains a topic of ongoing research and change, as various dependent subsystems are brought online. These include machine learning, natural language processing, motion control and animation, planning and constraint solving, pattern mining and data mining, question answering and common-sense systems, and emotional and behavioral psychological systems. Each of these impose sharply conflicting requirements on the system architecture; the AtomSpace and Atomese is the current best-effort KR system for satisfying all these various needs in an integrated way. It is likely to change, as the various current short-comings, design flaws, performance and scalability issues are corrected.

The examples directory contains demonstrations of the various components of the AtomSpace, including the python and scheme bindings, the pattern matcher, the rule engine, and many of the various different atom types and their use for solving various different tasks.

Hypergraph Database Design Requirements

The AtomSpace is nothing more than a container for storing (hyper-)graphs. It is optimized so that one can implement (probabilistic/fuzzy) inference-engines/theorem-provers, etc., such as PLN, on top of it. To achieve this, the Atomspace must have some database-like features:

  • It must be queriable for all occurrences of given hypergraphs, i.e. one must be able to perform pattern-matching against arbitrary query patterns. It needs to solve the graph matching problem.
  • It must maintain user-defined indexes of certain common pattern types, using e.g. the RETE algorithm, and/or other database-like indexing systems.

The AtomSpace implementation has some additional requirements:

  • Perform queries as fast as possible, e.g. using David-Putnam style algorithms or other boolean SAT algorithms.
  • Be thread-safe.
  • Hold hypergraphs consisting of billions or trillions (or more) nodes/links; scale to petabytes (or more).
  • Save & restore hypergraphs to media, such as disk, or traditional SQL, noSQL and graph databases, or other data systems (e.g. RDF, triple stores).
  • Exchange, query and synchronize hypergraphs with other, network-remote atomspaces or servers, in a manner as fast as possible, while maintaining as much coherency as possible. Rather than re-inventing fast, reliable, coherent ACID/BASE distributed data computing infrastructure, it should leverage well-known, mature, existing systems to provide this function. That is, the AtomSpace should be a usability layer providing advanced function on top of existing distributed computing systems.
  • Allow extremely fast access to an Atom's incoming set, outgoing set, and values, such as the TruthValue and AttentionValue.

Current Implementation

The current implementation fails on some of these requirements, especially in scalability; there is room for improvement. Nonetheless, a basic system exists. The following is currently possible:

  • The AtomSpace can be accessed using C++, Scheme, Python, Haskell, a web interface (REST) or ZeroMQ. Low-level access via C++ is about 4x faster than access with scheme, and 40x faster than with python; thus scheme and python are best suited for invoking high-level functions. (Guile version 2.9 adds compiled bytecode via GNU Lightning, and so begins to approach C++ for performance) The REST and ZMQ interfaces are another order of magnitude slower than the scheme/python interfaces, and are thus usable only for casual inspection. This is mostly because atoms are really very small (almost always less than 100 bytes, typically following a Zipf distribution), whereas REST/ZMQ work efficiently only for very large objects (megabytes, such as images and web pages).
  • A GNU R interface to the matrix code looks like it could be a major win! This could vastly improve usability for the biology and genomics projects (where most of the code is in R). See opencog/matrix for details.
  • Access to the AtomSpace via C++ is completely thread-safe. Access via scheme is thread-safe, but behaves poorly when more than 3 or 4 concurrent threads are running; it seems to live-lock somehow, presumably due to some currently-unknown design bug/feature in guile. The python interface is probably thread-safe, but always serializes to a single thread (this is how python is designed).
  • AtomSpace contents can be saved/restored as string s-expressions (i.e. scheme), and in an SQL database.
  • The SQL storage backend provides user-controlled, on-demand saving and loading of hypergraphs. Individual atoms can be saved/restored on request, as well as all-atoms-by-type, all atoms-in-incoming-set, etc. Storage is user-driven, rather than being "automatic", because automatic fetch and write-back of data requires anticipating what the user might do. Anticipating what the user might do is essentially impossible: a workload might rapidly create and destroy a lot of temporary, transient atoms, or might modify truth values at a rapid clip; saving temporary atoms to the backend is pointless. The existing SQL backend provides multiple parallel asynchronous store-back queues (four threads, by default). Putting Postgres on an SSD, as opposed to a spinning disk makes a HUGE difference: 10x or 20x. Go for PCIe if you can afford it, else SATA III; this is where the bottleneck is.
  • Multiple AtomSpaces on multiple network-connected servers can communicate via the Postgres backend. That is, they can share common atoms. User algorithms must explicitly push atoms to the database in order for them to become visible to other AtomSpaces. This is intentional: trying to maintain constant synchronization across multiple AtomSpaces would be extremely CPU-intensive, especially since many workloads create and destroy atoms at a rapid pace, and there is no need to share such atoms.
  • There's a Neo4J backend, but its very slow, due to the fact that Atoms are tiny, and sending them by network to Neo4J is an immensely huge bottleneck.
  • There's a hypergraph backend. It's mis-designed and broken.
  • A half-dozen hard-coded indexes are kept by the atomspace. User-defined indexes are not yet supported (due to lack of demand).
  • Atomspace contents may be viewed with several different visualization tools, including the atomspace visualizer.
  • The AtomSpace benchmark tool measures the performance of a dozen basic atom and atomspace operations. A diary in the benchmark directory maintains a historical log of measurements.

Handles

Atoms, as objects in C++, are accessed by means of Handles; a Handle is a smart pointer (std::shared_ptr<Atom>) to an atom. Knowing the Handle is enough to retrieve the corresponding atom. When all of the Handles to an atom are gone, then the Atom is deleted. Thus, if you plan to forget your handles, you should give the Atom to the Atompsace, which will not forget it. You can ask the Atomspace for it later.

Smart pointers implement memory management by reference counting. They are "opposite" (in the category-theoretic sense) to garbage collection. There was a version of the atomspace that attempted to use BoehmGC (BDWGC) for garbage collection. This worked poorly, primarily because C++ intrinsics (std::vector, std::map, etc.) use so many pointers and structures internally that very long reference chains show up, often more than ten hops long; garbage collection has trouble chasing such extremely long chains.

Signals

The AtomSpace provides signals that are delivered when an atom is added or removed, thus allowing actions to be triggered as the contents change.

The current implementation of the AtomSpace signals is very strongly deprecated; it has a very strong, serious negative performance impact on all users. The current signals API is likely to be removed at some future date. The basic problem is that if one user is interested in getting events (e.g. the AttentionBank) all users are impacted (e.g. PLN) and the performance impact is large. It is not clear how to avoid punishing everyone, just because one user needs this feature.

The current implementation uses a home-grown signals library from cogutils; an earlier version used boost:signals2, which was literally ten times slower and much much fatter, RAM-wise. It's not clear why boost:signals2 has to be so bloated and slow.

Multiple AtomSpaces

Multiple atom spaces within a single address space can be used, either independently, or layered, one on top of another. Layering is typically useful for creating a temporary atomspace that contains all the atoms in the main atomspace, but will be discarded after use. Layering is also useful when the main atomspace is extremely large (e.g. containing biological or genomic data) and needs to be shared in a read-only fashion between multiple users. A non-shared read-write atomspace can be layered on top of the read-only atomspace; this allows truth values and other values to be edited and altered, without corrupting the data in the main atomspace.

Multiple atomspaces on multiple network-connected machines can all connect to the same the SQL backing store.

The AtomTable

The AtomTable provides the indexes and the signals for the AtomSpace. Its the main class that holds the actual references to atoms.

The AtomSpace class itself does almost nothing at all, other than to provide a uniform API to the AtomTable and other services.

Persistence

Atoms can be stored on disk, or even another machine, rather than always kept in RAM. Thus, an Atom is fetched into local memory only when it is actually needed. See opencog/atomspace/BackingStore.h for details, and opencog/persist/README for a persistence implementation.

The persistence backend allows multiple AtomSpaces on multiple networked servers to communicate and share atoms. In practice, this can scale to possibly several dozen servers but probably no more than that, depending on the workload.

See distributed AtomSpace for details.

Working with AtomSpaces

There are multiple ways of working with atomspaces. See the multiple AtomSpaces wikipage for more info.

Scheme

If you are using the scheme bindings:

  • ,apropos atomspace -- list all atomspace-related functions available to the scheme programmer. These are:
  • cog-atomspace -- get the current atomspace used by scheme
  • cog-set-atomspace! ATOMSPACE -- set the current atomspace
  • cog-new-atomspace -- create a new atomspace
  • cog-atomspace? OBJ -- return true if OBJ is an atomspace, else return false.
  • cog-atomspace-uuid ATOMSPACE -- get the UUID of the ATOMSPACE
  • cog-atomspace-env ATOMSPACE -- get the parent of the ATOMSPACE
  • cog-atomspace-stack -- A stack of atomspaces
  • cog-push-atomspace -- Push the current atomspace onto the stack
  • cog-pop-atomspace -- Pop the stack

You can get additional documentation for each of the above by typing ,describe func-name. So, for example ,describe cog-new-atomspace prints the documentation for the cog-new-atomspace function.

You can share an atomspace with python by using this:

  • python-call-with-as FUNC ATOMSPACE -- Call the python function FUNC, passing the ATOMSPACE as an argument.

Say ,describe python-call-with-as to show the documentation for this function.

The above is provided by the (opencog python) module. For example:

   $ guile
   guile> (use-modules (opencog) (opencog python))
   guile> (python-eval "
   from opencog.atomspace import AtomSpace, TruthValue
   from opencog.atomspace import types

   def foo(asp):
         TV = TruthValue(0.42, 0.69)
         asp.add_node(types.ConceptNode, 'Apple', TV)
   ")
   guile> (python-call-with-as "foo" (cog-atomspace))
   guile> (cog-node 'ConceptNode "Apple")

Python

If you are in python, you have these functions available:

  • AtomSpace() -- create a new atomspace
  • scheme_eval_as(sexpr) -- evaluate scheme sexpr, returning an atomspace.
  • set_type_ctor_atomspace(ATOMSPACE) -- set the atomspace that the python type constructor uses.

An example usage of some of the above:

  from opencog.atomspace import AtomSpace, TruthValue
  from opencog.atomspace import types
  from opencog.scheme_wrapper import scheme_eval_h, scheme_eval_as
 
  # Get the atomspace that guile is using
  asp = scheme_eval_as('(cog-atomspace)') 

  # Add a node to the atomspace
  TV = TruthValue(0.42, 0.69)
  a1 = asp.add_node(types.ConceptNode, 'Apple', TV)

  # Get the same atom, from guile's point of view:
  a2 = scheme_eval_h(asp, '(cog-node \'ConceptNode "Apple")')

  # Print true if they are the same atom.  They should be!
  print "Are they equal?", a1 == a2

Re-imagining the AtomSpace

Practical day-to-day use of the AtomSpace has shown that there are several practical and theoretical issues. This suggests a change in design. First a review of the issues, followed by several alternatives.

Issues

Distributed, Scalable AtomSpace

The current implementation is neither distributed, nor obviously scalable. Numerous discussions have not satisfactorily resolved this issue. The most promising approach is to delegate all responsibility for distribution and scaling to an existing system that provides this. Of course, Postgres already does this, but also Apache Ignite might be suitable.

Several aspects of the AtomSpace have to be rethought and redesigned to make this work well. The below might clarify the nature of the issues.

SetLinks are Bad

It turns out that the SetLink is a really bad idea. The reason for this is that SetLinks are essentially un-editable. That is (was?) a good thing, to a point: we want all atoms to be immutable (for all the usual reasons why immutable data structures are good). However, set membership can and should be fluid, over time, and SetLinks inhibit this fluidity. MemberLinks are much better.

The SetLink page discusses why they are bad in greater detail. Things like the pattern matcher need to be modified to generate MemberLinks (a good thing! It allows parallelism and pipelining!). Things like PutLink and MapLink need to be modified to support MemberLinks and pipelining (so that different threads can stall, independently of one-another, increasing parallelism!).

Pattern searches

Currently, the GetLink and BindLink will search the entire atomspace for a certain pattern. The GetLink simply searches, while the BindLink searches and replaces. The PutLink will also search-and-replace, but it searches only a singe SetLink. Since SetLinks are bad, it would be better if a modified version of PutLink would explore a collection of MemberLinks instead, to accomplish it's task. Anyway: PutLink and BindLink are similar, except that PutLink searches a smaller set. The FilterLink is a lot like the GetLink ... but it searches a smaller set.

The nice thing about using MemberLink is that the pattern matcher can report each match, as it is found, instead of forcing the user to wait until all matches are found. This can, in turn, allow processing piplelines to be wired up, so that the next stage stalls, until the previous stage returns one more blob of data. In other words, MemberLinks allow futures and promises to be implemented; GetLink makes a promise, and PutLink consumes that promise, blocking until GetLink finds something.

The Atomspace is a giant set

The Atomspace is kind of like a giant set. It knows lots of things about its members. Perhaps its possible to reformulate the AtomSpace explicitly as a kind-of membership atom? If so, then perhaps BindLink/PutlLink could be merged into just one operation, and GetLink/FilterLink could be just one operation.

Exactly how to do this, without splurging a lot of RAM, or creating inefficiencies elsewhere is not entirely obvious. Some tricks are possible, (such as virtualized links, similar to GreaterThanLink) but these feel hacky.

ContextLinks

The ContextLink was originally intended to allow an atom to have different values in different contexts. For example, in the context of "Earth", the statement "the sky is blue" would be true, while in the context of "Mars", it would be false. This suggest that truth values (and all values) should be stored not with the atom itself, but with the ContextLink. Notice, also, that ContextLinks start to look a little bit like MemberLinks...

Atoms are fat; Atomspace insertion is slow

Atoms are large: currently, about 1KBytes to 1.5KBytes for "typical" datasets. Atoms use up (16 or 24?) bytes to hold values, even if they never actually hold any values (and many atoms - perhaps half or more, never hold values). Placing values into AtomSpaceLinks (aka ContextLinks) instead of in Atoms could trim the size of the Atom. It would also restore uniqueness to atoms, thus avoiding having to create duplicate atoms when one is read-only and the other is read-write.

AtomSpace insertion is slow, running at the rate of hundreds-of-thousands insertions per second. This seems almost pathetic.

Multiple AtomSpaces are poorly defined

Although the current implementation supports multiple AtomSpaces, the overall semantics are rather ill-defined. How these multiple atomspaces interact with each other during atom insertion, deletion and search is sometimes unintuitive and unexpected.

In practice, multiple atomspaces are infrequently used, except for several important cases: during pattern search, as a scratch space, to hold temporary results (that is, to isolate the primary atomspace from side-effects during search). Temp atomspaces are also used during chaining and PLN.

The idea now becomes that a collection of atomspaces becomes a Kripke frame or (I guess) a general frame. Instead of just having parent/child relationships between atomspaces, there would be a FrameLink describing the relations. AtomSpaces can now be understood as holders of "possible worlds", "possible universes", partially-complete inference trees (BIT's).

AtomSpaceLinks instead of AtomSpaces

Given the above issues, perhaps the AtomSpace can be abolished, and replaced by an AtomSpaceLink? That is, the membership of an atom could be accomplished by stating:

 AtomSpaceLink
     AtomSpaceNode "the primary atomspace"
     SomeAtom

This would indicate that SomeAtom belongs to the AtomSpace called "the primary atomspace". Several things happen here:

  • An atom can belong to multiple atomspaces.
  • The AtomSpaceLink could function as a ContextLink, and be the natural location for values.
  • Because the AtomSpaceLink behaves as a kind-of membership, it could unify the PutLink/BindLink into one, and unify GetLink/FilterLink into one.

What does the AtomSpace actually do?

To evaluate the above proposal, we need to clarify what it is that the atomspace actually does. So, again:

  • It makes atoms unique. Why is this needed? So that when one modifies the value (truthvalue) of some atom, one modifies that one single unique atom: this is a memory management issue, of always being able to find the atom that one wishes to work with. This avoids accidentally having two atoms, both of which say "the ball is round", and one has a value of true, while the other has a value of false, and truth value queries become schizoid.
  • It makes certain searches very fast, such as "get me all atoms of type X".
  • It provides a natural place where a backend can save/restore atoms to disk.

That's it. There's really nothing else.

Who uses these things? Well ...

  • The "get me all atoms of type X" is used in only one place: pattern matching.
  • The uniqueness of atoms is a virtue, except when its not: "the sky is blue" can be a unique atom, but we need to assign it two different values, based on whether its on Earth or on Mars.

A distributed AtomSpace

Making the AtomSpace less monolithic, and more like a kind-of membership link potentially makes it easier to create a distributed atomspace. It idea here is to use CRDTs - Conflict-free replicated data types, and specifically to leverage existing implementations in databases such as Redis or Riak.

The point is that Atom values are de-facto per-atom key-value datastores: for all intents and purposes, every atom has a key-value store attached to it. Different kinds of values have different merge policies: for example, values that count things need to be updated to hold the total count. Truth values have more subtle update semantics, depending on the truth value type.