This page sketches how distributed operation currently works in the OpenCog AtomSpace. That is, the existing AtomSpace is already distributed! The file examples/atomspace/distributed.scm contains an example/demo of distributed operation.
A general review of concepts and definitions can be found on the Distributed AtomSpace Architecture page. The distributed AtomSpace should not be confused with the Decentralized AtomSpace, which is a related, but different idea, with different design goals.
The current design embraces this key idea:
- The OpenCog project should not try to (re-)invent yet-another massively scalable, distributed ACID/BASE data and processing framework. There are already more of these than you can shake a stick at, each one more massively scalable, reliable, mature, cloud-enabled, industry-proven, widely deployed than the last. We should pick one (or two or three) and use that to provide AtomSpace services.
The AtomSpace provides a number of services, including the pattern matcher, that simply are not available on industry-standard graph databases. (No, Gremlin is not anywhere near as powerful as the pattern matcher.) Let's stick to what we're good at, and let the industry provide the lower layers.
Distribution should be "relatively easy to implement" because of the following design features in the AtomSpace:
- Atoms are immutable: the name of an atom cannot be changed, nor can its outgoing set be changed. Only the associated values, such as TruthValues and AttentionValues, can be changed. Thus, when one atom exists in two or more AtomSpaces, it is, in a sense, the "same" atom, but holding different values. Thus, the act of sharing atoms is really the act of sharing the values on it, and reconciling/merging them.
- Currently, the sharing and reconciliation strategy is left up to the user. The AtomSpace itself does not know enough to "automatically" distribute an atom. For example: an atom may be some short-term temporary work result that should not be shared. If there are millions of these being created and destroyed all the time, it would be unwise to try to automatically share them. Another example: since the only difference between a local and a remote atom is the attached values, how should these be reconciled? Should the remote version always "win"? The local version? Should they be somehow merged? For these reasons, the AtomSpace only provides tools for sharing, but it is up to the user themselves to do the actual sharing.
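The two points above can be sketched in a few lines of Python. This is not the AtomSpace API: the atom key, the value dictionaries, and the two merge policies are all hypothetical, and serve only to illustrate that an atom's identity is immutable while its values are mutable, and that the merge policy is the user's choice.

```python
# Sketch (NOT the AtomSpace API): an atom's identity (type + name) is
# immutable, while its values form a mutable key-value map. Sharing the
# "same" atom across two spaces means merging its values; the policy
# (remote-wins, local-wins, ...) is up to the user.

from dataclasses import dataclass

@dataclass(frozen=True)   # frozen: the identity cannot be mutated
class AtomKey:
    type: str
    name: str

def merge_remote_wins(local, remote):
    return dict(local, **remote)   # remote values overwrite local ones

def merge_local_wins(local, remote):
    return dict(remote, **local)   # local values overwrite remote ones

local_space = {AtomKey("ConceptNode", "cat"): {"strength": 0.8}}
remote_space = {AtomKey("ConceptNode", "cat"): {"strength": 0.3, "count": 42}}

key = AtomKey("ConceptNode", "cat")
merged = merge_remote_wins(local_space[key], remote_space[key])
print(merged)  # {'strength': 0.3, 'count': 42}
```

Note that both spaces hold the "same" atom (equal keys), yet carry different values; only the value maps need to be reconciled.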
The current design assumes a single, centralized server that has omniscient knowledge of all atoms that it has been given.
Alternative designs could make use of Conflict-free replicated data types (CRDTs), and specifically, their implementation in databases such as Redis or Riak or Apache Ignite. This makes sense in two ways. First, the values stored on an atom are de facto a per-atom key-value database. Second, the use of CRDTs could vastly simplify merge policy: for example, the merger of values that hold counts of things, and thus must always increment. (One of the important truth values does nothing but count upwards: it's ideal for a CRDT.)
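The count-only case maps directly onto the classic grow-only counter (G-counter) CRDT. The sketch below is a generic textbook G-counter, not anything from the AtomSpace or from Redis/Riak: each replica increments only its own slot, and merge takes the element-wise maximum, which makes merges commutative, associative, and idempotent, so no merge policy question ever arises.

```python
# Grow-only counter (G-counter) CRDT sketch. A replica increments only
# its own slot; merging takes the per-replica maximum. This mirrors a
# truth value that does nothing but count upwards.

def g_increment(counter, replica, n=1):
    c = dict(counter)
    c[replica] = c.get(replica, 0) + n
    return c

def g_merge(a, b):
    # Element-wise max: commutative, associative, idempotent.
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def g_value(counter):
    # The observed count is the sum over all replicas.
    return sum(counter.values())

a = g_increment({}, "node-a", 3)   # replica A counted 3 events
b = g_increment({}, "node-b", 5)   # replica B counted 5 events
merged = g_merge(a, b)
print(g_value(merged))  # 8
```

Because merging is idempotent, re-merging a stale copy (`g_merge(merged, a)`) changes nothing; this is exactly the property that removes the "who wins?" question for counting values.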
An older proposal for a different Distributed AtomSpace architecture is given here: File:Distributed AtomSpace Design Sketch v6.pdf. It is not clear if that proposal continues to be relevant (the AtomSpace is already distributed; that proposal does not address this fact).
Currently, atoms are located in one of two locations:
- Local (atoms are in CPU-attached local RAM.)
- Remote (atoms are in a network-attached production database.)
When working directly with atoms in C++, python or scheme, they are always, by definition, local. To work with an atom that exists on a remote server, it must first be pulled into local storage. This is done with the scheme methods fetch-incoming-set, fetch-atom, store-atom, load-atoms-of-type and barrier (corresponding to the C++ methods fetchIncomingSet(), fetchAtom(), storeAtom(), loadType() and barrier()).
The store-atom call pushes the atom out to a centralized server.
store-atom in the Postgres implementation is non-blocking, and, by default, uses four asynchronous write-back queues.
The fetch-atom call can only be used if you already know what the atom is. Thus, it is used to obtain the current Values, such as the TruthValue, attached to that atom. Thus, it is perhaps mis-named: a name indicating that it fetches the values on an atom would have been clearer.
The fetch-incoming-set call will recursively fetch all of the atoms in which this atom appears. Thus, if the database server has Links that contain this atom, all of those Links, as well as the Links that contain those Links, recursively, will be pulled into local storage.
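The recursion can be sketched as follows. This is not the backend's actual code: the remote store is modeled as a hypothetical map from each atom to the set of links that contain it, and the string names are made up for illustration.

```python
# Sketch of the fetch-incoming-set recursion: pull in every link that
# contains the atom, then every link containing those links, and so on.
# The "remote" store is modeled as atom -> links-that-contain-it.

incoming = {  # hypothetical remote index
    "cat": {"(ListLink cat mat)"},
    "(ListLink cat mat)": {"(EvaluationLink on (ListLink cat mat))"},
}

def fetch_incoming_recursive(atom, local=None):
    if local is None:
        local = set()
    for link in incoming.get(atom, ()):
        if link not in local:        # don't re-fetch what is already local
            local.add(link)
            fetch_incoming_recursive(link, local)
    return local

print(sorted(fetch_incoming_recursive("cat")))
```

The `link not in local` check matters: since links can share sub-links, an unguarded recursion would fetch the same atoms repeatedly (or loop forever on a cyclic index).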
The barrier call is a synchronization primitive: it can be used to guarantee that all of the current asynchronous store queues have been drained, before any fetches are performed.
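The interplay of non-blocking stores, write-back queues and the barrier can be sketched with standard threading primitives. This is an illustration of the pattern, not the Postgres backend's actual code; the queue count of four comes from the text above, but everything else (the hash-based queue choice, the in-memory "remote" dict) is an assumption.

```python
# Sketch: non-blocking store_atom enqueues work onto one of four
# asynchronous write-back queues; barrier blocks until all queues drain.

import queue
import threading

remote = {}                                        # stand-in for the server
write_queues = [queue.Queue() for _ in range(4)]   # four write-back queues

def writer(q):
    while True:
        atom, values = q.get()
        remote[atom] = values    # actually push the atom to the server
        q.task_done()

for q in write_queues:
    threading.Thread(target=writer, args=(q,), daemon=True).start()

def store_atom(atom, values):
    # Non-blocking: enqueue and return immediately.
    write_queues[hash(atom) % len(write_queues)].put((atom, values))

def barrier():
    # Wait until every pending store has been written out.
    for q in write_queues:
        q.join()

store_atom("cat", {"strength": 0.8})
store_atom("mat", {"strength": 0.5})
barrier()                 # without this, a fetch might miss pending stores
print(sorted(remote))     # ['cat', 'mat']
```

The design choice this illustrates: stores return immediately for throughput, so any read that must observe prior writes has to be preceded by an explicit barrier.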
The load-atoms-of-type call will instantiate all atoms of a given type in the local server. This is useful for bulk transfer of atoms. To be effective, the type must be judiciously chosen: for example, loading all WordNodes might be reasonable, as there are only some 100K words in the English language. Loading all EvaluationLinks is probably unreasonable, as there are surely too many of them to handle.
Currently, there exists a stable and functional backend to PostgreSQL. Overall scalability depends on two things: the scalability of PostgreSQL itself (which appears to be quite high, based on blog posts) and the scalability of the algorithm running on the AtomSpaces. The usual laws of scaling apply: "embarrassingly parallel" workloads scale easily, while strongly connected networks or networks with high centrality scale poorly.
For example, natural language processing will probably scale well. This is because natural language has a Zipfian distribution, and the networks it forms are scale-free networks. As long as the algorithm does not update the central, strongly connected nodes with high frequency, things will probably be OK.
Subgraph mining might scale poorly, because the subgraph-mining algorithm was never designed to be scalable. The pattern matcher currently searches only the local atomspace; it could be easily extended to search multiple atomspaces (by overriding the default getIncomingSet() callback.) Whether this is a wise idea depends on the desired algorithm.
An alternative to PostgreSQL might be Apache Ignite, primarily because it is C++-friendly. Another alternative is JanusGraph. Or maybe Redis or Riak. It is quite unclear whether these databases offer any advantage at all over PostgreSQL.
Other popular cloud SQL databases include MariaDB. Popular cloud NoSQL databases include Hypertable, HBase and Cassandra. Popular graph databases include Neo4J (see Neo4j Backing Store). Practical experience with Neo4J was that the serialization of Atoms added a huge overhead, leading to unacceptable performance.
A prototype backend for Hypertable was created for GSOC 2009. It never worked correctly, due to a design flaw/misunderstanding. The flaw should not be hard to fix. The last version is here. It has been removed from the main repo, because it got stale.
An excellent discussion of the issues involved in proper database design, and the relationship between SQL and NoSQL databases is given in the 2011 paper A co-Relational Model of Data for Large Shared Data Banks, by Erik Meijer and Gavin Bierman. This paper suggests a design point to ponder: should a backend store the outgoing set of an atom (and compute the incoming set on the fly), or should it store the incoming set of an atom (and compute the outgoing set on the fly)?
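The design point can be made concrete with a small sketch. This is not a proposed backend; the link names and table layout are invented for illustration. It shows one side of the duality: storing only the outgoing set of each link, and computing incoming sets on the fly by scanning (the mirror-image design would index incoming sets and reconstruct outgoing sets by scanning instead).

```python
# Sketch of the store-outgoing / compute-incoming design choice.
# Each link record stores only what it points at (its outgoing set);
# the incoming set of an atom is derived by a scan.

outgoing = {  # hypothetical table: link -> its outgoing set
    "ListLink-1": ("cat", "mat"),
    "EvaluationLink-1": ("on", "ListLink-1"),
}

def incoming_set(atom):
    # Computed on the fly: every link whose outgoing set contains the atom.
    return {link for link, out in outgoing.items() if atom in out}

print(incoming_set("cat"))         # {'ListLink-1'}
print(incoming_set("ListLink-1"))  # {'EvaluationLink-1'}
```

The trade-off is the usual one: whichever direction is stored is cheap to read, while the other direction costs a scan (or a secondary index), so the choice should follow the dominant access pattern; fetch-incoming-set, for instance, favors an incoming-set index.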
At this time, it is not clear what advantage, if any, these other databases might offer over PostgreSQL. This is because OpenCog does not currently have any algorithms or code that stretch the boundaries of distributed operation. So far, almost no one uses the distributed AtomSpace.