This is the key idea:
- The OpenCog project should not try to (re-)invent yet-another massively scalable, distributed ACID/BASE data and processing framework. There are already more of these than you can shake a stick at, each one more massively scalable, reliable, mature, cloud-enabled, industry-proven, widely deployed than the last. We should pick one (or two or three) and use that to provide AtomSpace services.
The AtomSpace provides a number of services, including the pattern matcher that simply are not available on industry-standard graph databases. (No, Gremlin is not anywhere near as powerful as the pattern matcher). Lets stick to what we're good at, and let the industry provide the lower layers.
Distribution should be "relatively easy to implement" because of the following design features in the AtomSpace:
- Atoms are immutable: the name of an atom cannot be changed, nor can it's outgoing set be changed. Only the associated values, such as TruthValues and AttentionValues, can be changed. Thus, when one atom exists in two or more Atomspaces, it is, in a sense, the "same" atom, but holding different values. Thus, the act of sharing atoms is really the act of sharing the values on it, and reconciling/merging them.
- Currently, the sharing and reconciliation strategy is left up to the user. The Atomspace itself does not know enough to "automatically" distribute an atom. For example: an atom may be some short-term temporary work result that should not be shared. If there are millions of these being created and destroyed all the time, it would be dumb to try to automatically share these. Another example: since the only difference between a local and a remote atom is the truth value, how should these be reconciled? Should the remote version always "win"? The local version? Should they be somehow merged? For these reasons, the AtomSpace only provides tools for sharing, but it is up to the user themselves to do the actual sharing.
The current design assumes a single, centralized server that has omniscient knowledge of all atoms that it has been given.
Alternative designs could make use of Conflict-free replicated data types (CRDTs), and specifically, their implementation in databases such as Redis or Riak or Apache Ignite. This makes sense in two ways. First, the values stored on an atom are defacto a per-atom key-value database. Second, the use of CRDT's could vastly simplify merge policy: for example, the merger of values that hold counts of things, and thus must always increment.
A hypothetical, conceptual design for a different sort of Distributed Atomspace is described here: File:Distributed AtomSpace Design Sketch v6.pdf. Its old and not obviously workable; it fails to leverage existing technologies.
Currently, atoms are located in one of two locations:
- Local (atoms are in CPU-attached local RAM)
- Remote (atoms are in a remote, network-attached location)
When working directly with atoms in C++, python or scheme, they are always, by defintion, local. To work with an atom that exists on a remote server, it must first be pulled into local storage. This is done with the scheme methods fetch-incoming-set, fetch-atom, store-atom, load-atoms-of-type and barrier (corresponding to the C++ methods fetchIncomingSet(), fetchAtom(), storeAtom(), loadType() and barrier()).
The store-atom call pushes the atom out to a centralized server.
store-atom in the Postgres implementation is non-blocking, and, by default, uses four asynchronous write-back queues.
fetch-atom can only be used if you already know what the atom is, or know what it's UUID is. Thus, it practical utility is only to obtain the TruthValue from the remote location. Thus, it is perhaps mis-named: it should be called
fetch-tv or something like that. In practice,
fetch-atom is not very useful, whereas
fetch-incoming-set is of central importance and utility.
The fetch-incoming-set will recursively fetch all of the atoms in which this atom appears. Thus, if remote servers have computed a graph that makes use of this atom, all of those graphs will be pulled into local storage.
The barrier is a synchronization primitve: it can be used to guaranteee that all of the current asynchoronous store queues have been drained, before any fetches are performed.
The load-atoms-of-type will instantiate all atoms of a given type in the local server. This is useful for bulk transfer of atoms. To be effective, the type must be judiciously choosen: For example, loading all WordNodes might be reasonable, as there are only some 100K words in the English langauge. Loading all EvaluationLinks is probably unreasonable, as there are surely too many of them to handle.
Currently, there exists a stable and functional backend to PostgreSQL. This backend solution probably scales "easily" to a few dozen opencog servers, although this limit has never been pushed, so the true limit is unknown. It might scale to hundreds, although that will surely require hard work. Actual performance and scalability will depend strongly on the workload. For example, NLP processcessing will probably scale well. Subgraph mining might scale poorly, because the subgraph algo was never designed to be scalable. The pattern matcher currently searches only the local atomspace; it could be esily extended to search multiple atomspaces (by over-riding the default getIncomingSet() callback.)
An excellent discussion of the issues involved in proper database design, and the relationship between SQL and NoSQL dtabases is given in the 2011 paper A co-Relational Model of Data for Large Shared Data Banks, byErik Meijer and Gavin Bierman. This paper suggests a design point to ponder: should a backend store the outgoing set of an atom (and compute the incoming set on the fly), or should it store the incoming set of an atom (and compute the outgoing set on the fly)?
A prototype backend for Hypertable was created for GSOC 2009. It never worked correctly, due to a design flaw/misunderstanding -- the incoming set must not be stored as a table; it must be computed from the outgoing set. The flaw should not be hard to fix. The code is here.
At this time, it is not clear what advantage, if any, these other databases might offer to OpenCog. This is because OpenCog does not currently have any algorithms or code that strech the boundaries of distributed operation. So far, almost no one uses distributed OpenCog.
A prototype exists for a Neo4j Backing Store.