Distributed AtomSpace Architecture

From OpenCog

OpenCog must be able to function in a distributed environment, with many OpenCog applications and MindAgents working collaboratively, sharing Atoms with one another, and storing them in one or more repositories. This page provides a general review of some of the considerations for developing an architecture that allows such sharing and storage. The current, implemented architecture is described in the Distributed AtomSpace page.

Ideas about making aspects of OpenCog better leverage multiprocessing can be found here: Parallelizing OpenCog

Regarding running an OpenCog that spans multiple machines, there are several points to be kept in mind:

  • By definition, Atoms live in an AtomSpace, so the discussion below is really about communications between distributed AtomSpaces.
  • By definition, Atoms are meant to be globally unique. Knowing the name or the outgoing set of an atom is sufficient to uniquely identify that atom. There cannot (must not) be two atoms having the same name, although it is up to software and coding practices to implement this guarantee. See, however, the discussion marked "Re-imagining the AtomSpace" on the AtomSpace page, as it subtly alters the need for the assumption of uniqueness.
  • By definition, the identity of an Atom is immutable: its name or outgoing set cannot be changed. Only the Values on the atom, such as the TruthValue (TV) or the AttentionValue (AV), can be changed. Thus, the notion of "sharing an atom" is really about identifying which atom it is, and then sharing the values attached to it.
  • Because a copy of an Atom may exist in multiple different locations on the network, and because the values on it are mutable, it is inevitable that different copies of an Atom will have different values associated with them. Thus, the primary task of a distributed atomspace architecture is to move those values around, to give different applications a choice as to how to merge, replace or supersede those values, and to let them choose how to broadcast and communicate those value changes.
  • Some value updates need to be performed in a coherent fashion: for example, counts must always be incremented properly, and increments must be merged correctly. Merge strategies that don't wreck count semantics can be achieved with conflict-free replicated data types (CRDTs). It would be advantageous to leverage an existing system that already implements CRDTs, such as Redis or Riak. These make a natural fit, since the values on an atom are a de facto per-atom key-value store, and so map directly onto a key-value database.
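To illustrate the point about count semantics, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. It is not OpenCog code: the class name `GCounter` and the replica identifiers are invented for this example, and a real deployment would more likely rely on the CRDT support of an existing store. Each AtomSpace increments only its own slot, and merging takes the per-replica maximum, so merging the same update twice cannot double-count.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per replica."""
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, replica_id: str, amount: int = 1) -> None:
        # A replica may only ever increase its own slot.
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def value(self) -> int:
        # The observed count is the sum over all replicas.
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> "GCounter":
        # Per-replica max is commutative, associative and idempotent,
        # so copies converge regardless of merge order or repetition.
        merged = dict(self.counts)
        for replica_id, n in other.counts.items():
            merged[replica_id] = max(merged.get(replica_id, 0), n)
        return GCounter(merged)


# Two AtomSpaces hold a copy of the same atom and count independently.
a = GCounter()
b = GCounter()
a.increment("atomspace-A", 3)
b.increment("atomspace-B", 2)
merged = a.merge(b)
```

A naive "last writer wins" merge would have kept either 3 or 2 here; the per-replica slots are what preserve the total.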

Another important undercurrent of the architecture is its performance and scalability. Of course, no one wants an architecture that runs poorly, and everyone wants an architecture that scales to a million or even a billion sites. Realistically, though, in order to correctly design a high-performance, scalable distributed AtomSpace, a prototypical application is needed, together with a performance and scaling analysis of that application. Without such an analysis, it becomes hard or impossible to understand choke-points and bottlenecks, and thus, impossible to recommend a design that avoids these choke-points.

As of this writing (early 2015), we do not have such a prototypical application, nor its analysis. Here's what we do have:

  • Some experimental language learning and morphology learning code. This currently works well ("at speed") with 6 cogserver instances running on three multi-core CPUs, each with 4 to 16GB RAM. One of the CPUs holds a Postgres database, which currently stores millions of word-pairs and other language data (millions of atoms).
  • An early implementation of PLN. This has never been run to the point where it exhausts all of the CPU cycles and RAM of a single CPU. That is, it has not yet been tuned to the point where it runs extremely well on modern-day CPUs (12 or 16-core, 32GB or 64GB RAM). Thus, its bottlenecks remain unknown, and it is not clear how to scale it up to a distributed system.
  • The pattern matcher currently traverses only the local atomspace. It is straightforward to extend its search to multiple atomspaces (by fetching the incoming sets of the constants in the query). However, no one has requested this capability yet, and no one has tried it.
  • Work on placing biological genetic and proteomic data into the AtomSpace is progressing. This work has not yet run up against the limits of current-day CPUs.

Thus, we don't really know what issues are going to bite us in the ass, not just yet. We can speculate and argue, but such arguments will be rendered moot by the reality of actually having to make something that works. And to make something that works, we will need to understand what the choke-points and the bottlenecks of these applications will be. And right now, we just don't know.

Worse: we don't know which bottlenecks might be easy to overcome, by minor algorithmic changes, and which bottlenecks appear to be deep and fundamental, requiring a major re-think of the system. So: it's easy to imagine monsters lurking in the shadows, but many can be dispatched with small, easy changes, rather than expensive, agonizing, total redesigns. Given the lack of data, it is not wise to be either optimistic or pessimistic.

That said, this proposal from Ben tries to go further, and look into various issues and problems not addressed below.

Core Assumptions

I) The core opencog applications and systems (PLN, etc) act on Atoms kept in cpu-local RAM, which are managed by AtomSpace. The core systems cannot/do not directly access Atoms from some remote location, but can work with AtomSpace to get a local copy instantiated.

II) There should be a persistent repository for atoms. This repository can be shared among dozens or thousands or millions of AtomSpace instances. This repository does not need to be instantaneously consistent with all AtomSpaces, but it should be in an eventually-consistent state. OpenCog instances may use this repository to communicate with one another. Let's call this the Repository. There need not be just one: an AtomSpace may connect to multiple repositories: e.g. some may contain linguistic data, others may contain spatial data. Some may combine spatial and linguistic data, but may consist entirely of very old memories.

III) There is a device-driver-like software shim between AtomSpace and the Repository. Let's call this the driver. It's the code that knows how to actually communicate with a repository, using its native API, whatever that may be (the C postgres API for PostgreSQL, the C JNI calls for Neo4j, etc).

IV) There is an API between AtomSpace and driver. Let's call this the backing store API. This is so that the AtomSpace can work with any and all possible remote AtomSpaces and repositories using just one well-known, documented API. There is no limitation on the API, but the simpler, the better. Note: ONLY the AtomSpace is allowed to use the backing-store API. Nothing else is allowed to touch this API. This is not a directive, mandate or requirement, but a definition: if something else is invoking some method in the API, then, by definition, that method is a part of the AtomSpace API, not the "backing store API". By definition, the backing store exists only as an internal layer within the atomspace.
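As a rough illustration of the layering in III) and IV), the sketch below shows one possible shape for a backing store API with a single driver behind it. Every name here (`BackingStore`, `InMemoryDriver`, `fetch_values`, `store_values`, `remove`) is hypothetical and chosen for this example; the actual API is an open question on this page. The only point is that the AtomSpace talks to an abstract interface, and each driver translates that interface into a repository's native calls.

```python
from abc import ABC, abstractmethod
from typing import Dict, Hashable, Optional

# An atom's identity, e.g. ("ConceptNode", "cat") for a node.
AtomKey = Hashable


class BackingStore(ABC):
    """Hypothetical backing store API: the only thing the AtomSpace sees."""

    @abstractmethod
    def fetch_values(self, key: AtomKey) -> Optional[dict]:
        """Return the values stored for this atom, or None if unknown."""

    @abstractmethod
    def store_values(self, key: AtomKey, values: dict) -> None:
        """Create or overwrite the stored values for this atom."""

    @abstractmethod
    def remove(self, key: AtomKey) -> None:
        """Remove the atom from the repository, if present."""


class InMemoryDriver(BackingStore):
    """Stand-in driver; a real one would wrap a database's native API."""

    def __init__(self) -> None:
        self._rows: Dict[AtomKey, dict] = {}

    def fetch_values(self, key: AtomKey) -> Optional[dict]:
        row = self._rows.get(key)
        return dict(row) if row is not None else None

    def store_values(self, key: AtomKey, values: dict) -> None:
        self._rows[key] = dict(values)

    def remove(self, key: AtomKey) -> None:
        self._rows.pop(key, None)


driver: BackingStore = InMemoryDriver()
cat = ("ConceptNode", "cat")
driver.store_values(cat, {"count": 7})
fetched = driver.fetch_values(cat)
```

Swapping in a driver for a different repository would not change any code written against `BackingStore`, which is the property the "one well-known, documented API" is meant to provide.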

V) There is an API between AtomSpace and user applications, such as MindAgents. Let's call this the AtomSpace API. This API must allow applications to have fairly direct control over how atoms are created, deleted and shared between different AtomSpace instances, and the various repositories. That is, direct control over sharing, creation and deletion is ceded to the actual OpenCog application or MindAgent. The reason for this is that automation may be difficult: some applications might create large amounts of "temporary" atoms (such as raw sensory inputs, or fleeting motor control commands) which should not be shared. Some applications (such as unsupervised language learning) desire explicit control over what is stored. Other applications (such as PLN) desire automated management, using attention allocation. Different agents may implement different kinds of attention allocation algorithms, and thus make different decisions about which atoms to share, and which atoms to delete. These decisions happen above the AtomSpace API; the AtomSpace API only provides a mechanism, it does not implement policy.

Details

II.a) The repository is necessarily a network server. It might be, but need not be, another opencog server instance. It might be a traditional "database". It may be any SQL or NoSQL database, such as Neo4j. It should not be a single-user disk volume.

II.b) A single-user, local-disk-volume storage backend could/would be useful, but I don't yet understand how this should layer hierarchically with the shared-server of II.a above.

II.c) The repository provides some amount of "smarts" for queries, e.g. possibly map-reduce-like. Just how much smarts is open for debate. I've been assuming a small/moderate amount of smarts, i.e. enough to keep a few indexes, maybe perform a few basic joins, but this may be a bad assumption, and we should consider a repository capable of arbitrarily complex queries.

II.d) The cost of accessing the repository is "high": it requires network access, as opposed to local RAM access. Accessing an atom over the network will always be three orders of magnitude slower than accessing an atom in local RAM. Thus, although the repository may maintain indexes of various sorts, the cost of accessing those indexes is also high.

II.e) The repository need not be a "single" server; it may be P2P-like or have whatever architecture is needed to provide scalability so as to allow for simultaneous use by millions of opencog instances.

III.a) The driver might contain indexes; these indexes are "by definition" in RAM, and are low-cost to query.

I.a) The AtomTable already contains 6 different indexes.

I.b) There already exists a "completely general" query mechanism for obtaining atoms that are in the AtomSpace. This is the Pattern Matcher. By default, the pattern matcher only traverses the local atomspace, but it is (relatively) straightforward to search a distributed atomspace by implementing the PatternMatchCallback::getIncomingSet() so that it traverses remote atomspaces.

Issues

VI) What's the backing-store API?

VII) Should the AtomSpace API change? How?

VII.a) What kind of queries are supported by the AtomSpace API?

VII.b) What sort of indexes should the AtomTable keep?

VII.c) Should there be user-definable indexes kept by the AtomTable?

VII.d) What's the API for these user-definable indexes?

VII.e) Should there be locking primitives in the API?

VI.a--VI.e) same as above, but for the BackingStore API.

VIII) Should there be locks in the repository? What would their semantics be?

IX) How do different AtomSpace clients (i.e. those on remote machines) maintain (eventual) synchronization with each other?

X) Should attention values be stored in the repository? The answer seems to be "no" if the repository is used only for storage for currently-running MindAgents. However, the answer would need to be "yes" if there is a desire to take a snapshot of a running MindAgent, and restart the snapshot later.

Requirements

CI) A change-notification system is needed. OpenCog instances must be able to receive events from the repository, indicating atom creation, request-for-deletion, changes in importance, etc. Events may be generated from the repository itself, or by other opencog instances. (It is not clear how urgent or important this requirement really is -- who is depending on this? What algorithm requires this? Can that algorithm be designed in a different way?).

CI.a) These events should be configurable, so as not to generate a flood of events delivered to one opencog instance as the total number of instances connected to the repository grows.
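One way to read the configurability requirement in CI.a is as subscription-side filtering: each instance declares which kinds of events, on which atom types, it wants to hear about, and everything else is dropped before delivery. The sketch below assumes this interpretation. The names `AtomEvent`, `Subscription` and `EventHub` and the event-kind strings are all invented for illustration, and the hub runs in a single process rather than over a network.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class AtomEvent:
    kind: str        # e.g. "created", "delete-requested", "importance-changed"
    atom_type: str   # e.g. "WordNode"
    atom_name: str


@dataclass
class Subscription:
    kinds: Set[str]
    atom_types: Set[str]
    callback: Callable[[AtomEvent], None]

    def wants(self, event: AtomEvent) -> bool:
        # Deliver only events matching both the kind and the atom type.
        return event.kind in self.kinds and event.atom_type in self.atom_types


class EventHub:
    """Stand-in for the repository's notification side."""

    def __init__(self) -> None:
        self._subs: List[Subscription] = []

    def subscribe(self, sub: Subscription) -> None:
        self._subs.append(sub)

    def publish(self, event: AtomEvent) -> int:
        # Filter at the hub, so uninterested instances receive nothing.
        delivered = 0
        for sub in self._subs:
            if sub.wants(event):
                sub.callback(event)
                delivered += 1
        return delivered


received: List[AtomEvent] = []
hub = EventHub()
hub.subscribe(Subscription({"created"}, {"WordNode"}, received.append))

hub.publish(AtomEvent("created", "WordNode", "cat"))
hub.publish(AtomEvent("importance-changed", "WordNode", "cat"))
hub.publish(AtomEvent("created", "ConceptNode", "animal"))
```

Only the first of the three published events reaches the subscriber; the number of events an instance sees then depends on its own filter rather than on how many other instances are connected.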

CII) Just because an Atom has been created within the AtomSpace does not mean that it must be pushed out to the repository, i.e. there is no requirement to keep the repository strictly in sync with the local AtomSpace. This is because many OpenCog applications and MindAgents will be creating large quantities of "temporary" or "scratch" atoms that do not need to be saved: for example, atoms that represent fleeting, raw sensory input, or atoms that represent fleeting motor control commands.

Commentary

Forgetting versus Dropping

Ben mentioned "forgetting" and then talked about either removing the atom completely or pushing it out of the local AtomSpace and into the Repository. These are distinct problems. The current AtomSpace API provides two distinct methods for this: cog-purge and cog-delete. The 'purge' version removes the atom from the local AtomSpace; the 'delete' version removes it from the remote repository, as well as the local AtomSpace.
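The difference between the two operations can be shown with a toy model in which the local AtomSpace and the remote repository are each a plain dictionary. This is only a sketch of the semantics described above, not the real implementation: the functions `store`, `fetch`, `purge` and `delete` are stand-ins written for this example, with the last two named after cog-purge and cog-delete.

```python
from typing import Dict, Tuple

AtomKey = Tuple[str, str]            # (type, name)

local_atomspace: Dict[AtomKey, dict] = {}
repository: Dict[AtomKey, dict] = {}


def store(key: AtomKey) -> None:
    """Copy an atom's values from the local AtomSpace to the repository."""
    repository[key] = dict(local_atomspace[key])


def fetch(key: AtomKey) -> bool:
    """Bring an atom back into the local AtomSpace; False if it is gone."""
    if key not in repository:
        return False
    local_atomspace[key] = dict(repository[key])
    return True


def purge(key: AtomKey) -> None:
    """Drop the atom locally only; the repository copy survives."""
    local_atomspace.pop(key, None)


def delete(key: AtomKey) -> None:
    """Remove the atom both locally and from the repository."""
    local_atomspace.pop(key, None)
    repository.pop(key, None)


cat = ("ConceptNode", "cat")
local_atomspace[cat] = {"count": 7}
store(cat)

purge(cat)                              # frees local RAM
recovered_after_purge = fetch(cat)      # the atom can still be recovered

delete(cat)                             # the atom is truly forgotten
recovered_after_delete = fetch(cat)
```

Purging is therefore reversible at the cost of a repository round trip, while deleting is not.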

Which of these two is applied, and when, is meant to be under the control of a Forgetting Agent: one or more algorithms which decide which atoms can be deleted, based on assorted criteria, such as importance. See Attention Allocation for additional remarks.

Attention Allocation Heuristics

Attention allocation isn't just for memory management and forgetting/dropping atoms.

  • high STI and low LTI atoms should probably not be stored to the Repository unless/until their LTI increases.
  • low STI atoms should probably be dropped from local instances.
  • old and very low LTI atoms should probably be forgotten from the Repository.
  • since STI only really makes sense in terms of in-memory atoms, perhaps the Repository shouldn't store this information.
  • Does it make sense to store LTI in the Repository?
  • Atoms freshly retrieved from the Repository would be given 0 STI. Huh? Why? Because retrieval is expensive, some mind-agent must have decided it was worth the effort to retrieve them. This implies that they would have a high STI, else, there's no need to retrieve!
  • Atoms with VLTI (very long term importance) should immediately be put in the Repository.
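The heuristics above can be written down as a small policy function sitting above the AtomSpace API. The sketch below is one possible reading, not an agreed design: the numeric thresholds, the age limit, the `Importance` record and the three predicate names are all made up for illustration, and a real Forgetting Agent would tune or replace them.

```python
from dataclasses import dataclass


@dataclass
class Importance:
    sti: float            # short-term importance
    lti: float            # long-term importance
    vlti: bool = False    # very-long-term importance flag


# Hypothetical thresholds, chosen only to make the example concrete.
STI_KEEP_LOCAL = 0.2
LTI_STORE = 0.5
LTI_FORGET = 0.05
MAX_AGE_DAYS = 365


def should_store(av: Importance) -> bool:
    """VLTI atoms are stored at once; others only once LTI is high enough."""
    return av.vlti or av.lti >= LTI_STORE


def should_drop_local(av: Importance) -> bool:
    """Low-STI atoms are candidates for removal from the local instance."""
    return av.sti < STI_KEEP_LOCAL


def should_forget(av: Importance, age_days: int) -> bool:
    """Old, very-low-LTI atoms may be forgotten from the Repository."""
    return (not av.vlti) and av.lti < LTI_FORGET and age_days > MAX_AGE_DAYS


hot_but_fleeting = Importance(sti=0.9, lti=0.1)
stale = Importance(sti=0.0, lti=0.01)
pinned = Importance(sti=0.0, lti=0.0, vlti=True)
```

With these settings, `hot_but_fleeting` stays local and unstored, `stale` is dropped locally and, once old enough, forgotten from the Repository, and `pinned` is stored immediately and never forgotten.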

These are basic guides, but more intelligent forgetting and dropping of atoms will eventually be implemented (given two low LTI atoms, the one that is hardest to recreate, or has more knowledge value, should be forgotten after the other).

(Comment from Ben.) Or if there are 100 medium-LTI Atoms, but these can be quickly and somewhat accurately generated from 20 of them, it may be best to delete the other 80, even though they're nowhere near the lowest-LTI ones in the AtomTable.