Distributed AtomSpace Architecture (Obsolete)


This page is obsolete, and only of historical interest. It describes the situation as it was circa 2015. Many of the ideas described here have been implemented. The way they are described here is awkward, and doesn't really match the current AtomSpace implementation.

Please refer to the wiki page on networked AtomSpaces for a definition of basic terms. Much of the confusion below is due to a lack of a clear distinction between different networking concepts.

OpenCog must be able to function in a distributed environment, with many OpenCog applications and MindAgents working collaboratively, sharing Atoms with one another, and storing them in one or more repositories. This page provides a general review of some of the considerations for developing an architecture that further extends the existing abilities to perform distributed processing. The current, implemented system is described in the Distributed AtomSpace page.

Ideas about making aspects of OpenCog better leverage multiprocessing can be found here: Parallelizing OpenCog.

Core Assumptions

I) The core opencog applications and systems (PLN, etc) act on Atoms kept in cpu-local RAM, which are managed by AtomSpace. The core systems cannot/do not directly access Atoms from some remote location, but can work with AtomSpace to get a local copy instantiated.

II) There should be a persistent repository for atoms. This repository can be shared among dozens or thousands or millions of AtomSpace instances. This repository does not need to be instantaneously consistent with all AtomSpaces, but it should be in an eventually-consistent state. OpenCog instances may use this repository to communicate with one another. Let's call this the Repository. There need not be just one: an AtomSpace may connect to multiple repositories: e.g. some may contain linguistic data, others may contain spatial data. Some may combine spatial and linguistic data, but may consist entirely of very old memories.

III) There is a device-driver-like software shim between AtomSpace and the Repository. Let's call this the driver. It's the code that knows how to actually communicate with a repository, using its native API, whatever that may be (the C postgres API for PostgreSQL, the C JNI calls for Neo4J, etc).

IV) There is an API between AtomSpace and driver. Let's call this the backing store API. This is so that the AtomSpace can work with any and all possible remote AtomSpaces and repositories using just one well-known, documented API. There is no limitation on the API, but the simpler, the better. Note: ONLY the AtomSpace is allowed to use the backing-store API. Nothing else is allowed to touch this API. This is not a directive, mandate or requirement, but a definition: if something else is invoking some method in the API, then that method is a part of the AtomSpace API, not the "backing store API". By definition, the backing store exists only as an internal layer within the AtomSpace.
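
The layering in III and IV can be sketched as follows. This is only an illustration under stated assumptions: the class and method names (BackingStore, store_atom, fetch_atom, remove_atom, InMemoryStore) are hypothetical and are not the actual OpenCog API, and atoms are reduced to a (type, name) key with a single number standing in for a truth value.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple

# Hypothetical, simplified atom identity: (type, name).
AtomKey = Tuple[str, str]


class BackingStore(ABC):
    """Hypothetical backing store API: the only thing a driver must implement."""

    @abstractmethod
    def store_atom(self, key: AtomKey, tv: float) -> None: ...

    @abstractmethod
    def fetch_atom(self, key: AtomKey) -> Optional[float]: ...

    @abstractmethod
    def remove_atom(self, key: AtomKey) -> None: ...


class InMemoryStore(BackingStore):
    """Stand-in driver; a real one would wrap PostgreSQL, Neo4j, etc."""

    def __init__(self) -> None:
        self._rows: Dict[AtomKey, float] = {}

    def store_atom(self, key: AtomKey, tv: float) -> None:
        self._rows[key] = tv

    def fetch_atom(self, key: AtomKey) -> Optional[float]:
        return self._rows.get(key)

    def remove_atom(self, key: AtomKey) -> None:
        self._rows.pop(key, None)


class AtomSpace:
    """Only this class calls the backing store; applications call AtomSpace."""

    def __init__(self, backend: BackingStore) -> None:
        self._backend = backend
        self._local: Dict[AtomKey, float] = {}

    def add(self, key: AtomKey, tv: float = 1.0) -> None:
        self._local[key] = tv  # local RAM only, nothing is sent anywhere

    def store(self, key: AtomKey) -> None:
        self._backend.store_atom(key, self._local[key])

    def fetch(self, key: AtomKey) -> Optional[float]:
        tv = self._backend.fetch_atom(key)
        if tv is not None:
            self._local[key] = tv  # instantiate a local copy
        return tv

    def __contains__(self, key: AtomKey) -> bool:
        return key in self._local
```

In this sketch two AtomSpace objects built on the same InMemoryStore see each other's atoms only after an explicit store followed by a fetch, which matches the assumption in I that core systems work on local copies.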

V) There is an API between AtomSpace and user applications, such as MindAgents. Let's call this the AtomSpace API. This API must allow applications to have fairly direct control over how atoms are created, deleted and shared between different AtomSpace instances, and the various repositories. That is, direct control over sharing, creation and deletion is ceded to the actual OpenCog application or MindAgent. The reason for this is that automation may be difficult: some applications might create large amounts of "temporary" atoms (such as raw sensory inputs, or fleeting motor control commands) which should not be shared. Some applications (such as unsupervised language learning) desire explicit control over what is stored. Other applications (such as PLN) desire automated management, using attention allocation. Different agents may implement different kinds of attention allocation algorithms, and thus make different decisions about which atoms to share, and which atoms to delete. These decisions happen above the AtomSpace API; the AtomSpace API only provides a mechanism, it does not implement policy.
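
The split between mechanism and policy might look roughly like the sketch below. Everything here is a hypothetical stand-in: SketchAtomSpace, share, the "scratch" and "lti" tags, and the threshold of 10 are illustrative assumptions, not part of any real AtomSpace API.

```python
from typing import Callable, Dict, Set


class SketchAtomSpace:
    """Hypothetical stand-in: local atoms, plus a set standing in for a repository."""

    def __init__(self) -> None:
        self.local: Dict[str, dict] = {}
        self.repository: Set[str] = set()

    def add(self, name: str, **tags) -> None:
        self.local[name] = tags

    def store(self, name: str) -> None:
        # Mechanism only: push one atom when told to. No decision is made here.
        self.repository.add(name)


def share(space: SketchAtomSpace, policy: Callable[[str, dict], bool]) -> int:
    """Apply a caller-supplied policy; the policy lives above the API."""
    count = 0
    for name, tags in space.local.items():
        if policy(name, tags):
            space.store(name)
            count += 1
    return count


def skip_scratch(name: str, tags: dict) -> bool:
    """Policy for an application that marks its own temporary atoms."""
    return not tags.get("scratch", False)


def by_importance(name: str, tags: dict) -> bool:
    """Policy for an application that relies on attention allocation."""
    return tags.get("lti", 0) >= 10  # the threshold is an arbitrary assumption
```

Two applications can run different policies against the same store call, which is the sense in which the API provides a mechanism without implementing policy.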

Details

II.a) The repository is necessarily a network server. It might be, but need not be, another opencog server instance. It might be a traditional "database". It may be any SQL or NoSQL database, such as Neo4j. It should not be a single-user disk volume.

II.b) A single-user, local-disk-volume storage backend could/would be useful, but I don't yet understand how this should layer hierarchically with the shared-server of II.a above.

II.c) The repository provides some amount of "smarts" for queries, e.g. possibly map-reduce-like. Just how much smarts is open for debate. I've been assuming a small/moderate amount of smarts, i.e. enough to keep a few indexes, maybe perform a few basic joins, but this may be a bad assumption, and we should consider a repository capable of arbitrarily complex queries.

II.d) Cost of accessing repository is "high"-- i.e. network access, as opposed to local RAM access. Accessing an atom over the network will always be three orders of magnitude slower than accessing an atom in local RAM. Thus, although the repository may maintain indexes of various sorts, the cost of accessing those indexes is also high.
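
To make the cost gap concrete, here is a small worked calculation. The absolute figures (100 ns for a local RAM access, 100,000 ns for a network round trip) are assumptions chosen only to match the three-orders-of-magnitude ratio stated above; real numbers depend on the hardware and network.

```python
def mean_access_ns(hit_rate: float,
                   ram_ns: float = 100.0,
                   network_ns: float = 100_000.0) -> float:
    """Average cost per atom access when a fraction hit_rate is served locally.

    ram_ns and network_ns are assumed figures, not measurements.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * ram_ns + (1.0 - hit_rate) * network_ns


if __name__ == "__main__":
    for rate in (1.0, 0.99, 0.9, 0.0):
        print(f"local hit rate {rate:4.2f}: {mean_access_ns(rate):10.1f} ns")
```

Under these assumed figures, even a 99% local hit rate leaves the average access roughly eleven times slower than pure RAM access, because the rare network fetches dominate the total.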

II.e) The repository need not be a "single" server; it may be P2P-like or have whatever architecture is needed to provide scalability so as to allow for simultaneous use by millions of opencog instances.

III.a) The driver might contain indexes; these indexes are "by definition" in RAM, and are low-cost to query.

I.a) The AtomTable already contains two different indexes.

I.b) There already exists a "completely general" query mechanism for obtaining atoms that are in the AtomSpace. This is the Pattern Matcher. By default, the pattern matcher only traverses the local atomspace, but it is (relatively) straightforward to search a distributed atomspace by implementing the PatternMatchCallback::getIncomingSet() so that it traverses remote atomspaces.
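
The idea in I.b can be sketched in Python. This is an analogue for illustration only: the real PatternMatchCallback is a C++ class, and the names LocalCallback, DistributedCallback and links_containing below are hypothetical. Atoms and links are plain strings, and each remote atomspace is modelled as a dictionary, so each remote lookup stands in for a network round trip.

```python
from typing import Dict, List, Set

IncomingIndex = Dict[str, Set[str]]  # atom -> links that contain it


class LocalCallback:
    """Default behaviour: only the local atomspace is consulted."""

    def __init__(self, incoming: IncomingIndex) -> None:
        self._incoming = incoming

    def get_incoming_set(self, atom: str) -> Set[str]:
        return set(self._incoming.get(atom, ()))


class DistributedCallback(LocalCallback):
    """Override that also asks remote atomspaces for incoming links."""

    def __init__(self, incoming: IncomingIndex,
                 remotes: List[IncomingIndex]) -> None:
        super().__init__(incoming)
        self._remotes = remotes

    def get_incoming_set(self, atom: str) -> Set[str]:
        result = super().get_incoming_set(atom)
        for remote in self._remotes:
            # In a real system this line would be a network request.
            result |= remote.get(atom, set())
        return result


def links_containing(atom: str, callback: LocalCallback) -> List[str]:
    """Tiny stand-in for the traversal step of a pattern search."""
    return sorted(callback.get_incoming_set(atom))
```

The search code itself does not change between the two cases; only the callback does, which is why the extension is described as relatively straightforward.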

Issues

VI) What should the backing-store API be?

VII) Should the AtomSpace API change? How?

VII.a) What kind of queries are supported by the AtomSpace API?

VII.b) What sort of indexes should the AtomTable keep?

VII.c) Should there be user-definable indexes kept by the AtomTable?

VII.d) What's the API for these user-definable indexes?

VII.e) Should there be locking primitives in the API?

VI.a--VI.e) same as above, but for the BackingStore API.

VIII) Should there be locks in the repository? What would their semantics be?

IX) How do different AtomSpace clients (i.e. those on remote machines) maintain (eventual) synchronization with each other?

Requirements

CI) A change-notification system is needed. OpenCog instances must be able to receive events from the repository, indicating atom creation, request-for-deletion, changes in importance, etc. Events may be generated from the repository itself, or by other opencog instances. (It is not clear how urgent or important this requirement really is -- who is depending on this? What algorithm requires this? Can that algorithm be designed in a different way?).

CI.a) These events should be configurable, so as not to generate a flood of events delivered to one opencog instance as the total number of instances connected to the repository grows.
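
One way to read CI and CI.a is as a publish/subscribe service with per-subscriber filters. The sketch below is an assumption about how that could look, not a description of any existing OpenCog component; ChangeNotifier, its methods, and the event names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

Handler = Callable[[str, str], None]   # (event_type, atom) -> None
AtomFilter = Callable[[str], bool]     # atom -> deliver?


class ChangeNotifier:
    """Hypothetical event hub sitting at the repository."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Tuple[AtomFilter, Handler]]] = \
            defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler,
                  atom_filter: AtomFilter = lambda atom: True) -> None:
        """Register interest in one event type, narrowed by an optional filter."""
        self._subs[event_type].append((atom_filter, handler))

    def publish(self, event_type: str, atom: str) -> int:
        """Deliver an event to matching subscribers; return how many got it."""
        delivered = 0
        for atom_filter, handler in self._subs.get(event_type, ()):
            if atom_filter(atom):
                handler(event_type, atom)
                delivered += 1
        return delivered
```

Because each instance only receives the event types and atoms it asked for, the number of events delivered to one instance does not have to grow with the total number of connected instances.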

CII) Just because an Atom has been created within the AtomSpace, does not mean that it must be pushed out to the repository. That is, there is no requirement to keep the repository strictly in sync with the local AtomSpace. This is because many OpenCog applications and MindAgents will be creating large quantities of "temporary" or "scratch" atoms that do not need to be saved -- for example, atoms that represent fleeting, raw sensory input, or atoms that represent fleeting motor control commands.

Commentary

Forgetting versus Dropping

Ben mentioned "forgetting" and then talked about either removing the atom completely or pushing it out of the local AtomSpace and into the Repository. These are distinct problems. The current AtomSpace API provides two distinct methods for this: cog-purge and cog-delete. The 'purge' version removes the atom from the local AtomSpace; the 'delete' version removes it from the remote repository, as well as the local AtomSpace.
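
The difference between the two operations can be modelled in a few lines. The sketch below only mirrors the semantics described above for cog-purge and cog-delete; the TwoLevelStore class and its method names are hypothetical, and the repository is reduced to a plain set.

```python
class TwoLevelStore:
    """Hypothetical model: one local AtomSpace in front of one repository."""

    def __init__(self) -> None:
        self.local = set()
        self.repository = set()

    def add_and_store(self, atom: str) -> None:
        """Create the atom locally and push a copy to the repository."""
        self.local.add(atom)
        self.repository.add(atom)

    def fetch(self, atom: str) -> bool:
        """Bring an atom back into local RAM if the repository still has it."""
        if atom in self.repository:
            self.local.add(atom)
            return True
        return False

    def purge(self, atom: str) -> None:
        """Dropping: remove from local RAM only; the repository keeps it."""
        self.local.discard(atom)

    def delete(self, atom: str) -> None:
        """Forgetting: remove from both local RAM and the repository."""
        self.local.discard(atom)
        self.repository.discard(atom)
```

A purged atom can be fetched again later, while a deleted atom is gone for every instance that relies on that repository.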

Which of these two is applied, and when, is meant to be under the control of a Forgetting Agent: one or more algorithms which decide which atoms can be deleted, based on assorted criteria, such as importance. See Attention Allocation for additional remarks.

Attention Allocation Heuristics

Attention allocation isn't just for memory management and forgetting/dropping atoms.

  • high STI and low LTI atoms should probably not be stored to the Repository unless/until their LTI increases.
  • low STI atoms should probably be dropped from local instances.
  • old and very low LTI atoms should probably be forgotten from the Repository.
  • since STI only really makes sense in terms of in-memory atoms, perhaps the Repository shouldn't store this information.
  • Does it make sense to store LTI in the Repository?
  • Atoms freshly retrieved from the Repository would be given 0 STI. Huh? Why? Because retrieval is expensive, some mind-agent must have decided it was worth the effort to retrieve them. This implies that they would have a high STI, else there's no need to retrieve!
  • Atoms with VLTI (very long term importance) should immediately be put in the Repository.

These are basic guides, but more intelligent forgetting and dropping of atoms will eventually be implemented (given two low LTI atoms, the one that is hardest to recreate, or has more knowledge value, should be forgotten after the other).
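
The less contested heuristics in the list above can be written down as a small decision function. This is a sketch under loud assumptions: the Importance class, the action names, and all three thresholds are invented for illustration, importance values are treated as numbers between 0 and 1, and the "old" criterion for forgetting is left out because the page does not say how age would be measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Importance:
    sti: float          # short-term importance
    lti: float          # long-term importance
    vlti: bool = False  # very long term importance flag


def plan(av: Importance, in_repository: bool,
         sti_floor: float = 0.1,
         lti_floor: float = 0.1,
         lti_store: float = 0.5) -> List[str]:
    """Return the actions suggested for one atom; thresholds are assumptions."""
    actions: List[str] = []
    if not in_repository and (av.vlti or av.lti >= lti_store):
        # VLTI atoms go to the Repository at once; others wait for enough LTI.
        actions.append("store")
    if av.sti < sti_floor:
        # Low STI: drop from the local instance.
        actions.append("drop-local")
    if in_repository and av.lti < lti_floor and not av.vlti:
        # Very low LTI: forget from the Repository.
        actions.append("forget-from-repository")
    return actions
```

For example, an atom with high STI and low LTI that is not yet stored gets no action at all: it stays local and is not pushed to the Repository until its LTI rises.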

(Comment from Ben.) Or if there are 100 medium-LTI Atoms, but these can be quickly and somewhat accurately generated from 20 of them, it may be best to delete the other 80 .. even though they're nowhere near the lowest-LTI ones in the AtomTable.