Distributed AtomSpace Architecture

OpenCog must be able to function in a distributed environment, with many OpenCog applications and MindAgents working collaboratively, sharing Atoms with one another, and storing them in one or more repositories. This page provides a general review of some of the considerations for developing an architecture that further extends the existing abilities to perform distributed processing. The current, implemented system is described in the Distributed AtomSpace page.

Ideas about making aspects of OpenCog better leverage multiprocessing can be found here: Parallelizing OpenCog.

What it means to be "distributed"

Regarding running an OpenCog that spans multiple machines, there are several points to be kept in mind:

  • By definition, Atoms live in an AtomSpace, so the discussion below is really about communications between distributed AtomSpaces.
  • By definition, Atoms are meant to be globally unique. Knowing the name or the outgoing set of an atom is sufficient to uniquely identify that atom. There cannot (must not) be two atoms having the same name. See, however, the discussion marked "Re-imagining the AtomSpace" on the AtomSpace page, as it subtly alters the assumption of uniqueness.
  • By definition, the identity of an Atom is immutable: its name or outgoing set cannot be changed. Only the Values on the atom, such as the TruthValue (TV) or the AttentionValue (AV), can be changed. Thus, the notion of "sharing an atom" is really about identifying which atom it is, and then sharing the values attached to it.
  • Because a copy of an Atom may exist in multiple different locations on the network, and because the values on it are mutable, it is inevitable that different copies of an Atom will have different values associated with them. Thus, the primary task of a distributed atomspace architecture is to move those values around, and give various different applications a choice as to how to merge or replace those values, and how to broadcast and communicate those value changes.
  • Some value updates need to be performed in a coherent fashion: for example, counts must always be incremented properly, and increments must be merged correctly. Merge strategies that don't wreck count semantics can be achieved with conflict-free replicated data types (CRDTs). It would be advantageous to leverage an existing system that already implements CRDTs, such as Redis or Riak. These make a natural fit, since the values on an atom are a de facto per-atom key-value store, and fit naturally into a key-value database.
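
To make the point about count semantics concrete, here is a minimal sketch of a grow-only counter, one of the simplest CRDTs. It is not taken from Redis, Riak or OpenCog; the class name GCounter and the replica names are made up for illustration. Each replica increments only its own slot, and merging takes the per-slot maximum, so increments are never lost or double-counted no matter how often or in what order replicas exchange state.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by per-slot max."""
    replica_id: str
    slots: dict = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A replica only ever increases its own slot.
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.slots[self.replica_id] = self.slots.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # The observed count is the sum over all replicas' slots.
        return sum(self.slots.values())

    def merge(self, other: "GCounter") -> None:
        # Per-slot max is commutative, associative and idempotent, so
        # replicas converge regardless of delivery order or duplication.
        for rid, count in other.slots.items():
            self.slots[rid] = max(self.slots.get(rid, 0), count)


# Two hypothetical AtomSpace instances counting observations of the same atom.
a = GCounter("cogserver-a")
b = GCounter("cogserver-b")
a.increment(3)
b.increment(5)
a.merge(b)
b.merge(a)
b.merge(a)  # re-delivering the same state changes nothing
print(a.value(), b.value())  # both replicas report 8
```

A naive "last writer wins" merge of the two counts would have produced 3 or 5 here, which is exactly the kind of wrecked count semantics the bullet above warns about.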

What it means to be "scalable"

Another important undercurrent of the architecture is its performance and scalability. Of course, no one wants an architecture that runs poorly, and everyone wants an architecture that scales to a million or even a billion sites. Realistically, though, in order to correctly design a high-performance, scalable distributed AtomSpace, a prototypical application is needed, together with a performance and scaling analysis of that application. Without such an analysis, it becomes hard or impossible to understand choke-points and bottlenecks, and thus, impossible to recommend a design that avoids these choke-points.

Real-world experience

As of this writing (early 2019), we have only one such prototypical application, the language learning project, together with some basic analysis of what it does and how it works. It has been the primary driver of scalability. Here's what we have learned:

  • Postgres scales "just fine" to at least six cogserver instances running on distinct network-connected multi-core CPUs, all of them talking to a single Postgres server box sitting on the local one-gigabit Ethernet (gigE) network.
  • On average, one atom takes about 1 to 1.5 KBytes of RAM. This can be directly observed by loading 50 million atoms and seeing that this requires approximately 64 GBytes of RAM, as viewed with vmstat or top. Most of this RAM is used up by the atom incoming set and the atomspace indexes. The C++ class Atom is much smaller, about 100 bytes. Two conclusions: (1) you don't need scaling if you have fewer than 100 million atoms. Just buy more RAM. (2) It's not the size of the C++ objects, it's the size of the incoming set that dominates RAM usage.
  • Atoms have a Zipfian distribution. That is, half of all nodes have two incoming links; a fourth of them have four, an eighth of them have eight, and so on. There's a handful of atoms with millions of incoming links. In other words, the (raw) language network is a scale-free network. It seems probable that most natural data will also be scale-free, e.g. the biology and genomics data. Again: most RAM is eaten by the incoming set.
  • In a typical compute job, bulk-save to disk is the primary bottleneck. That is, for some/many compute jobs, disk I/O dominates: the math of doing some bulk compute pass over the data can be computed in a few hours; the save to disk of the result can take almost a whole day. Atom store rates are pathetic, on the order of thousands per second (millions per hour).
  • During bulk-save to disk, the primary bottleneck is the disk-drive write performance; the I/O bandwidth to the disk is almost but not quite a bottleneck. This is on a circa-2014 CPU with SATA II and circa-2018 SSDs with upper-tier write performance. The SATA links seem to run at about 50% or 75% of max bandwidth; the disks themselves are the bottleneck (configured as RAID-1). This is SSD; don't even try this with spinning disks. RAID striping might help with disk writes. It takes about 2 or 3 CPU cores to saturate the disks in this way; this includes the total CPU cycles burned by both OpenCog and Postgres. That is, a 4-core machine is sufficient to handle bulk store from a single OpenCog cogserver instance. I expect newer CPUs with PCIe SSDs to have a higher throughput overall, but a similar balance in the bottleneck profile. Splurge on PCIe SSDs; you'll be glad you did.
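
The numbers above can be turned into a quick back-of-the-envelope estimate. The sketch below only redoes the arithmetic from the bullets (50 million atoms in about 64 GBytes, store rates of "thousands per second"). The two rates, 1000 and 5000 atoms per second, are assumed sample points, not measurements, and a GByte is taken as 10^9 bytes.

```python
GB = 10**9  # decimal gigabyte; an assumption, the text does not say which unit


def bytes_per_atom(total_ram_bytes: float, atom_count: int) -> float:
    """Average RAM footprint per atom, incoming sets and indexes included."""
    return total_ram_bytes / atom_count


def hours_to_store(atom_count: int, atoms_per_second: float) -> float:
    """Wall-clock hours needed to bulk-save atom_count atoms at a fixed rate."""
    return atom_count / atoms_per_second / 3600.0


# 50 million atoms observed to occupy roughly 64 GBytes.
per_atom = bytes_per_atom(64 * GB, 50_000_000)

# RAM needed for 100 million atoms at the same average footprint.
ram_for_100m_gb = 100_000_000 * per_atom / GB

# Bulk-save time for 50 million atoms at two assumed store rates.
slow_hours = hours_to_store(50_000_000, 1_000)
fast_hours = hours_to_store(50_000_000, 5_000)

print(f"{per_atom:.0f} bytes/atom, {ram_for_100m_gb:.0f} GB for 100M atoms")
print(f"bulk save: {slow_hours:.1f} h at 1000/s, {fast_hours:.1f} h at 5000/s")
```

Under these assumptions the footprint works out to 1280 bytes per atom, inside the 1 to 1.5 KByte range quoted above, and a 50-million-atom save at 1000 atoms per second takes about 14 hours, which is consistent with "almost a whole day".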

Conclusions

Some guesswork that may help future designers:

  • Postgres was configured to run in a fairly conservative mode, which does make it run slower. This was done after tuning it for speed and then experiencing multiple power outages (due to thunderstorms) that corrupted the entire database. This was a major loss of data and work effort. Conclusion: uninterruptible power supplies (UPS) are worth the expense.
  • If you have a UPS, then maybe it's OK to run one instance of Postgres tuned for speed, as long as you have most of your data on another, tuned for safety. This does mean, however, more disk drives! (And more configuration and systems-management complexity.) Anyway, tuning Postgres for speed can improve write performance, maybe even by a lot (factors of 4x or more??), because it does not have to sync to disk.
  • Redesigning the AtomSpace Postgres backend tables might help with performance, but this is a major undertaking that requires an experienced (senior) systems programmer with more than several months of spare time to do the work. And maybe the current design is already close to optimal. It's not clear how much can be squeezed here.
  • Creating a backend for some other database might help. Or maybe not. Creating another backend might require a few months of time for a senior systems programmer; however, characterizing it for a serious workload will require another half a year. Don't just write a backend and declare victory; that's the easy part. The hard part is making it run well in actual compute scenarios. Fact.

Other experience

As of early 2015, here's what else we had:

  • An early implementation of PLN. This has never been run to the point where it exhausts all of the CPU cycles and RAM of a single CPU. That is, it has not yet been tuned to the point where it runs extremely well on 2015-era CPUs (12 or 16 cores, 32GB or 64GB RAM). Thus, its bottlenecks remain unknown, and it is not clear how to scale it up to a distributed system.
  • The pattern matcher currently traverses only the local atomspace. It is straightforward to extend its search to multiple atomspaces (by fetching the incoming sets of the constants in the query). However, no one has requested this capability yet. No one has tried this.
  • Work on placing biological genetic and proteomic data into the AtomSpace is progressing. This work has not yet run into the limits of current-day CPUs.

What to conclude about scalability

Thus, we don't really know what issues are going to bite us in the ass, not just yet. We can speculate and argue, but such arguments will be rendered moot by the reality of actually having to make something that works. And to make something that works, we will need to understand what the choke-points and the bottlenecks of these applications will be. And right now, we just don't know.

Worse: we don't know which bottlenecks might be easy to overcome, by minor algorithmic changes, and which bottlenecks appear to be deep and fundamental, requiring a major re-think of the system. So: it's easy to imagine monsters lurking in the shadows, but many can be dispatched with small, easy changes, rather than expensive, agonizing, total redesigns. Given the lack of data, it is not wise to be either optimistic or pessimistic.

That said, this proposal from Ben tries to go further, and look into various issues and problems not addressed below.

Core Assumptions

I) The core opencog applications and systems (PLN, etc) act on Atoms kept in cpu-local RAM, which are managed by AtomSpace. The core systems cannot/do not directly access Atoms from some remote location, but can work with AtomSpace to get a local copy instantiated.

II) There should be a persistent repository for atoms. This repository can be shared among dozens or thousands or millions of AtomSpace instances. This repository does not need to be instantaneously consistent with all AtomSpaces, but it should be in an eventually-consistent state. OpenCog instances may use this repository to communicate with one another. Let's call this the Repository. There need not be just one: an AtomSpace may connect to multiple repositories: e.g. some may contain linguistic data, others may contain spatial data. Some may combine spatial and linguistic data, but may consist entirely of very old memories.

III) There is a device-driver-like software shim between AtomSpace and the Repository. Let's call this the driver. It's the code that knows how to actually communicate with a repository, using its native API, whatever that may be (the C postgres API for PostgreSQL, the C JNI calls for Neo4j, etc).

IV) There is an API between AtomSpace and driver. Let's call this the backing store API. This is so that the AtomSpace can work with any and all possible remote AtomSpaces and repositories using just one well-known, documented API. There is no limitation on the API, but the simpler, the better. Note: ONLY the AtomSpace is allowed to use the backing-store API. Nothing else is allowed to touch this API. This is not a directive, mandate or requirement, but a definition: if something else is invoking some method in the API, then, by definition, that method is a part of the AtomSpace API, not the "backing store API". By definition, the backing store exists only as an internal layer within the atomspace.

V) There is an API between AtomSpace and user applications, such as MindAgents. Let's call this the AtomSpace API. This API must allow applications to have fairly direct control over how atoms are created, deleted and shared between different AtomSpace instances, and the various repositories. That is, direct control over sharing, creation and deletion is ceded to the actual OpenCog application or MindAgent. The reason for this is that automation may be difficult: some applications might create large amounts of "temporary" atoms (such as raw sensory inputs, or fleeting motor control commands) which should not be shared. Some applications (such as unsupervised language learning) desire explicit control over what is stored. Other applications (such as PLN) desire automated management, using attention allocation. Different agents may implement different kinds of attention allocation algorithms, and thus make different decisions about which atoms to share, and which atoms to delete. These decisions happen above the AtomSpace API; the AtomSpace API only provides a mechanism, it does not implement policy.
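
The layering in assumptions I) through V) can be sketched in a few lines. Everything named below (BackingStore, InMemoryRepository, fetch_values, store_values, store_atom, fetch_atom) is hypothetical and chosen only for this illustration; it is not the actual OpenCog API. The sketch assumes atoms are identified by a (type, name) pair and that a fetch simply replaces the local values. The point is the shape: applications talk only to the AtomSpace, the AtomSpace talks to the driver, and the decision about what to store is made by the application.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple

AtomKey = Tuple[str, str]  # (type, name); an assumption made for this sketch


class BackingStore(ABC):
    """Hypothetical backing-store API (IV): only the AtomSpace may call it."""

    @abstractmethod
    def fetch_values(self, key: AtomKey) -> Optional[dict]:
        ...

    @abstractmethod
    def store_values(self, key: AtomKey, values: dict) -> None:
        ...


class InMemoryRepository(BackingStore):
    """Stand-in for a real driver (III), e.g. one speaking to PostgreSQL."""

    def __init__(self) -> None:
        self.rows: Dict[AtomKey, dict] = {}

    def fetch_values(self, key: AtomKey) -> Optional[dict]:
        row = self.rows.get(key)
        return dict(row) if row is not None else None

    def store_values(self, key: AtomKey, values: dict) -> None:
        self.rows[key] = dict(values)


class AtomSpace:
    """Atoms live in local RAM (I); this class is mechanism, not policy (V)."""

    def __init__(self, backing: BackingStore) -> None:
        self._backing = backing
        self._atoms: Dict[AtomKey, dict] = {}

    def add_atom(self, key: AtomKey, **values) -> AtomKey:
        self._atoms.setdefault(key, {}).update(values)
        return key

    def get_values(self, key: AtomKey) -> Optional[dict]:
        return self._atoms.get(key)

    def store_atom(self, key: AtomKey) -> None:
        # Push one atom's values to the repository; called only on request.
        self._backing.store_values(key, self._atoms[key])

    def fetch_atom(self, key: AtomKey) -> bool:
        # Instantiate a local copy; here the fetched values replace local ones.
        values = self._backing.fetch_values(key)
        if values is None:
            return False
        self._atoms[key] = values
        return True


# Application-level policy: share the word count, keep the sensor reading local.
repo = InMemoryRepository()
writer, reader = AtomSpace(repo), AtomSpace(repo)
word = writer.add_atom(("WordNode", "cat"), count=42)
scratch = writer.add_atom(("NumberNode", "0.37"), source="sensor")
writer.store_atom(word)
print(reader.fetch_atom(word), reader.fetch_atom(scratch))  # True False
```

Swapping InMemoryRepository for a driver that speaks to a real database would not change the application code, which is the reason for keeping the backing-store API small.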

Details

II.a) The repository is necessarily a network server. It might be, but need not be, another opencog server instance. It might be a traditional "database". It may be any SQL or NoSQL database, such as Neo4j. It should not be a single-user disk volume.

II.b) A single-user, local-disk-volume storage backend could/would be useful, but I don't yet understand how this should layer hierarchically with the shared-server of II.a above.

II.c) The repository provides some amount of "smarts" for queries, e.g. possibly map-reduce-like. Just how much smarts is open for debate. I've been assuming a small/moderate amount of smarts, i.e. enough to keep a few indexes, maybe perform a few basic joins, but this may be a bad assumption, and we should consider a repository capable of arbitrarily complex queries.

II.d) Cost of accessing repository is "high"-- i.e. network access, as opposed to local RAM access. Accessing an atom over the network will always be three orders of magnitude slower than accessing an atom in local RAM. Thus, although the repository may maintain indexes of various sorts, the cost of accessing those indexes is also high.

II.e) The repository need not be a "single" server; it may be P2P-like or have whatever architecture is needed to provide scalability so as to allow for simultaneous use by millions of opencog instances.

III.a) The driver might contain indexes; these indexes are "by definition" in RAM, and are low-cost to query.

I.a) The AtomTable already contains two different indexes.

I.b) There already exists a "completely general" query mechanism for obtaining atoms that are in the AtomSpace. This is the Pattern Matcher. By default, the pattern matcher only traverses the local atomspace, but it is (relatively) straightforward to search a distributed atomspace by implementing the PatternMatchCallback::getIncomingSet() so that it traverses remote atomspaces.
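
The idea in I.b) can be illustrated with a toy model. This is not the C++ PatternMatchCallback interface; IncomingIndex and get_incoming_set are invented names, atoms are modelled as plain tuples, and a "remote atomspace" is just another in-process index. The sketch only shows the principle: the incoming set of an atom is the union of what the local atomspace knows and what each remote one reports.

```python
from typing import Dict, Iterable, Set, Tuple

# Toy encoding: a node is ("Concept", "cat"); a link is a tuple whose first
# element is the link type and whose remaining elements are atoms.
Atom = Tuple


class IncomingIndex:
    """Maps each atom to the set of links that contain it."""

    def __init__(self, links: Iterable[Atom] = ()) -> None:
        self._incoming: Dict[Atom, Set[Atom]] = {}
        for link in links:
            self.add_link(link)

    def add_link(self, link: Atom) -> None:
        # Every member of the outgoing set gains this link as an incoming link.
        for target in link[1:]:
            self._incoming.setdefault(target, set()).add(link)

    def incoming(self, atom: Atom) -> Set[Atom]:
        return set(self._incoming.get(atom, ()))


def get_incoming_set(atom: Atom, local: IncomingIndex,
                     remotes: Iterable[IncomingIndex]) -> Set[Atom]:
    """Union of local and remote incoming sets for one atom."""
    result = local.incoming(atom)
    for remote in remotes:
        # In a real system this would be a network round trip.
        result |= remote.incoming(atom)
    return result


cat = ("Concept", "cat")
animal = ("Concept", "animal")
pet = ("Concept", "pet")

local = IncomingIndex([("Inheritance", cat, animal)])
remote = IncomingIndex([("Inheritance", cat, animal), ("Inheritance", cat, pet)])

found = get_incoming_set(cat, local, [remote])
print(len(local.incoming(cat)), len(found))  # 1 locally, 2 with the remote
```

Because atoms are globally unique, the link that both sides know about collapses into a single entry in the union; only its values could differ between copies.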

Issues

VI) What should the backing-store API be?

VII) Should the AtomSpace API change? How?

VII.a) What kind of queries are supported by the AtomSpace API?

VII.b) What sort of indexes should the AtomTable keep?

VII.c) Should there be user-definable indexes kept by the AtomTable?

VII.d) What's the API for these user-definable indexes?

VII.e) Should there be locking primitives in the API?

VI.a--VI.e) same as above, but for the BackingStore API.

VIII) Should there be locks in the repository? What would their semantics be?

IX) How do different AtomSpace clients (i.e. those on remote machines) maintain (eventual) synchronization with each other?

Requirements

CI) A change-notification system is needed. OpenCog instances must be able to receive events from the repository, indicating atom creation, request-for-deletion, changes in importance, etc. Events may be generated from the repository itself, or by other opencog instances. (It is not clear how urgent or important this requirement really is -- who is depending on this? What algorithm requires this? Can that algorithm be designed in a different way?).

CI.a) These events should be configurable, so as not to generate a flood of events delivered to one opencog instance as the total number of instances connected to the repository grows.
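
One way to read CI.a) is that each instance registers a filter, and the repository delivers only matching events. The sketch below assumes that reading. AtomEvent, Subscription and EventHub are hypothetical names, and the event kinds are sample strings loosely based on the list in CI); none of this is an existing OpenCog interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class AtomEvent:
    kind: str       # e.g. "created", "delete-requested", "importance-changed"
    atom_type: str
    atom_name: str


@dataclass
class Subscription:
    kinds: Set[str]        # event kinds this instance cares about
    atom_types: Set[str]   # atom types this instance cares about
    deliver: Callable[[AtomEvent], None]

    def wants(self, event: AtomEvent) -> bool:
        return event.kind in self.kinds and event.atom_type in self.atom_types


class EventHub:
    """Stand-in for the repository's notification side."""

    def __init__(self) -> None:
        self._subscriptions: List[Subscription] = []

    def subscribe(self, subscription: Subscription) -> None:
        self._subscriptions.append(subscription)

    def publish(self, event: AtomEvent) -> int:
        # Deliver only to subscribers whose filter matches; return the count.
        delivered = 0
        for subscription in self._subscriptions:
            if subscription.wants(event):
                subscription.deliver(event)
                delivered += 1
        return delivered


hub = EventHub()
inbox: List[AtomEvent] = []
hub.subscribe(Subscription({"created"}, {"WordNode"}, inbox.append))

hub.publish(AtomEvent("created", "WordNode", "cat"))
hub.publish(AtomEvent("importance-changed", "WordNode", "cat"))
hub.publish(AtomEvent("created", "NumberNode", "0.37"))
print(len(inbox))  # only the first event matched the filter
```

Filtering at the publishing side keeps the per-instance event volume tied to what that instance asked for, rather than to the total number of connected instances.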

CII) Just because an Atom has been created within the AtomSpace does not mean that it must be pushed out to the repository, i.e. there is no requirement to keep the repository strictly in sync with the local AtomSpace. This is because many OpenCog applications and MindAgents will be creating large quantities of "temporary" or "scratch" atoms that do not need to be saved -- for example, atoms that represent fleeting, raw sensory input, or atoms that represent fleeting motor control commands.

Commentary

Forgetting versus Dropping

Ben mentioned "forgetting" and then talked about either removing the atom completely or pushing it out of the local AtomSpace and into the Repository. These are distinct problems. The current AtomSpace API provides two distinct methods for this: cog-purge and cog-delete. The 'purge' version removes the atom from the local AtomSpace; the 'delete' version removes it from the remote repository, as well as the local AtomSpace.

Which of these two is applied, and when, is meant to be under the control of a Forgetting Agent: one or more algorithms which decide which atoms can be deleted, based on assorted criteria, such as importance. See Attention Allocation for additional remarks.
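
The difference between the two operations is easy to see in a toy model. The class below is not the AtomSpace implementation; ToyAtomSpace and its methods are invented for this sketch, with purge and delete mirroring the semantics described above for cog-purge and cog-delete, and the repository modelled as a plain Python set.

```python
class ToyAtomSpace:
    """Local atoms in RAM plus a reference to a shared repository."""

    def __init__(self, repository: set) -> None:
        self.local: set = set()
        self.repository = repository

    def add(self, atom: str, store: bool = False) -> None:
        self.local.add(atom)
        if store:
            self.repository.add(atom)

    def purge(self, atom: str) -> None:
        # Dropping: remove from local RAM only; the repository copy survives.
        self.local.discard(atom)

    def delete(self, atom: str) -> None:
        # Forgetting: remove from local RAM and from the repository.
        self.local.discard(atom)
        self.repository.discard(atom)

    def fetch(self, atom: str) -> bool:
        # Re-instantiate a local copy if the repository still has the atom.
        if atom in self.repository:
            self.local.add(atom)
            return True
        return False


repository: set = set()
space = ToyAtomSpace(repository)
space.add("dropped-atom", store=True)
space.add("forgotten-atom", store=True)

space.purge("dropped-atom")
space.delete("forgotten-atom")

print(space.fetch("dropped-atom"))    # True: it can be brought back
print(space.fetch("forgotten-atom"))  # False: it is gone everywhere
```

A purged atom costs a repository round trip to recover; a deleted atom can only be recovered by recreating it.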

Attention Allocation Heuristics

Attention allocation isn't just for memory management and forgetting/dropping atoms.

  • high STI and low LTI atoms should probably not be stored to the Repository unless/until their LTI increases.
  • low STI atoms should probably be dropped from local instances.
  • old and very low LTI atoms should probably be forgotten from the Repository.
  • since STI only really makes sense in terms of in-memory atoms, perhaps the Repository shouldn't store this information.
  • Does it make sense to store LTI in the Repository?
  • Atoms freshly retrieved from the Repository would be given 0 STI. Huh? Why? Because retrieval is expensive, some mind-agent must have decided it was worth the effort to retrieve them. This implies that they would have a high STI, else, there's no need to retrieve!
  • VLTI (very long term importance) should immediately be put in the Repository.

These are basic guides, but more intelligent forgetting and dropping of atoms will eventually be implemented (given two low LTI atoms, the one that is hardest to recreate, or has more knowledge value, should be forgotten after the other).

(Comment from Ben.) Or if there are 100 medium-LTI Atoms, but these can be quickly and somewhat accurately generated from 20 of them, it may be best to delete the other 80 .. even though they're nowhere near the lowest-LTI ones in the AtomTable.
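
As a rough illustration, the heuristics listed above can be written as a single decision function. All threshold values, the Thresholds and decide names, and the action labels are assumptions made for this sketch; the list above gives no numbers. The function covers only the simple per-atom rules, not the smarter comparisons between atoms described in the last two paragraphs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All values are placeholders; a real agent would have to tune them.
    low_sti: float = 0.1
    low_lti: float = 0.1
    very_low_lti: float = 0.01
    old_age_seconds: float = 30 * 24 * 3600  # "old" taken to mean 30 days


def decide(sti: float, lti: float, vlti: bool, age_seconds: float,
           t: Thresholds = Thresholds()) -> set:
    """Return the set of actions the simple heuristics suggest for one atom."""
    actions = set()
    # VLTI atoms go to the Repository immediately; otherwise store only
    # once LTI is no longer low.
    if vlti or lti >= t.low_lti:
        actions.add("store")
    # Low-STI atoms are candidates for dropping from the local instance.
    if sti < t.low_sti:
        actions.add("drop-local")
    # Old atoms with very low LTI may be forgotten from the Repository.
    if not vlti and lti < t.very_low_lti and age_seconds > t.old_age_seconds:
        actions.add("forget-from-repository")
    return actions


day = 24 * 3600
print(decide(sti=0.9, lti=0.05, vlti=False, age_seconds=10))
print(sorted(decide(sti=0.01, lti=0.5, vlti=False, age_seconds=10)))
print(sorted(decide(sti=0.01, lti=0.001, vlti=False, age_seconds=90 * day)))
print(decide(sti=0.5, lti=0.0, vlti=True, age_seconds=10))
```

The first call models a high-STI, low-LTI atom: it stays in RAM and is not stored. Since policy lives above the AtomSpace API, a different agent could replace this function wholesale without touching the storage layer.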