Networked AtomSpaces

From OpenCog

There is a considerable amount of confusion about how to build a so-called "distributed AtomSpace", or possibly a "decentralized AtomSpace". Much of the confusion is due to a lack of a clear definition of what these terms mean, and what they imply for the user of the AtomSpace. This page attempts to provide some of these definitions. It also attempts to weigh at least some of the design tradeoffs and implications for the user.

A single process, running on a single machine (in a single address space), can also work with multiple AtomSpaces at the same time. See the article on multiple AtomSpaces for a discussion of how this can be done.


Status

Design and implementation is being tracked in #2138 #1855 #1502 #1967. Some real-world experiences are documented in the network scalability wiki page.

  • A peer-to-peer communications system has been built, allowing two AtomSpaces to trade Atoms with one-another. It is implemented with the CogStorageNode. It provides good performance, and is suitable for use as a building block.
  • A shared-AtomSpace system has been built, allowing multiple AtomSpaces to access a single, common repository of Atoms. This is implemented with the PostgresStorageNode. See also notes below, for a simple alternate implementation.

More sophisticated systems have not been built.


Terminology

The rest of this page will use the following terminology:

  • An AtomSpace is the in-RAM dataset on the current, local machine. It is the collection of Atoms that are immediately accessible, because they are in RAM at this particular instant in time.
  • Storage is a dataset (a collection of Atoms and Values) that is accessible through the StorageNode API. Each StorageNode provides access to a different set of Atoms (possibly overlapping sets). Those Atom sets might reside on disk on the local machine (for example, RocksStorageNode) or they may reside on a disk on a remote machine (for example, PostgresStorageNode) or they may reside in an AtomSpace on a remote machine (for example, the CogStorageNode).

An AtomSpace can work with any number of StorageNodes at a time, sending Atoms to, and receiving Atoms from each.

Different AtomSpaces may play different roles in a network. Some of these include:

  • A User is an AtomSpace that is accessing data via Storage. That is, a User has some task to perform: it does this by populating its own AtomSpace with data (possibly by querying Storage), doing that work, and then pushing out results to Storage. Defining the idea of a "User" with a capital U is useful, as it helps clarify one of the roles that AtomSpaces on the network can have.
  • A Provider is an AtomSpace whose primary function is to supply data to Users. An example of a provider might be an AtomSpace used to perform complex, CPU-consuming queries, returning results to Users. Another kind of provider might be one that aggregates results from multiple query responders (such as in a scatter-gather operation).

There is no pre-determined list of roles. The types of roles an AtomSpace might assume depends on the network architecture being constructed.


Network topologies

A network of AtomSpaces can be organized in different ways. The following definitions are convenient.


Peer-to-peer

Two AtomSpaces interact with one-another, exchanging Atoms. This ability is currently provided by the CogStorageNode. It uses the CogServer to provide point-to-point networking communications, and the s-expression shell to provide a high-performance protocol for sending, receiving and querying sets of Atoms. The peer-to-peer exchange can be accomplished with a grand total of two StorageNodes: one in each local AtomSpace, representing the connection to the other AtomSpace.

The CogStorageNode seems to work well, and seems to be pretty fast. It is suitable as a basic building block for the systems described below, and should be taken as the first choice for building more complex systems.


Shared

Two or more AtomSpaces interact with one-another, by trading Atoms with a single, centralized "master" StorageNode. This ability is currently provided by the PostgresStorageNode. This node holds all of the Atoms in a PostgreSQL database; each AtomSpace can individually send Atoms to it, or query for sets of Atoms. Two different AtomSpaces can communicate with each other indirectly, by exchanging Atoms through the centralized server. The networking model is a "hub and spoke" model: there is a single centralized hub, and a spoke extending to each individual AtomSpace. For N AtomSpaces, it suffices to have N StorageNodes: one per AtomSpace. That single StorageNode is used to talk to the centralized server.

A demo of multiple AtomSpaces accessing a single shared data store can be found in /examples/atomspace/distributed.scm. Yes, that demo uses the word "distributed" in a way that conflicts with the definition here. That demo is a "shared" demo, rather than a "distributed" demo.

It should be possible to build a shared system out of a single AtomSpace that is running the RocksStorageNode, and placing itself on the network with the CogServer. This would then allow multiple client AtomSpaces to access it via a CogStorageNode. This is conceptually very easy to do (and has been done in one of the example demos). The tricky part, the part that a production system needs to solve, is memory management: when the number of Atoms sitting in RAM in the server AtomSpace gets too large, some of those Atoms must be written to the RocksDB, and removed from the AtomSpace (thus freeing up RAM). This is tricky because there aren't any timestamps on Atoms; there's no obvious concept of "least recently used". (Yes, of course, the AttentionBank could do this, but it is stunningly inefficient and wasteful. It sounds good in theory; it loses in practice. We need an AttentionBank-light.) The git repo is meant to be the place where policy agents are implemented. See below for a definition of a policy agent.
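The eviction bookkeeping described above can be sketched in plain Python. This is a toy model only, not AtomSpace code: the class name, the `touch` method, and the plain dict standing in for RocksDB are all hypothetical. Since real Atoms carry no timestamps, a policy agent would have to maintain exactly this kind of recency ordering itself:

```python
from collections import OrderedDict

class LruEvictingSpace:
    """Toy eviction policy agent: keeps at most `capacity` atoms in a
    RAM dict, spilling the least recently used ones to `storage`
    (a stand-in for RocksDB) when the limit is exceeded."""

    def __init__(self, capacity, storage):
        self.capacity = capacity
        self.ram = OrderedDict()    # atom-key -> values, in recency order
        self.storage = storage      # stand-in for the on-disk store

    def touch(self, key, values=None):
        """Access (or add) an atom, marking it most recently used."""
        if key in self.ram:
            self.ram.move_to_end(key)
            if values is not None:
                self.ram[key] = values
        else:
            # Fault it in from storage, or create it fresh.
            self.ram[key] = values if values is not None else self.storage.get(key)
            self._evict()
        return self.ram[key]

    def _evict(self):
        while len(self.ram) > self.capacity:
            key, values = self.ram.popitem(last=False)  # oldest first
            self.storage[key] = values                  # write back to "disk"

disk = {}
space = LruEvictingSpace(capacity=2, storage=disk)
space.touch('(Concept "a")', {"tv": 0.1})
space.touch('(Concept "b")', {"tv": 0.2})
space.touch('(Concept "c")', {"tv": 0.3})   # "a" is evicted to disk
```

The hard part a real agent faces is not this data structure, but deciding what counts as an "access" when no timestamps exist.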


Distributed

A collection of Atoms can be thought of as forming a single large set. This set is distributed across multiple machines. Any given AtomSpace belonging to this distributed network will typically hold only a subset of the grand-total set. Implicit in this definition is that any one AtomSpace can always get any Atom in the grand-total dataset. Thus, for example, any one AtomSpace can perform a query, and receive, in response, every Atom in the grand-total set that satisfies that query.

This last assumption makes the architecture of a distributed AtomSpace network difficult. The network must guarantee that every Atom can be tracked down, no matter where it is presently located. There are several simple solutions, but these have drawbacks. One simple solution is to use the shared model above. In this case, every query is sent to the central server for processing. A drawback is that the central server may be overwhelmed by all the requests it is given. Another drawback is that if the central server goes down, then game over. A different simple solution is to use the peer-to-peer model, where every AtomSpace has a connection to every other AtomSpace. In this case, a query is sent to each peer, and the results are merged together locally. A drawback is that if the peers have large subsets of the total dataset in common, then a lot of duplicated processing is done. If one peer goes down, some fraction of the total dataset may become unavailable. In both cases, there is no fault tolerance without additional engineering. In both cases, the solution is not scalable (typically bogging down somewhere between 10 and 100 nodes).
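The peer-to-peer variant above (fan the query out, merge locally) can be sketched in Python. Everything here is a stand-in: sets of s-expression strings play the role of remote AtomSpaces, and a predicate plays the role of a query. Note how overlapping subsets force the deduplication step, which is the duplicated-processing drawback in miniature:

```python
def scatter_gather(peers, predicate):
    """Fan a query out to every peer and merge the answers locally.
    `peers` is a list of atom collections (stand-ins for remote
    AtomSpaces). The merge deduplicates, since peers may hold
    overlapping subsets of the grand-total dataset."""
    results = set()
    for peer in peers:                      # one round trip per peer
        results.update(a for a in peer if predicate(a))
    return results

peers = [
    {'(Concept "cat")', '(Concept "dog")'},
    {'(Concept "dog")', '(Concept "gene")'},   # overlaps with peer 0
]
pets = scatter_gather(peers, lambda a: a in ('(Concept "cat")', '(Concept "dog")'))
```

If a peer is unreachable, the loop simply misses its atoms: exactly the availability gap described above.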

There are two perspectives of a distributed architecture: the insider perspective and the user perspective. An insider is an AtomSpace that is a part of the distributed architecture, serving up data and/or processing upon request. A user is an AtomSpace wishing to make use of the distributed dataset. From the user's perspective, it would be best if there was a single point of contact into the dataset: a single StorageNode that the user works with to gain access to data.


Federated

There are several ways to think about federation: federated datasets and federated topologies. Consider a collection of datasets that are loosely related, having some commonalities and many differences: say, a biology dataset and a medical dataset. Insofar as both contain data about humans, they are related, and might contain overlapping information. Overall, they are curated distinctly from one another: there are distinct authorities for each dataset.

The internal architecture of a single dataset in a federation should be considered as a black box; it may have structure, but this is irrelevant. A federation might expect to have many users who work with several members of the federation: for example, users who wish to access both the medical and the biological datasets. It is reasonable to ask the users to contact each member of the federation individually. It is also reasonable for the users to expect the data to be mostly harmonious. Thus, the member datasets of the federation are expected to run ad hoc processes to harmonize duplicated data.


Decentralized

A large random network of AtomSpaces. Some members of the network may be federations, and some may be distributed AtomSpaces (recalling that, from the above definition, a distributed AtomSpace appears as a single, large AtomSpace to the user). A defining characteristic of a decentralized topology is that there is no central authority. There is typically a lack of harmonization: no expectation that clashing datasets will be brought into consistency with one-another. Useful decentralized services do come with some expectation that there is at least some mechanism for performing wide-ranging queries of most of the network, typically via some indexing. Classic examples are systems with distributed hash tables (DHT's): any given user (any given client AtomSpace) can issue a query in such a way that one or more authoritative StorageNodes can be contacted to obtain the desired answer. This includes query responses of the form "I don't know, but so-n-so might" with an explicit address (explicit StorageNode) for so-n-so.


Polity

A decentralized AtomSpace system, together with an authority mechanism to harmonize the data therein. The authority mechanism determines how members come to agreement with regard to the correct version of conflicting data. For example, in a peer-to-peer network of AtomSpaces, there may be multiple peers holding a copy of some Atom (the "same Atom", as Atoms are globally unique). Each copy may, however, have different Values attached to it: different TruthValues or AttentionValues. If a query is performed, and conflicting sets of Values are returned, the authority mechanism is meant to resolve such conflicts (possibly by telling some of the AtomSpaces to update the Values on the Atom in question).

The authority mechanism is respected by all members of the polity -- it is understood to be the "right way" of resolving conflicting TruthValues, or Values in general. Different polities will have different mechanisms. Examples include majority voting, appeals to an expert body, multiple rounds of negotiation and/or bidding.
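As a concrete illustration, majority voting, the simplest of the mechanisms listed above, can be sketched in a few lines of Python. The function name and the bare floats standing in for TruthValue strengths are hypothetical:

```python
from collections import Counter

def majority_vote(reported_values):
    """Resolve conflicting Values for the "same" Atom held by several
    AtomSpaces: pick whichever value the most members report. This is
    one of the simplest possible authority mechanisms; a real polity
    might instead negotiate, bid, or defer to an expert body."""
    tally = Counter(reported_values)
    value, votes = tally.most_common(1)[0]
    return value

# Three AtomSpaces report conflicting TruthValue strengths for one Atom.
agreed = majority_vote([0.9, 0.9, 0.2])
```

The losing AtomSpaces would then be told to update their copies, which is where the real engineering lives.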

The authority mechanism, as defined here, is clearly outside of the straight and narrow confines of dealing with network topologies and the inter-AtomSpace communications of Atom sets. Yet it is also clearly a key function for maintaining high quality data.

Emphasis has been placed here, as early versions of the AtomSpace had authority mechanisms baked into the definition of Atoms, and, specifically, merger policies for TruthValues. The above discussions hopefully make clear that implementing a specific authority mechanism into the AtomSpace is a bad idea. It does, however, leave room for the possible implementation of a "MajorityVoteTruthValue" whose actual truth value adjusts to the majority vote of all of its members. The architectural goal here is ancient: provide mechanism, not policy.

Some authority mechanisms can be implemented as policy agents. The git repo is meant to collect such agents.

Key concepts

The definition of Atoms and AtomSpaces depends on a number of important conceptual properties. The ones that impact networked AtomSpaces are briefly recapped below.

  • By definition, Atoms are meant to be globally unique. Knowing the type and name (or the type and outgoing set) of an Atom is sufficient to uniquely identify that Atom. Within a single AtomSpace, there cannot be two distinct Atoms having the same type and name. It is intended that this property holds, "conceptually", in the networked case, as well.
  • By definition, the identity of an Atom is immutable: its name or outgoing set cannot be changed. Only the Values on the atom, such as the TruthValue (TV) or the AttentionValue (AV), can be changed. Thus, the notion of "sharing an Atom" is really about identifying which Atom it is, and then sharing the Values attached to it.
  • The "same" Atom (by the above definition) being held in two different AtomSpaces may have different and conflicting Values attached to it. A key networking process is the resolution of these conflicts (by means of some policy or other ad hoc measures).
  • Some Value updates need to be performed in a coherent fashion: for example, counts must always be incremented properly, and increments must be merged correctly. Merge strategies that don't wreck count semantics can be achieved with Conflict-free replicated data types (CRDTs). At this time (2021) we do not have any CRDT Values; creating one or more of them, for counting, appears to be a requirement.
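A minimal grow-only counter (G-Counter), the classic counting CRDT, can be sketched in Python. This is generic CRDT machinery, not AtomSpace code; the replica names and the dict-of-slots representation are illustrative only:

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own
    slot, and merging takes the per-replica maximum. Increments are
    never lost, regardless of merge order or repetition -- exactly
    the property a count-carrying Value would need."""

    def __init__(self):
        self.slots = {}                       # replica-id -> local count

    def increment(self, replica, n=1):
        self.slots[replica] = self.slots.get(replica, 0) + n

    def merge(self, other):
        for rep, cnt in other.slots.items():
            self.slots[rep] = max(self.slots.get(rep, 0), cnt)

    def value(self):
        return sum(self.slots.values())

# Two AtomSpaces count the same observation independently...
a, b = GCounter(), GCounter()
a.increment("atomspace-a", 3)
b.increment("atomspace-b", 2)
a.merge(b)
b.merge(a)          # ...and converge after exchanging state.
```

Naively summing the two local totals would double-count on repeated merges; the per-replica max is what makes the merge idempotent.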

Distributed vs. Shared AtomSpaces

A distributed AtomSpace is a collection of individual AtomSpaces, coordinating to create the illusion of a single large dataset. There are two types of AtomSpaces in a distributed AtomSpace: the Providers, or "internal" AtomSpaces, cooperating to maintain the dataset, and the external Users, who are accessing this dataset.

It is presumed that a User accesses a distributed AtomSpace through a single StorageNode. This is for ease-of-use: the client AtomSpace does not have to concern itself with what is going on inside of the distributed system: it has a single point of contact through which it works. From the user's point of view, a shared AtomSpace and a distributed AtomSpace look the same: they're both places you can get Atoms from, and put Atoms into. So what are the differences?

Shared. The prototypical shared AtomSpace is the PostgresStorageNode. It is not an AtomSpace at all: it is a PostgreSQL database. Postgres can handle many AtomSpace clients. It is not known how many; we have not measured. At least three. Maybe a dozen. Maybe more. Postgres itself is distributed: it will run on multiple networked machines. So, for example, one could have a few AtomSpaces connecting to a Postgres server in Chicago, and a few more in New York, and all of them are working with the same Atom dataset, because Postgres knows how to distribute data. How well this might actually work is unknown, because we have never measured AtomSpace performance between cities. You can be the first! Use the demo in /examples/atomspace/distributed.scm as a guideline.

Distributed. A switch to a distributed architecture makes sense if/when the above shared architecture stops scaling. Reasons for not scaling might be:

  • Too many clients, and a single shared server simply cannot handle that many tcp/ip connections at the same time. This will result in server-not-responding type errors.
  • Too many clients randomly accessing data, such that data is almost always on disk, rather than being in RAM. This will lead to poor performance, even for simple requests.
  • Too many clients performing CPU intensive queries (BindLinks, etc.) that overwhelm the CPU resources of the server. This will lead to poor performance overall.

Distributed design issues and requirements

This is a highly incomplete and totally random collection of issues and requirements. It is NOT an actual design. The wiki page Network scalability lists some of the real-world experience with networking together AtomSpaces.

  • Data ownership and contextual knowledge. The data ownership problem is this: issue #1855 describes a very large "master" dataset, which should be treated as read-only, containing the "best" copy of data (e.g. genomic, proteomic data). Layered on top are multiple read-write deltas to it, created/explored by individual researchers who want to try new algos out, without wrecking the master copy. Because the master dataset is huge, the individual researchers do not want to make a full copy of the master, just to do some minor tinkering.
  • The ContextLink (issue #1967) only "partly" solves the above. It still implicitly assumes that there's a single master. The very concept of "master copy" is still baked into the system.
  • Authorization and security. The current AtomSpace design has only one weak security mechanism: an AtomSpace can be marked read-only. However, anyone can change this to read-write, so the read-only marking only serves to provide a safeguard from runaway accidents. There are no authentication or authorization mechanisms. Anybody can update anything.
  • Coordination. At this time, there do not exist any proposals for how to shard a large dataset across multiple AtomSpaces. This is in part because we don't actually have any large datasets: a single Atom consumes about 1.5KBytes of RAM (typical), and so a 256GByte server can handle more than 100 million Atoms. That's a lot.
  • Query CPU crunching. At this time, we do have a real-life serious bottleneck in query processing. One typical genomics query requires finding (joining together) two proteins that react, and that also up or down-regulate genes that express one or both of these proteins. This is a five-point query: a pentagon. The number of combinatorial possibilities runs into the millions, and the queries take tens of minutes or hours to perform. On a 24-core CPU, one can run 24 of these queries in parallel; after that, a second machine is needed. Suppose that the AtomSpace is updated on the first machine. How are these updates to be propagated to the second machine? Currently, there is no "commonly accepted" solution to this problem. i.e. we don't have any production-quality code that solves this.

As should be clear, it is not enough to just "distribute" the AtomSpace. It has to actually work well in real-world scenarios.

Decentralized AtomSpace

There were two very naive attempts to create a "decentralized" AtomSpace. Both failed, for reasons that are obvious, in retrospect. These attempts are (in chronological order):

  • atomspace-ipfs, built on IPFS.
  • atomspace-dht, built on OpenDHT.

The initial idea was that one could just somehow jam a bunch of Atoms into a distributed hash table (DHT), and they would just float out there in the cloud, and that would take care of that. Both of the above implementations do this. They "work", in that the IPFS version passes some of the standard StorageNode unit tests, and the OpenDHT version passes all(?) of them. However, the core idea that motivated them turned out to be deeply flawed.

A detailed critique can be found in the atomspace-dht repo, in the section titled Architecture critique. Briefly,

  • The DHT sits in RAM. Just like the AtomSpace. So they both compete for the same resource. An Atom in the DHT takes up more RAM than an Atom in the AtomSpace. Ouch!
  • Lack of locality. Closely related Atoms, e.g. (Concept "cat") and (Concept "dog") may end up on opposite sides of the planet. Performing a query for household pets potentially twiddles trans-oceanic fiber optics. Ouch!
  • It's not really decentralized, since a single DHT key is used to identify all of the Atoms in some given AtomSpace. (The design allows many different AtomSpaces to co-exist in the DHT).

In retrospect, the above design is just goofy. The correct design should follow the conventional design of file-sharing services. In that design, curators of some given AtomSpace announce that they are a seeder, and upload a meta-description of the kind of data they provide to the DHT. Users would examine the metadata, and then connect directly to the seeder, or to anyone else who was publishing a copy of that AtomSpace. In this architecture, the DHT works only as an index: the DHT holds the meta-data about AtomSpaces. The AtomSpaces themselves are served conventionally (with the CogStorageNode).
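The DHT-as-index design can be sketched in Python, with a plain dict standing in for the DHT and made-up cog:// addresses for the seeders. Only metadata lives in the index; the Atoms themselves stay with the seeders:

```python
# The DHT holds only metadata about AtomSpaces; the atoms themselves
# are served directly by seeders (e.g. over a CogStorageNode link).
dht = {}   # stand-in for a real DHT: dataset name -> announcements

def announce(dht, space_name, seeder_addr, description):
    """A curator publishes a meta-description of a dataset, plus the
    address at which that dataset is conventionally served."""
    dht.setdefault(space_name, []).append(
        {"seeder": seeder_addr, "description": description})

def find_seeders(dht, space_name):
    """A user looks up who serves a dataset, then connects directly."""
    return [entry["seeder"] for entry in dht.get(space_name, [])]

announce(dht, "genomics-2021", "cog://lab-box-1:17001", "protein/gene reactome")
announce(dht, "genomics-2021", "cog://lab-box-2:17001", "mirror of the above")
seeders = find_seeders(dht, "genomics-2021")
```

The design choice is the same as in file-sharing networks: the index is tiny and replicable, while the bulk data keeps its locality.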

There's only one problem with this more conventional architecture: who would want to use it? Do you really want to access some rando AtomSpace out on the net? Well, maybe you could run a private, controlled DHT in the local lab intranet, but why? You already know which of your machines are running what data, you don't really need a DHT for this.

At the moment, this is a solution in search of a problem.

Things that don't work

More than a decade of debate and implementation of networked AtomSpaces has resulted in a laundry list of ideas and concepts that don't work. These are listed here, in the hope that readers are dissuaded from repeating these mistakes and confusions.

The key fact driving all of these failures was the failure to understand that Atoms are tiny. Like, really, really tiny. Which also means fast. That means that almost any great idea that you get will be much larger and more complex than an Atom, and so almost anything you do will take more CPU time, and cost more RAM, than an Atom does. Failure to appreciate this is deadly, in an almost literal sense. It kills the enthusiasm and excitement of project participants. It leaves a taste of bitter disappointment in the mouth of willing volunteers.


UUIDs

The UUID idea is that, since Atoms are meant to be globally unique, perhaps we can assign a globally unique integer to each Atom, and refer to that Atom with that integer. This seems like a good idea, since integers are small. Right? Wrong! First: it is hard (impossible?) to issue UUID's without a central authority. That's a bottleneck. It takes CPU time to issue a UUID. That's a bottleneck. If you have a UUID, and you want to know what Atom it corresponds to, you have to look it up in a table. That takes CPU time. Worse, that takes RAM. Worse, one slot in this lookup table is almost as big as an Atom itself. Thus the UUID table is competing for RAM with the AtomSpace itself! That's not good, that hurts. Overall, UUID's are not just a bad idea, they are a really bad idea™.

That said, there may be situations where they really are needed, or where they really might make sense. For these folks, we have a handy-dandy library that does it all, automagically, called "the TLB". It's here: opencog/atomspaceutils/TLB.h. Have at it.

An example where UUID's might be useful is when you have a point-to-point link, and all you need are compact integer representations for Atoms, on that link. Then, the ID's only need to be unique, on that link, and not universally. If the link is slow, then maybe the overhead of the table lookup is worth it. Maybe. You would have to actually build it and measure it, and find out. But building and measuring takes effort. And it might not even work. How much effort do you want to put into this?
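A link-local ID table of the kind described above can be sketched in Python. This is not the actual TLB implementation, just an illustration of the bookkeeping (and the RAM overhead) involved:

```python
class LinkLocalIds:
    """Per-connection ID table, in the spirit of the TLB: integers are
    unique only on this one link, issued on first sight, and translated
    back on receipt. The two dicts below are precisely the RAM overhead
    the text warns about -- measure before assuming they pay for
    themselves."""

    def __init__(self):
        self.to_id = {}      # s-expression -> small int
        self.to_atom = []    # small int -> s-expression

    def encode(self, sexpr):
        """Issue (or reuse) the link-local integer for an atom."""
        if sexpr not in self.to_id:
            self.to_id[sexpr] = len(self.to_atom)
            self.to_atom.append(sexpr)
        return self.to_id[sexpr]

    def decode(self, ident):
        """Translate a received integer back into the atom it names."""
        return self.to_atom[ident]

link = LinkLocalIds()
i = link.encode('(Concept "cat")')
j = link.encode('(Concept "cat")')     # same atom, same id
```

After the first transmission, repeat references cost one small integer instead of a full s-expression; whether that wins depends on the link speed versus the lookup cost.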


REST

REST is a web technology that allows automated systems to make requests and obtain responses using the familiar HTTP framework. Because it's HTTP, it can work on the open internet, punching through firewalls, etc. There are a large number of packages, addons and servers that provide authentication and any number of other whiz-bang services and utilities. Can't lose, right? Oh, but you can! Atoms are tiny, much smaller than HTTP headers. You can lose most of your CPU time decoding the HTTP header, and the rest of your CPU time parsing the URL and branching to the correct handler that will respond to whatever the request is. We wrote a RESTful server using Python flask and swagger JSON. It was able to handle several hundred Atoms per second. If you have a million-Atom dataset (which normally requires about a GByte of RAM), that means ... hours. Per GByte. Ouch. Clearly not practical.

That said, it might still be possible to build some kind of RESTful system that will provide authentication, and then get out of the way: it would set up a separate TCP/IP socket for a long-running session. For example, use REST to open a communications channel for a CogStorageNode, which does run fast. No one has done this yet. It might be nice. I dunno.


JSON

JSON is a nice format for encoding object-oriented data into ASCII/UTF-8. It is widely supported, most programmers are familiar with it, and there is a broad range of tools available to work with it. There's only one problem. Atomese is not object-oriented.

JSON assumes that you work with lists and structures. The structures have named fields; for example, (name, address, city, state, phone number). Atomese does not have any named fields: all fields are anonymous, nameless. Now, of course, you could invent something, maybe like the below:

   "Atom type" : "ConceptNode",
   "Node name" : "cats and dogs and other household pets",
   "Outgoing set" : [ ],
   "Values" : [ ]

In the above example, the outgoing set is the empty list, because ConceptNodes don't have an outgoing set; only Links do. What's the problem with the above? Well, for starters, it requires twice as many bytes as the plain Atomese s-expression

(Concept "cats and dogs and other household pets")

The other problem is with the outgoing set. How do you represent that? Well, to save space, you could stick UUID's into it. Hmm. Did you read the bit about UUID's above? Or maybe you could ... stick an s-expression for the Link in there ... Hmmm. And what about Values, which don't even have UUID's? Upshot: encoding Atoms as JSON just isn't very efficient.
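The "twice as many bytes" claim is easy to check in Python, using the example Atom from above:

```python
import json

# The same Atom, in both encodings.
sexpr = '(Concept "cats and dogs and other household pets")'
as_json = json.dumps({
    "Atom type": "ConceptNode",
    "Node name": "cats and dogs and other household pets",
    "Outgoing set": [],
    "Values": [],
})

# The JSON encoding is more than double the size of the s-expression,
# before any outgoing-set or Value complications even enter the picture.
ratio = len(as_json) / len(sexpr)
```

And this is the best case for JSON: a Node with an empty outgoing set. Links make the gap worse.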

What about decoding? Decoding sounds real easy, but is it? Well, your decoder has to handle braces, and commas, and square brackets and quotes. A decoder for s-expressions just has to handle ... parentheses and quotes. The decoder for s-expressions is smaller, faster, simpler than a JSON decoder. That means it's more likely to fit in the i-cache of the CPU. So it will run faster. And since s-expressions are smaller, they have a smaller footprint in the d-cache. Which means fewer cache misses. Which means it runs faster. See what the problem is?
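To make the point concrete, here is a complete s-expression reader in Python: the entire grammar is parentheses, double-quoted strings, and bare symbols. (This is an illustrative toy, not the AtomSpace's actual s-expression decoder.)

```python
import re

def parse_sexpr(text):
    """Minimal s-expression reader. The whole grammar fits in one
    regular expression: parens, quoted strings, bare symbols."""
    tokens = re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == '(':
            lst = []
            while tokens[pos] != ')':
                lst.append(read())
            pos += 1                      # consume the ')'
            return lst
        # Strip the quotes off string atoms; symbols pass through.
        return tok[1:-1] if tok.startswith('"') else tok

    return read()

tree = parse_sexpr('(Evaluation (Predicate "pet") (Concept "cat"))')
```

A comparably complete JSON parser needs several times this much code, which is the i-cache argument in a nutshell.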

Factors of 2x in performance might not sound like a lot, but if you have to move several million Atoms over a network socket, stuff like this matters.

gRPC, ZeroMQ, Protocol Buffers

There are many sexy network protocols out there that promise to do lots of really cool things for you, in an easy, modern and flexible fashion. Out of these, gRPC, ZeroMQ, Protocol Buffers are just three examples; there are many more.

What are you going to do with them? Make some remote procedure calls to some server? Send some headers over? Decode them? Send an Atom or two, encoded as JSON? Use UUID's, so that the messages are small and compact? Really? Are you even listening? Atoms are small. Really really small. Whatever you do, you had better have a mighty fine plan to figure out how all the headers and encoding and decoding and whiz-bang features don't overwhelm the Atoms. It is real easy to build something big and bloated. People do this all the time! People have done this, over and over, for the OpenCog project. It has not been productive, except maybe as a personal learning experience.

Not saying that it cannot be done, but please oh please do not walk into this with your eyes closed. Being naive about this will lead to a few months of wasted coding time, creating a nifty system that just doesn't work very well.

The current pace-setter is the CogStorageNode. It's not perfect. It's not the ultimate. It can be improved upon. There are more than a few things you could do to make it faster. But those things would not be particularly easy. I'm willing to make a bet that if you build a system out of gRPC, ZeroMQ or other modern, whizzy system, that it will be slower than CogStorageNode. I mean, you could beat it, but ... it would take some work. There would have to be a trick or two. You would need to be clever.

Neo4J, other (graph) databases

Like any hot new shiny technology, these sound like a good deal. Redis, Riak, Neo4J? Hypertable? What can go wrong? Well, the answer is more or less a mix of everything that was already said up above. Unless you are extremely careful and diligent with the design, all of the moving parts and pieces will be bigger and bloatier than the Atoms that they are meant to handle. I'm not saying it can't be done; I'm saying, it won't be easy. This isn't an idle claim meant to incite, it's from hard, direct experience. It's from actually trying to make an AtomSpace driver for Postgres. It got made. It works. It's very complex, and has underwhelming performance. There's no obvious fix for the performance. And, yes, other people have tried.

Oh, we also built a driver for Neo4J. It ran 10x or 20x slower than the Postgres driver. However, that might not be the fault of Neo4J, that might be because the driver used a RESTful API, and UUID's and a JSON format. There were more than a few bottlenecks.

Currently, the shining counter-example is the RocksStorageNode, built on RocksDB. That works well, it's reasonably fast, and the code for it is fairly simple. But this is only because the API for RocksDB is fairly simple, and RocksDB is fast, itself. Can RocksStorageNode be made to go faster? Yes, absolutely. There are definitely tweaks that could be made to the driver. But the overall lesson was that the speed came from the simplicity and directness of design.

If you are working with any system that needs a lot of translation, a lot of conversion, assorted shims, well, all those shims and conversions will kill performance and eat RAM. If the underlying storage format is similar to the AtomSpace, you will win. If it's very different, you will lose.


Gearman

This is unfair, because Gearman almost works. So, Gearman is a completely obscure and unknown distributed processing system. It has one good thing going for it: it is simple, easy and elegant. It implements the distributed work-farm mode of distributed processing. In this mode, you have lots and lots of CPU-consuming jobs, all about the same size, and you want to run them in parallel, by dispatching them to work-servers across the network. The workers crunch the job, and as soon as they are done, get the next job, and work on that. What Gearman does is provide everything you need to make the above easy: start and manage the remote servers; send out jobs, so that all of the workers are busy; maintain a queue of unfinished jobs; shut down the workers when the work queue is empty. It's actually a pretty neat, good idea. It's a winner, because it's simple.
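The work-farm pattern itself is simple enough to sketch in ordinary Python, with threads standing in for networked work-servers and a trivial job standing in for a CPU-heavy query. None of this is Gearman or AtomSpace code; it just shows the shape of the dispatch loop:

```python
from concurrent.futures import ThreadPoolExecutor
import queue

def work_farm(jobs, n_workers, crunch):
    """Toy work farm: a shared queue of jobs, with n workers pulling
    from it until it drains -- the dispatch pattern Gearman packages
    up for you across a network."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)
    results = []                      # appends are atomic under the GIL

    def worker():
        while True:
            try:
                job = q.get_nowait()  # grab the next unfinished job
            except queue.Empty:
                return                # queue drained: worker shuts down
            results.append(crunch(job))

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(n_workers):
            pool.submit(worker)
    return results                    # pool exit waits for all workers

totals = work_farm([10, 100, 1000], n_workers=2,
                   crunch=lambda n: sum(range(n)))
```

A networked version would replace the local queue with a dispatcher AtomSpace and the threads with remote workers reached via CogStorageNodes.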

The only real problem with Gearman is the message format. We should probably create something like a GearmanStorageNode. On the other hand, work-farms are just ... not that complicated, and it is just not that hard to build one out of CogStorageNodes. Doing this remains an open problem. I think something neat and interesting could be done in this space. It is not entirely obvious that we actually need Gearman to do it.

I think that creating a work-farm for the AtomSpace would be cool and fun, but we kind of need a practical use-case for it. Otherwise, it's just another solution looking for a problem.