Network scalability refers to the issues that come up when one actually tries to use multiple, networked AtomSpaces in a distributed fashion. Some of these issues, and some real-world experience, are documented in this wiki page. They are collected here because a realistic design for a distributed AtomSpace needs to address and solve the issues described here. Without analyzing what the actual usability problems are, any proposed solution is likely to fail.
Keep in mind that there may be multiple distributed AtomSpace solutions. Organizing multiple AtomSpaces together into a functioning system is like organizing a construction crew, or organizing office workers. In general, there are a variety of different roles and functions, and a variety of different ways of organizing. It is more important to identify these roles, than it is to argue about "the correct solution".
As of this writing (mid 2021), we have two prototypical applications, the language learning project, which deals with large databases of words and grammars, and the agi-bio/MOZI project, which deals with large databases of proteins, genes and biochemical interactions. Experiences and requirements are documented below.
Natural language processing
Here are some things that we have learned, and issues that have come up:
- The PostgresStorageNode scales "just fine" to at least six AtomSpace instances running on distinct network-connected multi-core CPU's. These are AtomSpaces that save and restore Atoms to Postgres, using the shared AtomSpace model of network communications, as defined in Networked AtomSpaces.
- Postgres is slow. Or rather, the PostgresStorageNode is large and complex, and interacting with an SQL database requires a lot of copying of data and various intricacies to reassemble an Atom from SQL queries. This complexity is significant, and limits performance. Measured on a 2015-era server CPU (24 cores, 256GB RAM), the PostgresStorageNode was capable of moving about 2K Atoms/second. This is with extensive tuning of the postgres configuration, and a threaded design for the PostgresStorageNode, complete with four asynchronous write-back queues. Measurements showed that in excess of 50% of the SATA bandwidth was being consumed. That is, of all the potential and real bottlenecks, the bandwidth to the disk drives appears to be one of them. (Use iostat, vmstat and iotop to get this info.)
- RocksDB is fast. The RocksStorageNode provides a single-user, direct-to-disk storage mechanism. It is much, much simpler than the Postgres node. It stores all Atoms as UTF-8 strings -- as Atomese s-expressions. These are easy and fast to convert into in-RAM Atoms. The resulting performance, on the same hardware as above, with the same kinds of datasets, is about 10K Atoms/second -- that is, about 5x faster.
- On average, one Atom takes about 1 to 1.5KBytes of RAM. This can be directly observed by loading 50 million Atoms, and seeing that this requires approx 64GBytes of RAM, as viewed with vmstat or top. Most of this RAM is used up by the atom incoming set, and the AtomSpace indexes. The C++ class Atom is much smaller, about 100 Bytes. Two conclusions: (1) you don't need scaling if you have fewer than 100 million Atoms. Just buy more RAM. (2) It's not the size of the C++ objects, it's the size of the incoming set that dominates RAM usage.
- Atoms have a Zipfian distribution. That is, half of all Nodes have two incoming Links; a fourth of them have four, an eighth of them have eight, and so on. There's a handful of atoms with millions of incoming Links. In other words, the (raw) language network is a scale-free network. It seems probable that most natural data will also be scale-free, e.g. the biology and genomics data. Again: most RAM is eaten by the incoming set.
- In a typical language-learning compute job, bulk-save to disk is the primary bottleneck. That is, for many of the processing steps, disk I/O dominates: the math of doing some bulk compute pass over the data can be computed in a few hours; the save to disk of the result can take almost a whole day. With the Postgres storage node, the Atom store rates are pathetic, on the order of thousands per second (millions per hour). The RocksDB node is much faster, changing the dynamics a fair bit.
- More comments about Postgres performance: During bulk-save to disk, the primary bottleneck is the disk-drive write performance; the I/O bandwidth to the disk is almost but not quite a bottleneck. This is on a circa-2014 CPU with SATA II and circa-2018 SSD's with upper-tier write performance. The SATA links seem to run at about 50% or 75% of max bandwidth; the disks themselves are the bottleneck (configured as RAID-1). This is SSD; don't even try this with spinning disks. RAID-striping might help with disk-writes. It takes about 2 or 3 CPU cores to saturate the disks in this way; this includes the total CPU cycles burned by both the AtomSpace and by the Postgres servers. That is, a 4-core machine is sufficient to handle bulk store from a single AtomSpace. I expect newer CPU's with PCIe SSD's to have a higher throughput overall, but to have a similar balance in the bottleneck profile. Splurge on PCIe SSD's, you'll be glad you did.
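The figures quoted in the list above can be combined into a quick back-of-envelope estimate. The sketch below is plain arithmetic in Python; its only inputs are numbers stated above (50 million Atoms in roughly 64GBytes, about 2K Atoms/second for Postgres, about 10K Atoms/second for RocksDB). The function names are made up for illustration, and "GBytes" is assumed to mean 2^30 bytes.

```python
# Back-of-envelope estimates from the figures quoted above.
# Function names are illustrative only; they are not part of any AtomSpace API.

GIB = 2**30  # assumption: "GBytes" is read as 2^30 bytes


def bytes_per_atom(total_ram_bytes: float, num_atoms: int) -> float:
    """Average RAM footprint of one Atom, incoming sets and indexes included."""
    return total_ram_bytes / num_atoms


def bulk_save_hours(num_atoms: int, atoms_per_second: float) -> float:
    """Hours needed to store num_atoms at a sustained store rate."""
    return num_atoms / atoms_per_second / 3600.0


if __name__ == "__main__":
    n = 50_000_000
    print(f"RAM per Atom: {bytes_per_atom(64 * GIB, n):.0f} bytes")
    print(f"Postgres (~2K Atoms/sec): {bulk_save_hours(n, 2_000):.1f} hours")
    print(f"RocksDB (~10K Atoms/sec): {bulk_save_hours(n, 10_000):.1f} hours")
```

For a 50-million-Atom dataset this works out to somewhat under 1.5KBytes per Atom, and to roughly seven hours of pure store time at the Postgres rate versus well under two hours at the RocksDB rate. Real jobs will differ, since sustained rates depend on the dataset and the hardware.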
Some guesswork that may help future designers:
- Postgres was configured to run in a fairly conservative mode, which does make it run slower. This was done after tuning it for speed and then experiencing multiple power outages (due to thunderstorms) that corrupted the entire database. This was a major loss of data and work effort. Conclusion: uninterruptible power supplies (UPS) are worth the expense.
- If you have a UPS, then maybe it's OK to run one instance of Postgres tuned for speed, as long as you have most of your data on another, tuned for safety. This does mean, however, more disk drives! (And more configuration and systems-management complexity). Anyway, tuning Postgres for speed can improve write performance, maybe even by a lot (factors of 4x or more??), because it does not have to sync-to-disk.
- Maybe somehow redesigning the AtomSpace Postgres driver might help with performance; but this is a major undertaking that requires an experienced (senior) systems programmer with more than several months of spare time to do the work. Maybe half-a-year. And maybe the current design is already close-to-optimal. It's not clear how much can be squeezed, here. Obvious things have been done, already.
- Maybe creating a StorageNode for some other database might help. Or maybe not. Creating another driver might require a month or two for a senior systems programmer. Characterizing it and tuning for a serious workload will take more time. Don't just write a backend, and declare victory; that's the easy part. The hard part is making it run well in actual compute scenarios.
- The primary problem with any kind of StorageNode backed by a database is the need to convert the "native" Atom structure into a form that the database can work with. This requires copying and transforming data, and this very quickly becomes a bottleneck. Sure, a Node is just a UTF-8 string and a type. Sure, a Link is just a type, plus an outgoing set. Sounds trivial, right? Turns out that converting this into SQL requires a whole lot of string copies and UUID table lookups. A lot of bouncing around between various function calls and methods. Simply switching to another database, and being naive about these issues, is not going to be some magic solution. The RocksDB driver manages to be much faster than Postgres mostly because it avoids complexity.
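To make the "a Node is just a string and a type" remark concrete, here is a minimal sketch of turning an Atom into a single s-expression string with one recursive pass and no table lookups. The `Node` and `Link` classes are hypothetical stand-ins written for this example; they are not the real AtomSpace classes, and the sketch does not claim to reproduce the exact encoding that the RocksStorageNode writes to disk.

```python
# Toy model of Atoms and their s-expression text form.
# Node and Link here are hypothetical stand-ins, not the real AtomSpace
# classes; the actual on-disk encoding of RocksStorageNode is not shown.
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Node:
    type: str   # e.g. "Concept"
    name: str   # UTF-8 string


@dataclass(frozen=True)
class Link:
    type: str                      # e.g. "List"
    outgoing: Tuple["Atom", ...]   # ordered outgoing set


Atom = Union[Node, Link]


def to_sexpr(atom: Atom) -> str:
    """Serialize an Atom to one s-expression string, recursively."""
    if isinstance(atom, Node):
        # Escape backslashes and double quotes so the name can be read back.
        escaped = atom.name.replace("\\", "\\\\").replace('"', '\\"')
        return f'({atom.type} "{escaped}")'
    inner = " ".join(to_sexpr(a) for a in atom.outgoing)
    return f"({atom.type} {inner})" if inner else f"({atom.type})"


if __name__ == "__main__":
    pair = Link("List", (Node("Concept", "cat"), Node("Concept", "animal")))
    print(to_sexpr(pair))  # (List (Concept "cat") (Concept "animal"))
```

The whole Atom, outgoing set included, becomes one string that can be used directly as a key or value in a key-value store. An SQL mapping, by contrast, has to split the same structure across rows and resolve identifiers for every element of the outgoing set.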
The AGI-Bio experiences are very different. That's because the use case is very different.
- Data replication and ownership. The data ownership problem is described here: issue #1855 describes a very large "master" dataset, which should be treated as read-only, containing the "best" copy of data (the genomic, proteomic data). Layered on top are multiple read-write deltas to it, created/explored by individual researchers who want to try out new hypotheses, or alter specific bits of data, without wrecking the master copy. Because the master dataset is huge, the individual researchers do not want to make a full copy of the master, just to do some minor tinkering.
- The ContextLink (issue #1967) partly solves the read-write layered on read-only problem described above. It still implicitly assumes that there's a single master. The very concept of "master copy" is still baked into the system.
- The need for copying/mirroring the large dataset is partly to resolve CPU-usage problems. If there are multiple scientists running multiple CPU-intensive jobs, a single AtomSpace on a single machine is not enough: multiple machines are needed. The same master AtomSpace needs to be on each.
- Authorization and security. The current AtomSpace design has only one weak security mechanism: an AtomSpace can be marked read-only. However, anyone can change this to read-write, so the read-only marking only serves to provide a safeguard from runaway accidents. There are no authentication or authorization mechanisms. Anybody can update anything.
- Query CPU crunching is a bottleneck. One typical genomics query requires finding (joining together) two proteins that react, and that also up-regulate or down-regulate genes that express one or both of these proteins. This is a five-point query: a pentagon. The number of combinatorial possibilities runs into the millions, and the queries take tens of minutes or hours to perform. On a 24-core CPU, one can run 24 of these queries in parallel; after that, a second machine is needed. Suppose that the AtomSpace is updated on the first machine. How are these updates to be propagated to the second machine? Currently, there is no "commonly accepted" solution to this problem. That is, we don't have any production-quality code that solves this.
- Lack of a dashboard. There is no generic dashboard for managing datasets. Some custom, bio-specific dashboard and user interfaces have been created for MOZI, but these are all very specific to that project. It would be nice if there was an AtomSpace dashboard that other projects could use.
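The "read-write deltas layered on a read-only master" requirement from the list above can be illustrated with a small sketch. This is a plain-dictionary model, written only to show the lookup order; `LayeredSpace` and its methods are hypothetical names, the float values are placeholder data, and nothing here is the AtomSpace API or the way ContextLink is implemented.

```python
# Hypothetical sketch of a private read-write delta over a shared master.
# Plain dictionaries stand in for AtomSpaces; this is not the AtomSpace API.
from types import MappingProxyType
from typing import Dict, Mapping, Optional, Set


class LayeredSpace:
    """One researcher's view: private changes on top of a shared master."""

    def __init__(self, master: Mapping[str, float]) -> None:
        self._master = master                # shared; never written to here
        self._delta: Dict[str, float] = {}   # private additions and overrides
        self._removed: Set[str] = set()      # private deletions (tombstones)

    def get(self, atom: str) -> Optional[float]:
        """Look in the private layer first, then fall back to the master."""
        if atom in self._removed:
            return None
        if atom in self._delta:
            return self._delta[atom]
        return self._master.get(atom)

    def set(self, atom: str, value: float) -> None:
        self._removed.discard(atom)
        self._delta[atom] = value

    def remove(self, atom: str) -> None:
        self._delta.pop(atom, None)
        self._removed.add(atom)

    def delta_size(self) -> int:
        """Number of private entries; independent of the master's size."""
        return len(self._delta) + len(self._removed)


if __name__ == "__main__":
    # MappingProxyType gives a read-only view, so writes to it would raise.
    master = MappingProxyType({'(Gene "A")': 0.9, '(Gene "B")': 0.5})
    alice, bob = LayeredSpace(master), LayeredSpace(master)
    alice.set('(Gene "A")', 0.1)   # Alice's hypothesis; Bob is unaffected
    print(alice.get('(Gene "A")'), bob.get('(Gene "A")'))
```

Each researcher's storage cost grows with the size of their own changes rather than with the size of the master. Note that this sketch shares the limitation pointed out above: it still assumes a single master, and it says nothing about keeping copies of that master in sync across machines.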
As of early 2015, here's what else we had:
- An early implementation of PLN. This has never been run to the point where it exhausts all of the CPU cycles and RAM of a single CPU. That is, it has not yet been tuned to the point where it runs extremely well on 2015-era CPU's (12 or 16-core, 32GB or 64GB RAM). Thus, its bottlenecks remain unknown, and it is not clear how to scale it up to a distributed system.
What to conclude about scalability
Thus, we don't really know what issues are going to bite us in the ass, not just yet. We can speculate and argue, but such arguments will be rendered moot by the reality of actually having to make something that works. And to make something that works, we will need to understand what the choke-points and the bottlenecks of these applications will be. And right now, we just don't know.
Worse: we don't know which bottlenecks might be easy to overcome, by minor algorithmic changes, and which bottlenecks appear to be deep and fundamental, requiring a major re-think of the system. So: it's easy to imagine monsters lurking in the shadows, but many can be dispatched with small, easy changes, rather than expensive, agonizing, total redesigns. Given the lack of data, it is not wise to be either optimistic or pessimistic.
That said, this proposal from Ben tries to go farther, and look into various issues and problems not addressed below.