StorageNode

The StorageNode is a Node that provides the basic infrastructure for exchanging Atoms (using load, store, fetch and query) with other AtomSpaces and/or with conventional databases (for persistent, on-disk storage). It provides interfaces both for database/disk backends and for networked AtomSpaces. The ProxyNode is the primary tool for connecting up and automating complex processing chains involving multiple AtomSpaces.

Implementations

Existing "pure storage" implementations include:

  • RocksStorageNode -- Works with RocksDB to save/restore AtomSpace data to the local filesystem.
  • MonoStorageNode -- Same as above, optimized for storing single AtomSpaces.
  • PostgresStorageNode -- Works with PostgreSQL to save/restore AtomSpace data to an SQL server.
  • CogStorageNode -- Exchange atoms with another AtomSpace on the network, using the CogServer for communications.
  • CogSimpleStorageNode -- Same as above, but provides a very simple, easy-to-understand example implementation. It can be used as a template for creating new types of StorageNodes.
  • FileStorageNode -- Read/write plain-UTF8 (plain-ASCII) s-expressions to a flat file. This supports only a subset of the API suitable for flat files, so no search or query. Ideal for dumping and loading AtomSpaces as plain text.

The ProxyNodes are StorageNodes, but instead of working directly with actual disks or networks, they are agents that perform I/O tasks. Proxies include:

  • ReadThruProxy -- provide load-balancing, by forwarding reads to one of several providers.
  • WriteThruProxy -- provide mirroring and write forwarding to multiple targets.
  • SequentialReadProxy -- walk through a fallback sequence of StorageNodes, stopping with the first to provide an answer. This allows local datasets to be probed first, before falling back to a more distant, more expensive or larger dataset.
  • ReadWriteProxy -- Split the read and write streams, so that they can be directed to two distinct places. This allows a read-write overlay to be placed on top of shared read-only datasets.
  • CachingProxy -- Go to storage, only if the requested Atom is not already in the AtomSpace. This allows AtomSpaces to stay small, while most data lives on the disk. Handy for small-RAM machines and/or humongous datasets.

All of these types, both the "pure storage" nodes and the proxies, use exactly the same API, described below.

Note that the StorageNode itself is a "private" (aka "abstract" or "pure virtual") type: it cannot be used by itself. Only the subtypes listed above can be directly created and used.
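As a rough illustration of how a proxy might be wired up, the sketch below mirrors writes to two RocksDB stores through a WriteThruProxy. It assumes that a proxy is configured with a ProxyParameters link that pairs the proxy with a List of target StorageNodes, and that the proxy types are provided by the (opencog persist) module. Neither detail is stated on this page, so check the ProxyNode documentation before relying on it. The proxy name and the database paths are placeholders.

```scheme
(use-modules (opencog) (opencog persist) (opencog persist-rocks))

; Assumption: a proxy learns about its targets from a ProxyParameters
; link that pairs the proxy with a List of StorageNodes.
(define mirror (WriteThruProxy "example mirror"))
(ProxyParameters
   mirror
   (List
      (RocksStorageNode "rocks:///tmp/mirror-a.rdb")
      (RocksStorageNode "rocks:///tmp/mirror-b.rdb")))

; From here on, the proxy is used like any other StorageNode.
(cog-open mirror)
(store-atom (Concept "foo") mirror)
(cog-close mirror)
```

The point of the design is that the calling code does not change: swapping the proxy for a plain RocksStorageNode, or for a different proxy, only changes the definition at the top.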

Flat file API

The flat-file API is designed to be high-performance and compact. It only supports a subset of the API below: one can write individual atoms, or dump the entire AtomSpace. It can load a complete file (all of it; subsections cannot be selected). There is no ability to search or query.

The main benefit of the flat-file API is that it is literally 10x faster than dumping atoms by hand to a file. Yes, we measured. Yes, the flat-file API has been performance tuned to go pretty much as fast as possible.

It is also ideal for backing up and archiving large AtomSpaces. When an Atom is stored as a plain-text s-expression, it is typically 50 to 250 bytes in size. When compressed (using bzip2) an Atom is 4 to 10 bytes in size. Thus even huge AtomSpaces have only modest sizes when dumped and compressed. Note that this is a lot smaller than the on-disk format for RocksDB: this is because RocksStorageNode maintains indexes to enable searching and queries. Those indexes take up a lot of storage! This is also about 1.3x to 4x smaller than dumping a Postgres database (of an AtomSpace) and compressing it. This is because the representation of an Atom in SQL is relatively complex.
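To put the per-Atom sizes quoted above in perspective, here is a back-of-the-envelope estimate in plain scheme. The figure of 100 million Atoms is an arbitrary, hypothetical dataset size; the byte counts are the ranges given above.

```scheme
; Hypothetical dataset size, chosen only for illustration.
(define num-atoms 100000000)

; Convert a per-Atom size in bytes into a total size in gigabytes.
(define (total-gb bytes-per-atom)
   (/ (* num-atoms bytes-per-atom) 1e9))

; Plain-text s-expressions: 50 to 250 bytes per Atom.
(format #t "plain text: ~a to ~a GB\n" (total-gb 50) (total-gb 250))

; After bzip2 compression: 4 to 10 bytes per Atom.
(format #t "compressed: ~a to ~a GB\n" (total-gb 4) (total-gb 10))
```

For this hypothetical dataset, that works out to roughly 5 to 25 GB as plain text, versus 0.4 to 1 GB after compression.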

See opencog/persist/sexpr for the implementation, and examples/atomspace/persist-store.scm for demo code.
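A minimal sketch of dumping and reloading an AtomSpace through a FileStorageNode is given below. It assumes that the FileStorageNode is provided by the (opencog persist-file) module and that it takes a plain file path as its name. The path itself is a placeholder. Consult the demo code mentioned above for the authoritative version.

```scheme
(use-modules (opencog) (opencog persist) (opencog persist-file))

; Put something in the AtomSpace, so that there is something to dump.
(Concept "foo")
(Inheritance (Concept "foo") (Concept "bar"))

; The file name is a placeholder.
(define fsn (FileStorageNode "/tmp/example-dump.scm"))

; Dump the entire AtomSpace as plain-text s-expressions.
(cog-open fsn)
(store-atomspace fsn)
(cog-close fsn)

; Later, typically in a fresh AtomSpace: read the whole file back in.
(cog-open fsn)
(load-atomspace fsn)
(cog-close fsn)
```

The resulting file is ordinary text, so it can be inspected with any editor and compressed with bzip2 for archiving.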

Socket API

Note that if you are a clever programmer who understands TCP/IP sockets, then you can use the flat-file API to push Atomese content across a TCP/IP socket. Fast. However, if you do want to do this, you should take a very close look at the CogServer. It has a custom high-speed mode, where it will accept and send Atomese across a socket, and it will also accept/send a certain subset of the commands below, just enough to implement the CogStorageNode. It's designed to be both super-fast and complete, doing everything you need to move Atoms around the network, as fast as possible.

Example

Below is an example of loading an atom from RocksDB, and then sending it to two other CogServers. It is written in scheme; you can do the same thing in python.

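; Load the modules that provide the StorageNode types used below
(use-modules (opencog) (opencog persist))
(use-modules (opencog persist-cog) (opencog persist-rocks))

; Open connections
(define csna (CogStorageNode "cog://192.168.1.1"))
(define csnb (CogStorageNode "cog://192.168.1.2"))
(define rsn (RocksStorageNode "rocks:///tmp/foo.rdb"))
(cog-open csna)
(cog-open csnb)
(cog-open rsn)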

; Load and send all Values attached to the Atom.
(fetch-atom (Concept "foo") rsn)
(store-atom (Concept "foo") csna)
(store-atom (Concept "foo") csnb)

; Specify a query
(define get-humans
   (Meet (Inheritance (Variable "$human") (Concept "person"))))

; Specify a place where the results will be linked to 
(define results-key (Predicate "results"))

; Perform the query on the first cogserver 
(fetch-query get-humans results-key csna)

; Print the results to stdout
(cog-value get-humans results-key)

; Send the results to the second cogserver,
; and save locally to disk
(define results (cog-value get-humans results-key))
(store-atom results csnb)
(store-atom results rsn)

; Perform a clean shutdown
(cog-close csna)
(cog-close csnb)
(cog-close rsn)

A detailed, working version of the above can be found in github, in the AtomSpace persistence demo and the persist-multi demo. See also the other examples in the same directory.

Additional detailed, working examples can be found in the assorted github repos.

The API

There are fifteen functions for working with storage or a remote AtomSpace. These include opening and closing a connection, sending and receiving individual Atoms, sending and receiving them in bulk, and performing precise, selective queries to get only those Atoms you want.

The names given below are the scheme names for these functions; the equivalent C++ and python names are almost the same, using an underscore instead of a dash.

The methods are:

  • cog-open -- Open a connection to the remote server or storage.
  • cog-close -- Close the connection to the remote server/storage.
  • fetch-atom -- fetch all of the Values on an Atom from the remote location.
  • fetch-value -- fetch the single named Value on the given Atom. Handy if the Atom has lots of values on it, and you don't want to waste time getting all the others.
  • store-atom -- store all of the Values on the given Atom. That is, send all of the Values to the remote location.
  • store-value -- store just the single named Value on the given Atom. Handy if you don't want to change the other Values that the remote end is holding.
  • cog-delete! -- Delete the Atom in the current AtomSpace, and also in the attached storage. The Atom will not be deleted if it has a non-empty incoming set.
  • cog-delete-recursive! -- Delete the Atom in the current AtomSpace, and also in the attached storage. If the Atom has a non-empty incoming set, then those Atoms will be deleted as well.
  • load-atomspace -- get every Atom (and the attached Values) from the remote storage, and place them into the current AtomSpace.
  • load-atoms-of-type -- get every Atom that is of the given type. This loads the Values on those Atoms as well.
  • store-atomspace -- dump the entire contents of the current AtomSpace to the remote server/storage.
  • fetch-incoming-set -- get all of the Atoms that are in the incoming set of the indicated Atom.
  • fetch-incoming-by-type -- get all of the Atoms that are in the incoming set, and are also of the given type. Very useful when an Atom has a very large incoming set, and you want only a specific subset of it.
  • fetch-query -- run the indicated query on the remote server/storage, and get all of the results of that query. You can use BindLink, GetLink, QueryLink, MeetLink and JoinLink to formulate queries. Note that the query has a built-in caching mechanism, so that calling fetch-query a second time may return the same cached query results. That cache can be explicitly cleared, so that a second call re-runs the query. This is handy when queries are large and complex and consume lots of CPU time. There is an experimental time-stamp mechanism on the cache.
  • barrier -- Force all prior network or storage operations to complete (on this particular StorageNode) before continuing with later operations.
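The following sketch strings several of the functions above together against a local RocksDB store. It is an illustration rather than a reference: the database path and the key name are placeholders, and it assumes that each function accepts the StorageNode as an optional trailing argument, following the pattern of the example earlier on this page, and that fetch-incoming-by-type takes the type as a quoted symbol.

```scheme
(use-modules (opencog) (opencog persist) (opencog persist-rocks))

; Placeholder database location.
(define rsn (RocksStorageNode "rocks:///tmp/api-sketch.rdb"))
(cog-open rsn)

; Attach a Value to an Atom, and store only that one Value.
(define foo (Concept "foo"))
(define key (Predicate "example key"))
(cog-set-value! foo key (FloatValue 1 2 3))
(store-value foo key rsn)

; Store a Link that contains foo, so that foo has an incoming set.
(store-atom (List foo (Concept "bar")) rsn)

; Force the writes above to complete before reading anything back.
(barrier rsn)

; Fetch just the one Value, then only the ListLinks that contain foo.
(fetch-value foo key rsn)
(fetch-incoming-by-type foo 'ListLink rsn)

(cog-close rsn)
```

Using store-value and fetch-value instead of store-atom and fetch-atom keeps traffic small when an Atom carries many Values and only one of them has changed.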

Detailed and precise documentation can be displayed at the scheme and python command prompts. For example, typing ,describe fetch-atom at the guile prompt will provide the man page for that function (note the comma in front of the "describe". You can shorten it to ,d). Something similar gets you the python documentation.

All of the above functions are provisional: they have been working and supported for quite a long time, but they should probably be replaced by pure Atomese. That is, there should probably be a FetchAtomLink so that running (cog-execute! (FetchAtomLink ...)) does exactly the same thing as calling fetch-atom. This would be super-easy to do; it hasn't been done, because no one has asked for it yet.

The above API is implemented here: opencog/persist/api.

Creating new implementations

The CogSimpleStorageNode provides the simplest possible (more or less) working implementation of a StorageNode. It can be used as a guideline or example template, when creating new types of StorageNodes. To do this, simply copy the code located here: https://github.com/opencog/atomspace-cog/opencog/persist/cog-simple and rename the files, gut the contents and replace them with what you need. Everything else, including the above-described API will work automatically. You will also need to add a new atom type to this file: opencog/persist/storage/storage_types.script.

If you are designing a new interface to some database system, then it is strongly suggested that the code for the RocksStorageNode be copied and modified. Do NOT use the PostgresStorageNode as an example, even if you are contemplating another SQL database. The code written for PostgresStorageNode is very old, and is deeply flawed in many ways.

See also