The ProxyNode is a kind of StorageNode that acts as a (private) base class for the various kinds of I/O proxying agents. Proxy agents manipulate data (Atoms and Values) and pass it on from one storage endpoint to another.
The prototypical example is the agent that moves data between disk and the network. A remote AtomSpace client (say, the CogStorageNode) only has access to the AtomSpace sitting in the CogServer. If that AtomSpace is empty (say, because the server was recently started) and all Atoms sit on the disk (say, in a RocksStorageNode), there is the issue of getting Atoms off the disk and into RAM (into the AtomSpace) so that they can be served over the network. One solution is to load everything, but this doesn't work if the dataset is too large to fit in RAM. The proxy agents solve this problem by forwarding requests from the client, over the network, through the server, down to the disk, and then passing the results back to the client.
Here's a visual example: the Link Grammar parser, sitting on top of a local AtomSpace, wishes to read Atoms from a remote AtomSpace. The remote AtomSpace is empty, because everything is sitting on disk. The proxy agent, when properly configured, will fetch the desired Atoms from the RocksDB and pass them up to LG.
                                            +----------------+
                                            |  Link Grammar  |
                                            |    parser      |
                                            +----------------+
                                            |   AtomSpace    |
    +-------------+                         +----------------+
    |             |                         |                |
    |  CogServer  | <<==== Internet ====>>  | CogStorageNode |
    |             |                         |                |
    +-------------+                         +----------------+
    |  AtomSpace  |
    +-------------+
    |    Rocks    |
    | StorageNode |
    +-------------+
    |   RocksDB   |
    +-------------+
    | disk drive  |
    +-------------+
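The request path in the example above can be sketched as a toy model. This is not the AtomSpace API: the classes `DiskStore`, `ProxyingServer` and `RemoteClient` are hypothetical stand-ins for the RocksStorageNode, the CogServer (with a read proxy configured) and the CogStorageNode, and Atoms are modelled as plain strings. It also assumes that the server keeps whatever it fetched in its own AtomSpace.

```python
class DiskStore:
    """Hypothetical stand-in for the RocksStorageNode: holds the full dataset."""

    def __init__(self, atoms):
        self._atoms = dict(atoms)
        self.reads = 0  # counts how often the disk was actually consulted

    def fetch(self, name):
        self.reads += 1
        return self._atoms.get(name)


class ProxyingServer:
    """Hypothetical stand-in for the CogServer with a read proxy configured.

    The in-RAM AtomSpace starts out empty; a miss is forwarded to the disk.
    """

    def __init__(self, disk):
        self.atomspace = {}
        self._disk = disk

    def fetch(self, name):
        if name not in self.atomspace:
            value = self._disk.fetch(name)
            if value is None:
                return None
            # Assumption: the fetched Atom stays in the server's AtomSpace.
            self.atomspace[name] = value
        return self.atomspace[name]


class RemoteClient:
    """Hypothetical stand-in for the CogStorageNode: talks only to the server."""

    def __init__(self, server):
        self._server = server
        self.atomspace = {}  # the local AtomSpace the parser sits on

    def fetch(self, name):
        value = self._server.fetch(name)
        if value is not None:
            self.atomspace[name] = value
        return value


disk = DiskStore({'(Concept "cat")': {"count": 42}})
server = ProxyingServer(disk)
client = RemoteClient(server)

# The server's AtomSpace is empty, yet the client still gets its Atom:
# the request travels client -> server -> disk and the result comes back.
print(client.fetch('(Concept "cat")'))
```

Only the Atoms that were actually asked for end up in RAM on the server; nothing is bulk-loaded.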
List of current agents
Proxy agents currently include:
- ReadThruProxy -- Passes on requests involving the reading of Atoms and Values. This includes `fetch-atom`, `fetch-value`, `fetch-incoming-set` and `fetch-incoming-by-type`. This is a load-balancing proxy: if there are multiple providers, it will pick one and use that for a given request. This can be used to avoid overloading any one data provider.
- SequentialReadProxy -- Pass read requests on to the first StorageNode. If that succeeds, return the result. Otherwise try the next in line, until either the Atom/Value is found or the list is exhausted. This allows read-write overlays to be constructed on top of large read-only shared datasets. Another use is to attempt a read from a local dataset first, falling back to a network read only if the data cannot be found locally.
- CachingProxy -- If an Atom or Value is already in the AtomSpace, do nothing. Otherwise go to the StorageNode to get it. This allows giant datasets to be kept on disk, while having the AtomSpace swap in just a tiny fragment of it, for processing. This would be useful on small-memory devices (e.g. on Android phones or Raspberry Pi) or with extremely large datasets (e.g. genomic/proteomic datasets, of which only a small part needs to be accessed.)
- WriteThruProxy -- Passes on requests involving the storing of Atoms and Values. This includes `store-atom`, `store-value`, and `store-referrers`. This is a mirroring proxy: if there is more than one target, the write will be made to all of them. This is primarily useful for data-sharing: an incoming write can be redistributed to all interested parties, wherever they may be.
- ReadWriteProxy -- Combines reading and writing into one. This allows the read source to be split from the write target. Useful for constructing read-write overlays on top of read-only datasets.
- DynamicDataProxy -- Create FormulaStreams and FutureStreams on the fly. Experimental.
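The routing policies in the list above can be illustrated with a toy model. This is a sketch, not the real implementation: `Store` is a hypothetical stand-in for a StorageNode, the class names only mirror the proxy names, and round-robin is merely one assumed way for a load balancer to "pick one" provider.

```python
import itertools


class Store:
    """Hypothetical stand-in for a StorageNode: a named key/value map."""

    def __init__(self, name, data=None):
        self.name = name
        self.data = dict(data or {})

    def fetch(self, key):
        return self.data.get(key)

    def store(self, key, value):
        self.data[key] = value


class ReadThru:
    """Load balancing: each read goes to exactly one provider."""

    def __init__(self, providers):
        # Assumption: round-robin selection; the real policy may differ.
        self._next = itertools.cycle(list(providers))

    def fetch(self, key):
        return next(self._next).fetch(key)


class SequentialRead:
    """Fallback: try providers in order until one of them has the key."""

    def __init__(self, providers):
        self._providers = list(providers)

    def fetch(self, key):
        for provider in self._providers:
            value = provider.fetch(key)
            if value is not None:
                return value
        return None  # list exhausted


class WriteThru:
    """Mirroring: every write goes to every target."""

    def __init__(self, targets):
        self._targets = list(targets)

    def store(self, key, value):
        for target in self._targets:
            target.store(key, value)


class ReadWrite:
    """Reads come from one place, writes go to another."""

    def __init__(self, reader, writer):
        self._reader = reader
        self._writer = writer

    def fetch(self, key):
        return self._reader.fetch(key)

    def store(self, key, value):
        self._writer.store(key, value)


# A read-write overlay on top of a read-only shared dataset.
shared = Store("shared", {"a": 1, "b": 2})  # large, never written to
overlay = Store("overlay")                  # small, local, writable
proxy = ReadWrite(SequentialRead([overlay, shared]), WriteThru([overlay]))
proxy.store("b", 20)  # lands in the overlay only; `shared` is untouched
```

After the write, reading `"b"` through the proxy finds the overlay's copy first, while `"a"` still falls through to the shared dataset.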
List of ideas for other Proxies
Because proxies see Atom traffic, they are in a position to modify data as it flows by. The DynamicDataProxy is an early experimental attempt to do this: it decorates all read requests for a FormulaStream. The current intent is that it can be used to (re-)calculate Values on the fly, from current data.
Other interesting ideas include:
- A FilterProxy, that only passes on certain kinds of data.
- A LazyWriterProxy, that avoids writing right away. This allows writes to be deduplicated. (This is "easy" to implement; we've already got thread-safe deduplication code here: /opencog/util/async_buffer.h -- just wrap that in a Proxy and we're done.)
- A RememberingAgent. Somewhat like the above, it only writes out Atoms with long-term importance.
- Episodic memory. This could be done as a combination of the CachingProxy and the LazyWriter and/or the RememberingAgent: the CachingProxy maintains the "data of the moment" that is of hot interest and, when the moment has passed, that data is written out to disk.
- ShardProxy, for managing sharded datasets. Every Atom has a 64-bit hash. That hash is always available; it is widely used by the AtomSpace to perform all sorts of management tasks. It can also be used for sharding, i.e. as the key of a DHT (distributed hash table). Read or write requests for all Atoms in some given range of hashes can be sent to one StorageNode, and those for a different range to a different node. Seems like this would be easy, almost trivial to implement. This would enable the use of ginormous data sets: a one terabyte disk can hold almost a billion Atoms, and you can certainly stick several disks into one computer. Sharding becomes interesting only if your dataset does not fit into a dozen terabytes, i.e. if your dataset has ten billion Atoms. That's pretty outrageous; no one has anything like that, yet. (But I'm getting there...)
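The ShardProxy idea can be sketched as below. Nothing here exists yet: `ShardRouter` and `atom_hash` are hypothetical names, shards are modelled as plain dicts, and BLAKE2b is used only as a convenient source of 64-bit hashes (the AtomSpace computes its own hash by other means). The 64-bit hash space is cut into equal contiguous ranges, one per shard.

```python
import hashlib

HASH_BITS = 64


def atom_hash(atom: str) -> int:
    """Hypothetical 64-bit hash of an Atom, here given as a string."""
    digest = hashlib.blake2b(atom.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ShardRouter:
    """Route each Atom to the shard that owns its range of hashes."""

    def __init__(self, shards):
        self._shards = list(shards)
        # Width of each contiguous hash range.
        self._width = (1 << HASH_BITS) // len(self._shards)

    def shard_for(self, atom):
        # When the shard count does not divide 2**64 evenly, the largest
        # hashes would index one past the end; clamp them into the last shard.
        index = min(atom_hash(atom) // self._width, len(self._shards) - 1)
        return self._shards[index]

    def store(self, atom, value):
        self.shard_for(atom)[atom] = value

    def fetch(self, atom):
        return self.shard_for(atom).get(atom)


shards = [{}, {}, {}]  # three stand-ins for three StorageNodes
router = ShardRouter(shards)
router.store('(Concept "cat")', {"count": 42})
print(router.fetch('(Concept "cat")'))
```

Because the shard is a pure function of the hash, a read always goes to the same shard that took the write, with no lookup table needed.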
The basic ProxyNode infrastructure means that none of this is all that hard to implement. All the basic infrastructure is in place, and it works.