This page gathers a few ideas regarding how to make aspects of OpenCog better exploit multithreading, or GPUs...
Existing things you should know.
The AtomSpace uses a single global lock during atom insertion/deletion. This is a potential bottleneck if you are adding lots of atoms. In practice, its not a bottleneck, since most compute cycles are spent incrementing TruthValues or doing other Value computations; these have a per-Atom lock, and thus extremely low lock contention.
Atomese currently implements two atoms that encourage multi-threading: the ParallelLink and the ThreadJoinLink. These can be used to launch and join threads from within Atomese processes. The ParallelLink provides a scatter API, while ThreadJoinLink implements scatter-gather.
All of the currently implemented StorageNodes are parallelized, and there do not seem to be any opportunities left for further parallelization. To summarize:
- RocksStorageNode uses RocksDB, and RocksDB has a built-in highly sophisticated threading architecture.
- PostgresStorageNode has been parallelized. It uses four threads for write-back queues. This is configurable. (See the source code opencog/persist/sql/multi-driver, look for
_initial_conn_pool_size). This seems to work great. Adding more threads does nothing, since writes are bottlenecked on the SSD hard drive write performance. That is, writes are sent to the PostgreSQL server itself; it is multi-threaded, and it is bottlenecked on disk I/O. That said, the RocksStorageNode is more than four times faster than the PostgresStorageNode, so that is disappointing. The PostgressStorageNode is incredibly complex, and appears to do a whole lot of data mangling, copying and translation.
- CogStorageNode is parallelized. It uses four receive queues (also configurable) to buffer data requests that have not yet received replies. There are no obvious opportunities for further parallelization.
The current guile scheme implementation has a number of parallelism primitives, and they work. However, there seems to be a live-lock issue somewhere: launching more than about 3 or 4 guile threads, and having all of them hit the atomspace at the same time causes the system to burn lots of CPU, but not get very much done. It does not matter if these threads are launched with
par-for-each or if they launched "by hand". It live locks, no one knows why; its hard to investigate. It's super-annoying, too.
I'm not sure what the deal is. Pure guile itself does not have this problem. This is some kind of nasty interaction between guile and the AtomSpace.
Futures and Promises for the Pattern Matcher
Currently, the Pattern matcher doesn't report any results, until it is completely done; it then reports one big giant SetLink of them. github issue #1502 suggests that, instead, each match should be reported, as it is found, by using a MemberLink to add to add the the "set" of results.
Next, github issue #1750 describes a publish-subscribe system, so that interested parties can get notified of results, as they appear (as otherwise, they would have to poll to find out if the pattern matcher was done or not).
If the publish-subscribe system is implemented, it can be used to implements a futures and promises system. That is, processing pipelines can be set up, (each running in a different thread) with one step in a pipeline waiting for results from an earlier step to trickle in, and then processing it as it shows up.
The goal here is to eliminate bottlenecks, where many threads are waiting for one thread to complete.
Pattern Matcher on GPU
If anyone has a 6 months to spare and a lot of patience, a fun little project would be to implement the OpenCog Pattern Matcher on a GPU using MapGraph:
From what I can tell, conceptually the pattern matcher could be implemented using the "Gather-Apply-Scatter" methodology utilized in MapGraph...
However, the details here probably have more than the usual number of devils 8-D
Still, this paper and approach causes me to revisit my previous attitude that "large-scale graph processing and current SIMD GPUs don't mix." It seems they can if you work hard enough...
Of course you won't get the speedups that you'd get for matrix math, but still it's impressive that they can get this to work on various large-scale graph problems...
How might one efficiently implement standard "deep learning" algorithms in the Atomspace?
Suppose we have an Atomspace, consisting of a set of Atom-sets S1, ..., Sn, where each Atom-set consists of a set of Atoms arranged in a 2D or 3D grid (which could have square, triangular, etc. topology)
Each Atom A in each of the grids has some package of values associated with it, indicated by EvaluationLinks of the form
PredicateNode P ListLink Atom A NumberNode N
Suppose we then have a GroundedSchemaNode that needs to be evaluated at every Atom in a certain grid, concurrently.... And suppose this GSN happens to rely on the values stored at other Atoms in the grids in the given Atom-set
Then we have an infrastructure for implementing currently-fashionable "deep learning" approaches in the OpenCog representational framework...
Given appropriate work, it should be possible to use Theano to take such GSNs (if written in python) and make them efficiently evaluated on GPUs.... (with much greater speedup than for, say, making the pattern matcher run on GPU using mapgraph)
This is not high priority at the moment, I'm just articulating an idea before I forget it ;-)
Multihreaded Pattern Matcher
The pattern matcher can be made multi-threaded in three different ways; One way is trivial, one way is easy, and the third way is hard. It is not obvious if the second and third ways are worth the effort.
The easy way
The pattern matcher starts by first obtaining a list of all-possible start points. It then goes over these in a loop, performing a graph traversal for each starting point. It should be easy to convert this loop into an OMP_ALGO loop (OMP_ALGO is a cogutils wrapper around the C++ parallel-for-each loop). Each thread would require it's own private instance of PatternLink, so that the traversal state in one thread does not clobber that in another. Since each thread has it's own copy, no (new) locks are required. Some new code is required to collect the results; see github issue #1502. The actual effort is tracked in github issue #1645.
The hard way
Making the backtracker inside the Pattern Matcher multithreaded would be interesting. There have been nice randomized parallel backtracking algorithms around forever. But how hard it would be to make the PM use one of these, would require some study.
For exhaustive backtracking it's conceptually "straightforward" (the modern algorithms mostly descent from this old paper), but practically difficult: the pattern matcher has a lot of complex, fiddly details to handle unordered links, choice links, optional presence, mandatory absence, variable sequential greedy and non-greedy globbing, grounded predicate nodes, virtual links and disconnected graphs, etc. that make the actual (single-threaded) implementation large and complex. Back-tracking is the "easy part", not trashing the state for all of these different subsystems is the hard part.
The desire to guide the search heuristically makes things more complicated, as they mean a processor with lots of high-priority stuff on its frontier should shove some of it off to other processors that have lower-priority stuff on their frontiers.... Still, I actually suspect this wouldn't be extremely hard to do with the current PM -- it wraps most of its steps up nicely enough in callbacks...
It is not at all clear that this would somehow run faster than the "easy way", described above; where each search starting point is run in it's own thread. It might run slower, depending on how quickly one can lock and unlock each stack push/pop.
The trivial way
The trivial way is to realize that most computations require running the pattern matcher many thousands or millions of times. So, each time you run it, just run it in a different thread! Thus, for almost no coding effort at all, you get massive parallelism!
Any one single pattern match is fast, completing in milliseconds (depending very strong on the dataset), and thus is not a primary bottleneck by itself. The main bottleneck is that it has to be run lots of times, on lots of data; but that can be parallelized using the standard "embarrasingly parallel" approach.