Multithreading

Large parts of OpenCog are thread-enabled; this includes the AtomSpace and the Scheme interpreter. However, some parts still are not. If you want to port parts of OpenCog to support multithreading, you've come to the right page.

There are several convenient ways to parallelize an algorithm without explicitly having to deal with the usual threading machinery. You can use

  1. Parallelized STL algorithms,
  2. the new C++11 async function.

GCC implements both (although you need version 4.5.2 or later for async).

In addition, there are several useful tools in cogutil for working with threads. These are:

  • opencog/util/async_method_caller.h - Thread-safe, multi-threaded asynchronous write queue. Provides a simple way to call a method on a class asynchronously; that is, the method is called in a *different* thread, at some later time. This can be very handy if the method takes a long time to run, or if it blocks waiting on I/O. By running in a different thread, it allows the current thread to return immediately. It also enables concurrency: a large number of threads can handle the time-consuming work in parallel, while the master thread zips along. Unlike C++11 async, this is a "fire-and-forget" system: the queued tasks will eventually be accomplished; the promise is explicit, the future is implicit. Load balancing is done via a blocking hi/lo watermark system. The number of write threads is configurable.
  • opencog/util/concurrent_queue.h - Implements a thread-safe queue (FIFO): any thread can push items onto the queue, and any other thread can remove them. If the queue is empty, a thread attempting to remove an item will block; if something is then added to the queue while some thread is blocked on it, that thread will be woken up.
The functionality provided here is almost identical to that of the pool class (also in this directory), but with a fancier API that allows cancellation, among other minor utilities. This API is most easily understood as a producer-consumer API, with producer threads adding items to the queue and consumer threads removing them. By contrast, the pool API is a borrow-and-return API, which is really more or less the same thing, just approached with a different mindset. This API also matches the proposed C++ standard for this basic idea (ISO/IEC JTC1 SC22 WG21 N3533, C++ Concurrent Queues). A minimal sketch of the blocking-queue behaviour appears after this list.
  • opencog/util/pool.h - Thread-safe blocking resource allocator. This maintains a finite-sized pool of resources that can be borrowed and returned, much like a library book. As long as the pool is not empty, resources can be borrowed; if the pool is empty (there is nothing to be borrowed), the borrow() method will block until the pool becomes non-empty again. When a resource is returned, one of the blocked threads is woken up and given the resource. Wakeups happen in arbitrary order: the thread that has waited longest is not necessarily the one that gets woken up. Although you might think that boost::pool already does this, you'd be wrong; it's mis-designed.
This is similar to concurrent_queue (also in this directory) but with a simpler, less sophisticated API. The main difference is that this API uses names that make it clear that resources are being borrowed and returned, whereas the concurrent_queue API is geared towards the producer-consumer mode of thinking.
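
To make the producer-consumer pattern concrete, here is a minimal, self-contained sketch of the blocking-queue behaviour described above, written against plain C++11. It is illustrative only: the real opencog/util/concurrent_queue.h has a richer API (cancellation and so on), and the class and method names below are invented for the example.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative stand-in for concurrent_queue: push never blocks,
// pop blocks while the queue is empty.
template <typename T>
class blocking_queue
{
    std::queue<T> _q;
    std::mutex _mtx;
    std::condition_variable _not_empty;
public:
    void push(T item)
    {
        {
            std::lock_guard<std::mutex> lk(_mtx);
            _q.push(std::move(item));
        }
        _not_empty.notify_one();   // wake one blocked consumer, if any
    }

    T pop()
    {
        std::unique_lock<std::mutex> lk(_mtx);
        _not_empty.wait(lk, [this]{ return !_q.empty(); });  // block while empty
        T item = std::move(_q.front());
        _q.pop();
        return item;
    }
};

int main()
{
    blocking_queue<int> q;
    std::thread producer([&]{ for (int i = 0; i < 10; i++) q.push(i); });
    std::thread consumer([&]{ for (int i = 0; i < 10; i++) std::cout << q.pop() << "\n"; });
    producer.join();
    consumer.join();
}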

Smart Pointers

The AtomSpace uses smart pointers for memory management. Shared pointers are almost thread-safe, but not quite; their atomic thread-safety is subtle. See the Boost documentation and the cppwisdom article "shared_ptr is almost thread safe". This affects how shared pointers are used for TruthValues and AttentionValues, or any other kind of mutable data; Handles are not affected, because Handles are immutable.
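
In short: concurrent reads of *distinct* shared_ptr copies are safe, but concurrently reading and writing the *same* shared_ptr object is a data race. C++11 provides atomic free functions for exactly this case. The sketch below illustrates the distinction; TruthValuePtr here is an invented stand-in for an atomspace-style smart pointer, not the actual atomspace type.

#include <atomic>
#include <memory>

struct TruthValue { double strength; double confidence; };
typedef std::shared_ptr<TruthValue> TruthValuePtr;

TruthValuePtr g_tv(new TruthValue{0.5, 0.5});

// UNSAFE if another thread assigns to g_tv concurrently:
//     TruthValuePtr local = g_tv;   // data race on the pointer object itself

// Safe in C++11: use the atomic free functions on the same object.
TruthValuePtr read_tv()
{
    return std::atomic_load(&g_tv);            // atomic copy of the pointer
}

void write_tv(TruthValuePtr tv)
{
    std::atomic_store(&g_tv, std::move(tv));   // atomic replacement
}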

Parallelized STL algorithms

If your algorithm already relies on STL algorithms, this might be the right approach for you. Even if it doesn't, it may still be, especially since with C++11 lambdas, using STL algorithms is very easy and versatile. If you don't know how to use lambdas, go learn them immediately!

To turn your STL algorithm into a multi-threaded one, just add

#include <opencog/util/oc_omp.h>

to your file, and explicitly prefix your STL algorithm with the OMP_ALGO namespace.

For instance, if your code was

for_each(my_vector.begin(), my_vector.end(), f);

replace it with

OMP_ALGO::for_each(my_vector.begin(), my_vector.end(), f);

This might work fine out of the box; if it doesn't, you may need to tweak some parameters. Note also that this only works for containers with random-access iterators. In that respect you might find the function

random_access_view

in

opencog/util/functional.h

helpful, as it returns a vector of const pointers into a given container. A sketch of the idea follows.
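
Check opencog/util/functional.h for the exact signature of random_access_view; the hand-rolled helper below (with an invented name) merely illustrates the idea of building a random-access view over a node-based container, so that OMP_ALGO can be used on it.

#include <list>
#include <vector>

// Build a vector of const pointers into a container that lacks
// random-access iterators (illustrative stand-in for random_access_view).
template <typename Container>
std::vector<const typename Container::value_type*>
make_view(const Container& c)
{
    std::vector<const typename Container::value_type*> view;
    view.reserve(c.size());
    for (const auto& item : c)
        view.push_back(&item);
    return view;
}

// Usage: run the parallel algorithm over the view, not the list itself.
// std::list<double> my_list;
// ...
// std::vector<const double*> view = make_view(my_list);
// OMP_ALGO::for_each(view.begin(), view.end(),
//                    [](const double* d) { /* work with *d */ });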

Regarding tweaking OpenMP, in

opencog/util/oc_omp.h

there is a function

setting_omp

that takes a few parameters to help you tweak OpenMP. Currently it takes only

  1. the number of threads to use (this includes the parent thread, which also takes part in the computation),
  2. the minimum number of iterations to trigger parallelization

The minimum number of iterations is set to 100 by default, but not for all STL algorithms; take a look at setting_omp and improve it if necessary.

Of course, don't forget to call setting_omp in your code (at least once, before the parallelization takes place) for the settings to take effect. A short sketch putting all of this together follows.
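
Here is a hedged sketch. It assumes, per the description above, that setting_omp takes the number of threads followed by the minimum iteration count, and that it lives in the opencog namespace; check oc_omp.h for the exact signature.

#include <vector>
#include <opencog/util/oc_omp.h>

int main()
{
    // Call once, before any parallel section: 8 threads, and only
    // parallelize loops of at least 100 iterations (assumed signature).
    opencog::setting_omp(8, 100);

    std::vector<double> v(100000, 2.0);
    OMP_ALGO::for_each(v.begin(), v.end(),
                       [](double& x) { x *= x; });  // iterations run across threads
}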

Last but not least, if

f

is accessing shared resources, you will probably need to add some locks. See the section Shared Resources below.

C++11 async

C++11 offers a very convenient function for wrapping a function call so that it runs in its own thread.

Just add the following include

#include <future>

and replace for instance

result_type y = f(x1, x2);
do_stuff();
do_stuff_with_y(y);

by

std::future<result_type> f_async = std::async(f, x1, x2);
do_stuff();
do_stuff_with_y(f_async.get());

And that's it: f(x1, x2) and do_stuff() are going to run in parallel. If that doesn't work, you may need to force async to really run asynchronously; by default, async decides on its own, and in my experience it doesn't always decide right. The piece of code above would become

std::future<result_type> f_async = std::async(std::launch::async, f, x1, x2);
do_stuff();
do_stuff_with_y(f_async.get());

This makes parallelizing recursive functions pretty easy. However, if you need to force async to run asynchronously, you may also need to control whether each recursive call should be asynchronous or not; if everything is asynchronous, you'll get a lot of overhead from thread management.

Have a look at the function

get_nondominated_rec

in file

opencog/learning/moses/moses/metapopulation.h

for an example of how to parallelize a recursive function.
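
If you can't consult that file, here is a generic sketch of the same idea (not the actual metapopulation code): a divide-and-conquer recursion where std::launch::async is forced only for the top few levels, and everything below the depth cutoff runs sequentially to avoid thread-management overhead.

#include <future>
#include <numeric>
#include <vector>

// Parallel recursive sum. 'depth' limits how many levels may spawn
// threads: sum_rec(..., 3) uses at most 2^3 = 8 threads.
double sum_rec(const double* begin, const double* end, int depth)
{
    if (end - begin < 1024 || depth <= 0)        // small or deep: stay serial
        return std::accumulate(begin, end, 0.0);

    const double* mid = begin + (end - begin) / 2;

    // Recurse on the left half in a new thread...
    std::future<double> left =
        std::async(std::launch::async, sum_rec, begin, mid, depth - 1);

    // ...while this thread handles the right half.
    double right = sum_rec(mid, end, depth - 1);

    return left.get() + right;
}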

Shared Resources

C++11 offers mutual exclusion locks. (Note that shared "reader-writer" mutexes entered the standard later: std::shared_lock in C++14 and std::shared_mutex in C++17. Before that, boost::shared_mutex provided the same functionality.)

What you usually need is

  1. a unique lock, for writing
  2. a shared lock, for reading

At first I was using only unique locks, but that can really slow down your computation (depending on your case, of course). So use BOTH.

Here's a simple example. Add the following typedefs to your code:

typedef std::shared_mutex shared_mutex;
typedef std::shared_lock<shared_mutex> shared_lock;
typedef std::unique_lock<shared_mutex> unique_lock;

Instantiate your shared mutex

shared_mutex my_mutex;

When you need to write, do

{
    unique_lock lock(my_mutex);
    write_stuff();
}

When you need to read, do

{
    shared_lock lock(my_mutex);
    read_stuff();
}

The code is put in a block so that the lock is automatically released when the block ends.

If you want your block to return something, you can use a lambda:

my_object_to_read = [&]() {
    shared_lock lock(my_mutex);
    return read_stuff();
}();

Finally, beware that locks can slow down the computation, potentially severely; in the extreme case they can make your algorithm use only a single physical core at a time, plus the threading overhead. If that seems to be an issue, you might consider duplicating and then merging the resource rather than locking it, as sketched below.
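
Here is a sketch of that duplicate-and-merge idea (the function and its task are invented for illustration): each thread appends to its own private vector, so no lock is taken in the hot loop, and the partial results are concatenated once at the end, in a single thread.

#include <cstddef>
#include <thread>
#include <vector>

std::vector<int> find_evens(const std::vector<int>& input, unsigned nthreads)
{
    std::vector<std::vector<int>> partial(nthreads);  // one buffer per thread
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < nthreads; t++) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < input.size(); i += nthreads)
                if (input[i] % 2 == 0)
                    partial[t].push_back(input[i]);   // thread-private: no lock
        });
    }
    for (auto& w : workers) w.join();

    std::vector<int> result;                          // single-threaded merge
    for (const auto& p : partial)
        result.insert(result.end(), p.begin(), p.end());
    return result;
}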

Again you can have a look at

opencog/learning/moses/moses/metapopulation.h

for an example. In the functions get_nondominated_rec and get_nondominated_disjoint_rec, I worked around that locking problem by manipulating only vectors of pointers. The recursive implementation using vectors of pointers actually runs faster than its iterative counterpart! Parallelizing it was then child's play using async, as explained in the section above.

Debugging

C++ multi-threading errors can be hard to debug, especially since most of the time you just end up with a segfault and no clue about what went wrong.

Fortunately, there are two tools that can really help you here:

  1. GDB handles multiple threads: you can switch from one thread to another, inspect each thread's stack, etc.
  2. Valgrind, or more precisely its Helgrind tool (run valgrind --tool=helgrind ./your_program). If you have a data race, this tool will catch it immediately, saving you hours if not days of sweating and cursing ;-).