Improving Single Machine Performance


Introduction

These are preliminary notes concerning the project of achieving high performance within a single machine, via a combination of design choices. The goals are:

  • Enable high performance within a single process and across multiple processes
  • Come up with mechanisms and policies for coordination among processes
  • Get rid of the mandatory requirement for MindAgents to be run via the round-robin scheduler, and of the mandatory notion of a cycle as the only way to coordinate MindAgents operating on the same Atomspace.
  • Flesh out the use of multiple MindAgents in the same machine when appropriate.

In terms of the Concentric Rings Architecture View, this is about optimizing the performance of cognitive processes on Ring 2 and, in the process, probably making improvements to Ring 1.


Current Architecture

We commonly think of the current architecture as single-threaded, although that's not strictly true. Within the CogServer we have a thread that runs MindAgents, through the round-robin scheduler. This scheduler implements the notion of a cycle, and things that happen inside the cycle happen in a single-threaded, cooperative way. See the diagram below.

Figure 1. Single threaded execution by the scheduler
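
As a purely illustrative C++ sketch (not the actual CogServer code), the current model boils down to a loop that hands each registered MindAgent one run() call per cycle:

  #include <memory>
  #include <vector>

  // Purely illustrative simplification of the current single-threaded model:
  // one thread, one scheduler loop, and each registered agent gets exactly
  // one run() call per cycle, cooperatively yielding control when it returns.
  class MindAgent
  {
  public:
      virtual ~MindAgent() = default;
      virtual void run() = 0;   // must do a small slice of work and return
  };

  void scheduler_loop(std::vector<std::unique_ptr<MindAgent>>& agents)
  {
      while (true) {                 // one iteration of this loop == one "cycle"
          for (auto& agent : agents)
              agent->run();          // agents execute strictly one after another
      }
  }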


Poor Man's Multithreading for Atomese

However, that isn't the only way cognitive processes can be run on the CogServer. We also have the NetworkServer, which listens for connections. When clients connect (typically via telnet), the connection is handled by a new thread, and cognitive processes can be run from that new thread, e.g. via cog-execute!. Multiple connections could call cog-execute! on different schemata and thus run cognitive processes in parallel in the same CogServer through Atomese. This mechanism was designed for code that finishes execution eventually (in fact, for code that finishes reasonably quickly), and I don't know what would happen if one used these connection-handling threads to run a bunch of schemata that never terminate. This has to be tested.
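
Roughly, the same effect looks like this from C++: each connection-handling thread owns its own evaluator and runs Atomese against the shared Atomspace. This is only a sketch; the SchemeEval usage and the schema names are assumptions, not taken from the actual NetworkServer code.

  #include <string>
  #include <thread>
  #include <opencog/atomspace/AtomSpace.h>
  #include <opencog/guile/SchemeEval.h>

  using namespace opencog;

  // Sketch: two threads evaluate Atomese against the same AtomSpace, roughly
  // what happens when two telnet connections each call cog-execute!.
  // SchemeEval usage details and the schema names are assumptions.
  int main()
  {
      AtomSpace as;

      auto worker = [&as](const std::string& expr)
      {
          SchemeEval evaluator(&as);   // one evaluator per connection thread
          evaluator.eval(expr);
      };

      std::thread t1(worker, "(cog-execute! chatbot-schema)");   // hypothetical schemata
      std::thread t2(worker, "(cog-execute! planner-schema)");
      t1.join();
      t2.join();
      return 0;
  }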

It looks like this approach gives us a poor man's design for parallel processing within a CogServer. It would, at the very least, let request-response code (such as a chatbot) run in parallel with long-running cognitive processes handled by the CogServer scheduler (such as ECAN). If we need multiple long-running processes in parallel, this model may break down.

Figure 2. Multi-threaded execution via cog-execute and the scheduler

Threads and Processes

We want to have multiple cognitive processes running in parallel, so we can get high performance when executing complex tasks, and so we can provide snappy responsiveness for applications like a dialog system or robot control.

First, a note on terminology. These cognitive processes can be MindAgents or they can be implemented as different objects (once we're rid of the scheduler, we're rid of the main incentive to use the MindAgent design). I will refer to them as MindAgents in this document just to avoid confusion, because I'm also talking about coordination among different OS processes.

There are two obvious ways to run MindAgents in parallel:

  1. A multi-threaded CogServer where some MindAgents run on their dedicated worker threads, and, optionally, others run on a thread controlled by the existing Scheduler (but where the notion of the cycle is rendered meaningless).
  2. Multiple CogServers, each running a single thread for MindAgents (along with the NetworkServer for connections). This is the only option for Python code due to the GIL.

In the single process case we need to deal with the complications of multi-threaded C++ programming, which is notoriously error-prone. However, we benefit from having the Atomspace in the same process.

In the multiple process case, the Linux scheduler takes care of allocating CPU to different MindAgents based on process priority (which is set by the application; orchestration scripts or libraries would be used to define the relative priorities of different processes). Dynamically changing those priorities is possible through an extra process, but it doesn't seem necessary for applications in the immediate future.
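
For instance, a background CogServer could lower its own priority at startup with the standard POSIX call, leaving an interactive CogServer on the same machine more responsive; a minimal sketch, where the nice value is an arbitrary illustration:

  #include <sys/resource.h>
  #include <cstdio>

  // Minimal sketch: a background CogServer process lowers its own priority
  // so that an interactive CogServer on the same machine stays snappy.
  // The nice value of 10 is an arbitrary choice for illustration.
  int main()
  {
      if (setpriority(PRIO_PROCESS, 0 /* the calling process */, 10) != 0)
          std::perror("setpriority");

      // ... continue with normal CogServer startup ...
      return 0;
  }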

In the multiple process scenario, we have simpler coding for running MindAgents, but need to handle the Atomspace issue. Again, there are two variations:

  1. One process holds the main Atomspace object, and the others communicate with it to manipulate that Atomspace. These processes could hold local caches, which can be stripped-down Atomspaces (or AtomTables).
  2. Each process holds its own fully-featured Atomspace, and there's no "main Atomspace". Each process can do its cognitive tasks locally, but we need to synchronize the Atomspaces. Effectively, we have a distributed Atomspace in a single machine.

In practice, we probably want to support all three variations: each CogServer process should be thread-safe and able to run MindAgents in multiple threads without the notion of a cycle; there should be ways for processes to communicate, exchanging requests as well as Atoms; and there should be policies for keeping the Atomspace in reasonably consistent shape (with no guarantee of full consistency all the time).

The big question is how to get there. There are three natural tasks, none of them trivial.


Multi-threaded CogServer

This is conceptually straightforward. The Atomspace is thread-safe. What needs to be done is:

  1. Enable execution of MindAgents (or any cognitive processes, but there should be some appropriate design here) in multiple threads. A configuration file should enumerate the threads and the objects to be run inside each thread.
  2. If a thread is to run multiple objects, they should be MindAgents and will be run under the round-robin scheduler. Otherwise, the single object gets all the CPU time given to its thread.
  3. The notion of a cycle is deprecated.

Figure 3. Multi-threaded execution via new scheduler, long-running extra threads and cog-execute

With the Atomspace being thread-safe, the above should be reasonably simple to run, as long as the Atomspace is the only shared data structure. These MindAgents and processes can't share data (including direct pointers to Atoms) unless that sharing is also made thread-safe. So the simplest thing is to ensure that they only share Atoms via the Atomspace and also that they communicate via the Atomspace.
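
A minimal sketch of this arrangement, assuming a never-terminating step loop per agent and the Atomspace as the only shared object (the class names here are illustrative, not the actual CogServer API):

  #include <atomic>
  #include <thread>
  #include <vector>
  #include <opencog/atomspace/AtomSpace.h>

  using namespace opencog;

  // Illustrative only: each long-running agent gets a dedicated thread and
  // touches shared state exclusively through the (thread-safe) AtomSpace.
  struct LongRunningAgent
  {
      virtual ~LongRunningAgent() = default;
      virtual void step(AtomSpace& as) = 0;   // one unit of work, no global cycle
  };

  void run_forever(LongRunningAgent& agent, AtomSpace& as,
                   std::atomic<bool>& keep_going)
  {
      while (keep_going.load())
          agent.step(as);   // no per-cycle synchronization with other agents
  }

  // Startup code would read the configuration file, instantiate the listed
  // agents, and then launch roughly one worker thread per agent:
  //
  //   std::vector<std::thread> workers;
  //   for (auto& agent : agents)
  //       workers.emplace_back(run_forever, std::ref(*agent),
  //                            std::ref(atomspace), std::ref(keep_going));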

It's of course entirely possible that the Atomspace will require performance tuning for this to work well, but we need some concrete use cases to drive that performance tuning.

It's also possible (likely) that I'm ignoring some reason why this would be harder than it currently seems to me.


Inter-process Communication

If we have multiple processes, they need to communicate in some way, both to send requests and receive responses, and to share Atoms.

As with MindAgents being run in parallel in the same process, the simplest alternative is for them to communicate via the Atomspace. In the multi-process scenario this means using the backing store -- saving and loading specific Atoms which embody requests, with the responses stored as Atoms linked to the requests by some special convention. This isn't elegant (processes have to poll the DB server for the Atoms of interest) and it's not likely to be very performant, but it's simple and supported by existing code.
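
To make the convention concrete, here is one hypothetical encoding: the requester stores a "request" EvaluationLink, and the responder later links its answer back to it with a "response-to" link; each side then polls the backing store for the Atoms it's waiting on. The predicate names are invented, and the add_node/add_link calls are assumed to match the current C++ Atomspace API.

  #include <opencog/atomspace/AtomSpace.h>

  using namespace opencog;

  // Sketch of a possible request/response convention built from plain Atoms.
  // The predicate names ("request", "response-to") are invented for
  // illustration; the real convention would have to be agreed upon.
  Handle post_request(AtomSpace& as, const Handle& payload)
  {
      Handle req = as.add_node(PREDICATE_NODE, "request");
      return as.add_link(EVALUATION_LINK, req, payload);   // marks a pending request
  }

  // The responding process, after loading the request from the backing store,
  // attaches its answer so the requester can find it by polling:
  Handle post_response(AtomSpace& as, const Handle& request, const Handle& answer)
  {
      Handle resp = as.add_node(PREDICATE_NODE, "response-to");
      return as.add_link(EVALUATION_LINK, resp,
                         as.add_link(LIST_LINK, request, answer));
  }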

A more performant solution would be to use ZeroMQ (or something similar, with a comparable or better performance profile) for communication between processes. Requests and responses are the easiest thing to implement this way, and doing so reduces polling of the DB.
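
A minimal sketch of the request/response direction with the plain libzmq C API is below; the endpoint name and the use of raw strings as the message format are assumptions.

  #include <cstring>
  #include <zmq.h>

  // Minimal request/response sketch with libzmq: one CogServer process acts
  // as a responder on a local endpoint; peers connect with ZMQ_REQ sockets
  // and exchange requests (here, opaque strings) without polling a database.
  int main()
  {
      void* ctx  = zmq_ctx_new();
      void* sock = zmq_socket(ctx, ZMQ_REP);
      zmq_bind(sock, "ipc:///tmp/cogserver-requests");   // endpoint name is illustrative

      char buf[1024];
      while (true) {
          int n = zmq_recv(sock, buf, sizeof(buf) - 1, 0);
          if (n < 0) break;
          if (n > (int) sizeof(buf) - 1) n = sizeof(buf) - 1;
          buf[n] = '\0';
          // ... dispatch buf to the appropriate MindAgent, build a reply ...
          const char* reply = "ok";
          zmq_send(sock, reply, std::strlen(reply), 0);
      }
      zmq_close(sock);
      zmq_ctx_destroy(ctx);
      return 0;
  }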

A next step is to use ZeroMQ to send Atoms around, or to share important Atoms using ZeroMQ's shared memory functionality, greatly reducing how often one has to go to the DB. The exact configuration of which Atoms are shared between which processes is an application design choice, and it may be a complicated one, but there's no pragmatic way around that.
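
For the Atom-sharing direction, a rough sketch (the wire format and endpoint are assumptions; here Atoms travel as s-expression strings over a PUB socket that peer CogServers subscribe to):

  #include <string>
  #include <zmq.h>

  // Sketch: publish "important" Atoms as s-expression strings so that peer
  // CogServers can subscribe to updates instead of polling the database.
  // The wire format (s-expressions) and the endpoint name are assumptions.
  void publish_atom(void* pub, const std::string& atom_sexpr)
  {
      zmq_send(pub, atom_sexpr.data(), atom_sexpr.size(), 0);
  }

  int main()
  {
      void* ctx = zmq_ctx_new();
      void* pub = zmq_socket(ctx, ZMQ_PUB);
      zmq_bind(pub, "ipc:///tmp/cogserver-atoms");

      // e.g. called whenever an Atom enters the attentional focus:
      publish_atom(pub, "(InheritanceLink (ConceptNode \"cat\") (ConceptNode \"animal\"))");

      zmq_close(pub);
      zmq_ctx_destroy(ctx);
      return 0;
  }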

Shared memory can cause polling overhead as well, so there's some tuning involved. One could think of a pool of "important and active" Atoms, i.e., the attentional focus, and assume that polling for those Atoms is worth the overhead.

Figure 4. Multiple, potentially multi-threaded, CogServers in the same machine


Multiple Atomspaces and Synchronization

I don't have any concrete ideas at the moment for this, but it's clear that we'll need to design policies for Atomspace synchronization in order to get multiple processes working on the same shared data. The PostgreSQL backing store provides one mechanism for that, and ZeroMQ shared memory provides another. But we still need to decide on which Atoms to synchronize, how often, etc. This also relates to getting ECAN to work on these larger, distributed Atomspaces.


Next Steps?

Some potentially useful next steps:

  1. Prototype and benchmark current multi-threaded performance (via multiple connections and cog-execute!).
  2. Understand in detail the changes needed to the CogServer so it can run multiple threads with long-running MindAgents that don't terminate.
  3. Understand in detail the changes needed to ECAN and any other MindAgents to remove the concept of the cognitive cycle.
  4. Prototype ZeroMQ for communication and shared memory, benchmark it, and compare with multiple threads in the same process to get an idea of the overhead.