Perceptual Pattern Mining
This section outlines an approach to perceptual pattern mining, designed for application in virtual worlds in which objects are represented in terms of polygons; but also with a view toward extensibility for more general usage.
The perceptual pattern mining problem is conceived as: Given the stream of perceptual data coming in from virtual-world, transform it into a series of predicates that are directly useful for inference in creating reasonably small inference trails that lead to results such as useful goal refinement, or useful schemata. This involves recognizing recurring patterns among perceptual data, in a manner guided by basic relationships of spatial and temporal proximity.
Modes of Visual Perception in Virtual Worlds
The most complex, high-bandwidth perceptual data coming in from a typical virtual world is visual data, and the treatment of visual data is a matter of some subtlety. There are basically three modes in which a virtual world may present visual data to OCP (or any other system):
- Object vision: OCP receives information about polygonal objects and their colors, textures and coordinates (each object is a set of contiguous polygons, and sometimes objects have "type" information, e.g. cube or sphere)
- Polygon vision: OCP receives information about polygons and their colors, textures and coordinates
- Pixel vision: OCP receives information about pixels and their colors and coordinates
In each case, coordinates may be given either in "world coordinates" or in "relative coordinates" (relative to the gaze). This distinction is not a very big deal since within an architecture like OCP, supplying schemata for coordinate transformation is trivial; and, even if treated as a machine learning task, this sort of coordinate transformation is not very difficult to learn. Our current approach is to prefer relative coordinates, as this approach is more natural in terms of modern Western human psychology; but we note that in some other cultures world coordinates are preferred and considered more psychologically natural.
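To illustrate how simple such a coordinate-transformation schema can be, here is a minimal sketch, assuming for simplicity that the gaze is described by its origin plus a single yaw angle about the vertical axis (the function name and this parameterization are illustrative assumptions, not part of the OCP design):

```python
import math

def world_to_relative(point, gaze_origin, gaze_yaw):
    """Transform a world-coordinate point into gaze-relative coordinates.

    Simplifying assumption: the gaze is given by its origin plus a yaw
    angle (radians) about the vertical z axis, so the transform is a
    translation followed by a 2D rotation. A full schema would also
    handle pitch and roll.
    """
    x, y, z = (p - o for p, o in zip(point, gaze_origin))
    c, s = math.cos(-gaze_yaw), math.sin(-gaze_yaw)
    return (c * x - s * y, s * x + c * y, z)
```

After this transform, a point directly ahead of the gaze lies on the positive x axis of the relative frame; the inverse (relative-to-world) schema is the same rotation and translation applied in reverse.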
At time of writing (July 2008) we have not done any work with pixel vision. We consider object vision to be too high-level for some purposes, and have settled on either polygon vision or a combination of polygon vision and object vision as the "right" approach for early AGI experiments in a virtual worlds context. The problem with object vision is that it removes the possibility for OCP to understand object segmentation. If, for instance, OCP perceives a person as a single object, then how can it recognize a head as a distinct sub-object? Feeding the system a pre-figured hierarchy of objects, sub-objects and so forth seems inappropriate in the context of an experiential learning system. On the other hand, the use of polygon vision instead of pixel vision seems to meet no such objections.
Beyond Generic Perceptual Pattern Mining
What is described in this section is a generic perceptual pattern mining approach for the virtual world context. However, it is not an exhaustive and complete approach to OCP perception, for two reasons:
- In order to achieve adequate computational efficiency in perceptual processing in a complex virtual-world environment, it may be necessary to complement this general approach with a handful of additional, more highly specialized perceptual pattern mining subsystems. However, these will not be discussed significantly in this chapter, except in the following section which discusses object recognition.
- The pattern mining approach described here is not intrinsically, algorithmically restricted to the virtual-world case, although it is presented in the virtual-world context. However, it is also not intended for usage in extremely high-bandwidth perception scenarios: e.g. for real-time streaming video data coming in from a camera eye. In situations like the latter, pattern mining as described here must still be done, but given currently realistic restrictions on processing speed, it should be on the output of more specialized lower-level pattern recognition algorithms. These also are not treated here, except briefly in these introductory remarks.
As an example of a specialized perceptual pattern mining subsystem, a preliminary design sketch has been made for an object recognition subsystem; this is presented in the following subsection. However, we don't consider this kind of "perception engineering" issue as extremely critical to the OCP AGI design: the key point is that the stochastic, greedy perceptual pattern mining approach is capable of integration with a variety of more specialized perceptual heuristics, which may be incorporated as necessary to maximize perceptual performance.
Regarding preprocessing to deal with high-bandwidth data, some thought has gone into how to preprocess visual data of the type obtainable by a contemporary robot to make it amenable to the kind of perceptual pattern mining process described here. One approach, for example, involves multiple phases:
- Use a camera eye and a lidar sensor in tandem, so as to avoid having to deal with stereo vision
- Using the above two inputs, create a continuous 3D contour map of the perceived visual world
- Use standard mathematical transforms to polygon-ize the 3D contour map into a large set of small polygons
- Use heuristics to merge together the small polygons, obtaining a smaller set of larger polygons (but retaining the large set of small polygons for the system to reference in cases where a high level of detail is necessary)
- Feed the polygons into the perceptual pattern mining subsystem, analogously to the polygons that come in from virtual-world
In this approach, preprocessing is used to make the system see the physical world in a manner analogous to how it sees the virtual world.
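As one very rough sketch of the polygon-merging phase above, small polygons might be grouped by the direction of their surface normals. Real mesh-simplification heuristics would also check adjacency and coplanarity; the polygon representation and angular tolerance used here are assumptions:

```python
import math
from collections import defaultdict

def merge_polygons(polygons, angle_tol_deg=10.0):
    """Heuristically group small polygons whose unit surface normals
    point in nearly the same direction, as a stand-in for the merge
    phase above. Each polygon is a (normal, vertices) pair; grouping by
    quantized normal direction is a deliberate simplification."""
    step = math.radians(angle_tol_deg)
    groups = defaultdict(list)
    for normal, verts in polygons:
        nx, ny, nz = normal
        # quantize the normal's spherical angles into angular buckets
        theta = math.atan2(ny, nx)
        phi = math.acos(max(-1.0, min(1.0, nz)))
        groups[(round(theta / step), round(phi / step))].append(verts)
    return list(groups.values())
```

Each returned group is a candidate large polygon; the original small polygons would be retained alongside it for high-detail reference, as described above.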
Knowledge Representation for Perceptual Pattern Mining
Now we proceed toward describing how perceptual pattern mining may be carried out. In this subsection we discuss the representation of knowledge, and then in the following subsection we discuss the actual mining.
First, we may assume that each sensation from virtual-world is recorded as a set of "transactions", each of which is of the form
Time, 3D coordinates, object type
Time, 3D coordinates, action type
Each transaction may also come with an additional list of (attribute, value) pairs, where the list of attributes is dependent upon the object or action type. Transactions are represented as Atoms, and don't need to be a specific Atom type — but are referred to here by the special name transactions simply to make the discussion clear.
Next, define a transaction template as a transaction with location and time information set to wild cards — and potentially, some other attributes set to wild cards. (These are implemented in terms of Atoms involving VariableNodes.)
For instance, some transaction templates in the current virtual-world might be informally represented as:
- Reward
- Red cube
- kick
- move_forward
- Cube
- Cube, size 5
- me
- Teacher
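The transaction and transaction-template notions above can be sketched as simple data structures. The field names and the WILD marker below are illustrative assumptions, not the actual Atom/VariableNode implementation:

```python
from dataclasses import dataclass, field

WILD = object()  # stands in for a VariableNode-style wild card

@dataclass
class Transaction:
    time: object            # time-stamp, or WILD in a template
    coords: object          # 3D coordinates, or WILD in a template
    kind: str               # object or action type, e.g. "cube", "kick"
    attrs: dict = field(default_factory=dict)

def matches(template, txn):
    """True if a concrete transaction fits a template: the types must
    agree, and every non-wild attribute of the template must be matched.
    Time and coordinates are ignored here, since templates set them WILD."""
    if template.kind != txn.kind:
        return False
    return all(v is WILD or txn.attrs.get(k) == v
               for k, v in template.attrs.items())
```

For example, a "Red cube" template matches any cube transaction whose color attribute is red, whatever its size, location or time-stamp.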
Basic Spatiotemporal Predicates
Next, I define here a set of "basic spatiotemporal predicates". These predicates apply to tuples of transactions, and also (by averaging) to tuples of transaction templates.
For two transactions A and B with the same time-stamp (or more generally, very close-by time-stamps), predicates such as

SimultaneousAND(A, B)

should be calculated. These are assumed to be fuzzy predicates with truth values in [0,1] for each argument list.
Also, for two transactions A and B that are very near to each other in time (judged by some threshold), the predicate

SequentialAND(A, B)

should be calculated (again, a fuzzy predicate). For convenience, in this chapter I will denote these relationships by

SimAND A B
SSeqAND A B

(where SSeqAND means short sequential AND, i.e. SeqAND with a short time lag.)
For instance if RedCubeObservation99 and TeacherObservation44 are specific transactions, then
near(RedCubeObservation99, TeacherObservation44)
SimAND(RedCubeObservation99, TeacherObservation44)
together assess the degree to which this observed red cube and the observed Teacher are near each other in space and time. On the other hand,
near(red cube, Teacher)
indicates the degree to which the red cube and the teacher are often near each other.
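A minimal sketch of such fuzzy predicates, operating on transactions' coordinate and time fields, with exponentially decaying truth values (the decay scales here are assumed parameters, not values from the design):

```python
import math

def near(coords_a, coords_b, scale=2.0):
    """Fuzzy spatial proximity in [0,1]: 1 at zero distance, decaying
    exponentially with Euclidean distance (the scale is an assumption)."""
    return math.exp(-math.dist(coords_a, coords_b) / scale)

def sim_and(time_a, time_b, time_scale=0.5):
    """Fuzzy simultaneity in [0,1]: 1 for identical time-stamps,
    decaying exponentially with the absolute time lag."""
    return math.exp(-abs(time_a - time_b) / time_scale)
```

An SSeqAND strength could be defined analogously, weighting small positive lags of B after A highly and giving zero or negative lags a strength near zero.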
Later we may want other fuzzy predicates too, such as a size predicate; e.g.

size(A, 5)

would be truer for cubes A whose size is closer to 5, and maximally true of cubes of size 5.
And perhaps later, if A and B are very close in spatial location to each other and not extremely far from each other in time, we may also want to compute a corresponding spatiotemporal proximity predicate (a different fuzzy predicate, with a different weighting on time lags).
Next we may conceive a transaction graph, whose nodes are transactions and whose links are labeled with labels like after, SimAND, SSeqAND, near, in_front_of, and so forth (and whose links are weighted as well).
We may also conceive a transaction template graph, whose nodes are transaction templates, and whose links are the same as in the transaction graph.
And finally, we may conceive a transaction template relationship graph (TTRG), whose nodes may be any of: transactions; transaction templates; basic spatiotemporal predicates evaluated at tuples of transactions or transaction templates.
Define a temporal conjunction as a conjunction involving SimultaneousAND and SequentialAND operators (including SSeqAND as a special case of SeqAND: the special case that interests us in the short term). The conjunction is therefore ordered, e.g.
A SSeqAND B SimAND C SSeqAND D
We may assume that the order of operations favors SimAND, so that no parenthesizing is necessary.
Next, define a basic spatiotemporal conjunction as a temporal conjunction that conjoins terms that are either
- transactions, or
- transaction templates, or
- basic spatiotemporal predicates applied to tuples of transactions or transaction templates
I.e. a basic spatiotemporal conjunction is a temporal conjunction of nodes from the transaction template relationship graph.
An example would be:
(hold ball) SimAND ( near(me, teacher) ) SSeqAND Reward
[This assumes that the hold action has an attribute giving the type of object held, so that (hold ball) in the above temporal conjunction is a shorthand for the transaction template specified by

action type: hold
object_held_type: ball ]
This example says that if the agent is holding the ball and is near the teacher then shortly after that, the agent will get a reward.
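One simple way to assign a fuzzy truth value to a basic spatiotemporal conjunction like the one above is a minimum (Goedel) t-norm over the terms' fuzzy values and the SimAND/SSeqAND link strengths. This is an illustrative choice for the sketch below, not the system's actual truth-value formula:

```python
def eval_conjunction(conj):
    """Evaluate an alternating sequence
        [t0, (op, s1), t1, (op, s2), t2, ...]
    where each t_i is the fuzzy truth value of a conjoined term and each
    (op, s) pair carries the strength s of a SimAND or SSeqAND link.
    The conjunction is as true as its weakest component (min t-norm)."""
    values = [conj[0]]
    for i in range(1, len(conj), 2):
        (_op, strength), term = conj[i], conj[i + 1]
        values += [strength, term]
    return min(values)

# (hold ball) SimAND near(me, teacher) SSeqAND Reward, with made-up values
strength = eval_conjunction([0.9, ("SimAND", 0.8), 0.7, ("SSeqAND", 1.0), 1.0])
```

A product t-norm would be an equally reasonable choice; the point is only that term truth values and temporal link strengths both constrain the conjunction's overall strength.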
The Mining Task
The perceptual mining task, then, is to find basic spatiotemporal conjunctions that are interesting. What constitutes interestingness is multifactorial; an interesting conjunction typically:
- involves important Atoms (e.g. Reward)
- has a high temporal cohesion (i.e. the strength of the time relationships embodied in the SimAND and SeqAND links is high)
- has a high spatial cohesion (i.e. the near() relationships have high strength)
- has a high frequency
- has a high surprise value (its frequency is far from what would be predicted by its component sub-conjunctions)
Note that a conjunction can be interesting without satisfying all these criteria; e.g. if it involves something important and has a high temporal cohesion, we want to find it regardless of its spatial cohesion.
For starters, for simple virtual-world experiments, defining "interestingness" as the combination of frequency and temporal cohesion should be adequate.
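As a sketch, the interestingness criteria above might be combined as a weighted sum, with only frequency and temporal cohesion weighted heavily for the simple experiments just mentioned. All weights below are illustrative guesses:

```python
def interestingness(frequency, temporal_cohesion,
                    importance=0.0, spatial_cohesion=0.0, surprise=0.0,
                    weights=(1.0, 1.0, 0.5, 0.5, 0.5)):
    """Multifactorial interestingness as a weighted sum of the criteria
    listed above. A conjunction can score highly without satisfying all
    criteria, since the factors are additive rather than conjunctive."""
    factors = (frequency, temporal_cohesion, importance,
               spatial_cohesion, surprise)
    return sum(w * f for w, f in zip(weights, factors))
```

An additive combination reflects the observation above that, e.g., a conjunction involving something important with high temporal cohesion should be found regardless of its spatial cohesion.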
A Mining Approach
One tractable approach to perceptual pattern mining is greedy and iterative, involving the following steps:
1. Build an initial transaction template graph G
2. Greedily mine some interesting basic spatiotemporal conjunctions from it, adding each interesting conjunction found as a new node in G (so that G becomes a transaction template relationship graph)
3. Repeat step 2 until boredom results or time runs out
The same TTRG may be maintained over time, but of course will require a robust forgetting mechanism once the history gets long or the environment gets nontrivially complex.
The greedy mining step may involve simply grabbing SeqAND or SimAND links with probability determined by the (importance and/or interestingness) of their targets, and the probabilistic strength and temporal strength of the temporal AND relationship, and then creating conjunctions based on these links (which then become new nodes in the TTRG, so they can be built up into larger conjunctions).
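The greedy grabbing step just described might be sketched as weight-proportional sampling over the TTRG's temporal-AND links. The link representation, and the assumption that each link's weight already folds together the targets' importance/interestingness with the probabilistic and temporal strength of the AND relationship, are simplifications for illustration:

```python
import random

def greedy_mining_step(ttrg_links, rng=None):
    """One greedy mining step over a TTRG given as a list of temporal-AND
    links (source, op, target, weight). A link is sampled with
    probability proportional to its weight, and returned as a candidate
    conjunction, which would then become a new node in the TTRG so that
    later steps can build it up into larger conjunctions."""
    rng = rng or random.Random()
    total = sum(w for _, _, _, w in ttrg_links)
    r = rng.random() * total
    for src, op, tgt, w in ttrg_links:
        r -= w
        if r <= 0:
            return (src, op, tgt)
    return None
```

Repeating this step, and re-weighting as new conjunction nodes accumulate, yields the iterative growth of the TTRG described above.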