
Identifying Objects from Polygonal Perceptions

This section describes a strategy for identifying objects in a visual scene that is perceived as a set of polygons. It is not a thoroughly detailed algorithmic approach, but rather a high-level description of how this may be done effectively within the OCP design. It is offered here largely as an illustration of how specialized perceptual data processing algorithms may be designed and implemented within the OCP framework.

We deal here with an agent whose perception of the world, at any point in time, is understood to consist of a set of polygons, each one described in terms of a list of corners. The corners may be assumed to be described in coordinates relative to the viewing eye of the agent.

What I mean by "identifying objects" here is something very simple. I don't mean identifying that a particular object is a chair, or is Ben's brown chair, or anything like that — I simply mean identifying that a given collection of polygons is meaningfully grouped into an object. That is the task considered here.

Of course, not all approaches to polygon-based vision processing would require this sort of phase: it would be possible, as an alternative, to simply compare the set of polygons in the visual field to a database of prior experience and then do object identification (in the present sense) based on this database-comparison. But on this page we have chosen not to take that route, beginning instead with an automated segmentation of the set of perceived polygons into a set of objects.

Algorithm Overview

The algorithm described here falls into three stages:

  1. Recognizing PersistentPolygonNodes (PPNodes) from PolygonNodes.
  2. Creating Adjacency Graphs from PPNodes.
  3. Clustering in the Adjacency Graph.

Each of these stages involves a bunch of details, not all of which have been fully resolved: this document just gives a conceptual overview.

I will speak in terms of objects such as PolygonNode, PPNode and so forth, because inside the OCP AI engine, observed and conceived entities are represented as nodes in a graph. However, this terminology is not very important here, and what I call a PolygonNode here could just as well be represented in a host of other ways, within the overall OCP framework.

Recognizing PersistentPolygonNodes (PPNodes) from PolygonNodes

A PolygonNode represents a polygon observed at a point in time. A PPNode represents a series of PolygonNodes that are heuristically guessed to represent the same underlying polygon at different moments in time.

Before "object permanence" is learned, the heuristics for recognizing PPNodes will only work for a persistent polygon that, over an interval of time, moves within the visual field but never leaves it. For example, a reasonable heuristic is: if P1 occurs at time t, P2 occurs at time s very close to t, and P1 and P2 are similar in shape, size, color and position, then P1 and P2 should be grouped together into the same PPNode.

More advanced heuristics would deal carefully with the case where some of these similarities did not hold, which would allow us to deal e.g. with the case where an object was rapidly changing color.
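As a concrete illustration, the grouping heuristic above might be sketched as follows. This is only a hedged sketch, not actual OCP code: the Polygon class, the specific thresholds, and the greedy frame-by-frame chaining are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Polygon:
    """One observed polygon: corners in eye-relative coordinates, plus color and time."""
    corners: list   # list of (x, y, z) tuples
    color: tuple    # e.g. (r, g, b)
    time: float

def centroid(poly):
    """Mean of the polygon's corners, as a crude position summary."""
    n = len(poly.corners)
    return tuple(sum(c[i] for c in poly.corners) / n for i in range(3))

def similar(p1, p2, max_dt=0.1, max_dist=0.5, max_color=0.2):
    """Heuristic from the text: group p1 and p2 into the same PPNode if they
    are close in time, position, shape (corner count as a crude proxy), and color."""
    if abs(p1.time - p2.time) > max_dt:
        return False
    if len(p1.corners) != len(p2.corners):
        return False
    c1, c2 = centroid(p1), centroid(p2)
    if sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5 > max_dist:
        return False
    if max(abs(a - b) for a, b in zip(p1.color, p2.color)) > max_color:
        return False
    return True

def group_ppnodes(frames):
    """Greedily chain polygons across successive frames into PPNode tracks.
    Each track is a list of Polygons guessed to be the same persistent polygon."""
    tracks = []
    for frame in frames:
        for poly in frame:
            for track in tracks:
                if similar(track[-1], poly):
                    track.append(poly)
                    break
            else:
                tracks.append([poly])
    return tracks
```

A real implementation would likely want smarter matching than this greedy chaining (e.g. solving a per-frame assignment problem), but the shape of the heuristic is the same.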

In the case where the polygons come from a simulated world like OpenSim, then from our position as programmers and world-masters we can see that a PPNode is supposed to correspond to a certain side of a certain OpenSim object; but it doesn't appear that way to OCP when controlling an agent in OpenSim, since OCP isn't perceiving OpenSim objects, it's perceiving polygons. On the other hand, in the case where the polygons come from software that postprocesses the output of a Lidar-based vision system, the piecing together of PPNodes from PolygonNodes is genuinely necessary.

Creating Adjacency Graphs from PPNodes

Having identified PPNodes, we may then draw a graph between PPNodes (called an "Adjacency Graph"), wherein the links are AdjacencyLinks (with weights indicating the degree to which the two PPNodes tend to be adjacent, over time). [Note for later: A more refined graph might also involve SpatialCoordinationLinks (with weights indicating the degree to which the vector between the centroids of the two PPNodes tends to be consistent over time).]

We may then use this graph to do object identification:

  • First-level objects may be defined as clusters in the graph of PPNodes.
  • One may also make a graph between first-level objects, an ObjectGraph with the same kinds of links as in the PPGraph. Second-level objects may be defined as clusters in the ObjectGraph.

The "strength" of an identified object may be assigned as the "quality" of the cluster, measured in terms of how tight the cluster is and how well separated it is from other clusters.

As an example, consider a robot with two parts: a body and a head. The whole robot may have a moderate strength as a first-level object, but the head and the body individually will have significantly greater strengths as first-level objects. On the other hand, the whole robot should have a fairly high strength as a second-level object.
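One plausible way to operationalize this notion of cluster strength, sketched here as an assumption rather than a fixed design choice, is to compare mean within-cluster adjacency against mean adjacency to the rest of the graph. Tight, well-separated clusters then score near 1:

```python
def cluster_strength(cluster, adjacency, all_nodes):
    """Hypothetical cluster-quality score in [0, 1]: mean within-cluster
    adjacency divided by (itself plus the mean adjacency from cluster
    members to nodes outside the cluster)."""
    inside = [adjacency(a, b) for a in cluster for b in cluster if a != b]
    outside_nodes = [n for n in all_nodes if n not in cluster]
    outside = [adjacency(a, b) for a in cluster for b in outside_nodes]
    mean_in = sum(inside) / len(inside) if inside else 0.0
    mean_out = sum(outside) / len(outside) if outside else 0.0
    if mean_in + mean_out == 0:
        return 0.0
    return mean_in / (mean_in + mean_out)
```

On the robot example, the head-cluster and body-cluster would each score highly, while a cluster mixing half the head with half the torso would score poorly, matching the intuition above.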

It seems convenient (though not necessary) to introduce a PhysicalObjectNode type to represent the objects recognized via clustering; but the first-level versus second-level object distinction should definitely not be made on the Atom type level.

Building the adjacency graph requires a mathematical formula defining what it means for two PPNodes to be adjacent. Creating this formula may require a little tuning. For instance, the adjacency between two PPNodes PP1 and PP2 may be defined as the average over time of the adjacency of the PolygonNodes PP1(t) and PP2(t) observed at each time t. (A p'th power average may be used here, and different values of p may be tried.) Then, the adjacency between two (simultaneous) PolygonNodes P1 and P2 may be defined as the average over all x in P1 of the minimum over all y in P2 of sim(x,y), where sim(,) is an appropriately scaled similarity function. This latter average could arguably be made a maximum; my inclination would be to make it a p'th power average with large p, which approximates a maximum.
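The formula just described can be sketched directly. Everything here is illustrative: point_similarity is a hypothetical, appropriately scaled similarity function, the polygons are represented by their corner lists, and p=10 stands in for "a large p approximating a maximum".

```python
def pth_power_mean(values, p):
    """Generalized (power) mean; for large p this approximates the maximum."""
    n = len(values)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)

def point_similarity(x, y, scale=1.0):
    """Hypothetical similarity in (0, 1], decaying with Euclidean distance."""
    d = sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    return 1.0 / (1.0 + d / scale)

def polygon_adjacency(corners1, corners2, p=10):
    """Adjacency of two simultaneous polygons, per the text: a p'th power
    average over x in P1 of the minimum over y in P2 of sim(x, y)."""
    per_point = [min(point_similarity(x, y) for y in corners2) for x in corners1]
    return pth_power_mean(per_point, p)

def ppnode_adjacency(track1, track2, p=10):
    """Adjacency of two PPNodes: p'th power average, over the times at which
    both were observed, of the simultaneous polygon adjacencies."""
    vals = [polygon_adjacency(a, b, p) for a, b in zip(track1, track2)]
    return pth_power_mean(vals, p)
```

Here the averages range over the polygons' corners for simplicity; the text's "all x in P1" could equally be read as sampling points over the whole polygon surface, which would just change the inner loops.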

Clustering in the Adjacency Graph

As noted above, the idea is that objects correspond to clusters in the adjacency graph. This means we need to implement some hierarchical clustering algorithm that is tailored to find clusters in symmetric weighted graphs. Probably some decent algorithms of this character exist; if not, it would be fairly easy to define one, e.g. by adapting some standard hierarchical clustering algorithm to deal with graphs rather than vectors.
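A minimal version of such a graph-based clusterer, sketched here under the assumption of average-linkage agglomerative merging with a fixed threshold (both assumptions, not settled design choices), might look like:

```python
def agglomerative_graph_clustering(nodes, weight, merge_threshold=0.5):
    """Average-linkage agglomerative clustering driven by graph edge weights
    rather than vector distances. Repeatedly merges the pair of clusters with
    the highest average cross-cluster weight, stopping when no pair exceeds
    merge_threshold. Returns the final partition; recording the sequence of
    merges would yield the full hierarchy."""
    clusters = [[n] for n in nodes]
    while len(clusters) > 1:
        best_pair, best_w = None, merge_threshold
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cross = [weight(a, b) for a in clusters[i] for b in clusters[j]]
                w = sum(cross) / len(cross)
                if w > best_w:
                    best_pair, best_w = (i, j), w
        if best_pair is None:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

This naive version recomputes all pairwise linkage weights on every merge, which is fine for small scenes; a production version would cache the linkage matrix and update it incrementally.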

Clusters will then be mapped into OpenSIMObjectNodes, interlinked appropriately via PhysicalPartLinks and AdjacencyLinks. (E.g., there would be a PhysicalPartLink between the OpenSIMObjectNode representing a head and the OpenSIMObjectNode representing a body, where the body is considered as including the head.)


It seems probable that, for simple scenes consisting of a small number of simple objects, clustering for object recognition will be straightforward. However, two cases seem potentially tricky:

  • Sub-objects: e.g. the head and torso of a body, which may move separately; or the nose of the head, which may wiggle; or the legs of a walking dog; etc.
  • Coordinated objects: e.g. if a character's hat is first on a table and later on his head, then while it's on his head we basically want to consider him and his hat as a single object, for some purposes.

These examples show that partitioning a scene into objects is a borderline-cognitive rather than purely lower-level-perceptual task, which cannot be hard-wired in any very simple way.

I also note that, for complex scenes, clustering may not work perfectly for object recognition, and some reasoning may be needed to aid the process. Intuitively, such scenes may correspond to the ones that, in human perceptual psychology, require conscious attention and focus in order to be accurately and usefully perceived.