Clustering

From OpenCog
Jump to: navigation, search

This page, initially written by Ben Goertzel in Dec. 2014, summarizes a direction for getting clustering initially working in OpenCog as a heuristic strategy for concept formation.

See alto the page Agglomerative Clustering in Atomspace using the URE which outlines a different approach to clustering in OpenCog.

The basic idea is simple:

  • project Atoms into n-dimensional vectors
  • apply existing clustering algorithms on n-dimensional vectors
  • create new Atoms embodying these clusters

This will be generically useful, but is initially motivated by the requirements of Unsupervised Language Learning Experimentation.

Background: Dimensional Embedding

The strategy proposed here makes use of the dimensional embedding module described conceptually in

http://wiki.opencog.org/w/OpenCogPrime:DimensionalEmbedding

and with code here

https://github.com/opencog/opencog/tree/master/opencog/learning/dimensionalembedding

The code could use some optimization in the way it chooses initial pivot nodes, but this isn't a huge problem to solve (this just requires replacing Dijkstra's algorithm for pivot choice w/ something faster and more heuristic. Maybe just replacing Dijkstra's with A* would be a good idea.).

It would also be nice to have code estimating the correlation between distance in the embedding space and distance in the AtomSpace.

Background: Clustering algorithms

There are many available clustering algorithms available in various machine learning toolkits. A subtlety however is that we don't really want batch clustering algorithms here, except perhaps for one-off research experiments. What we really want here are clustering algorithm implementations that can maintain a clustering ongoingly, and then incrementally update this clustering as new data is obtained (or as more of the existing data is incrementally considered). That is, in the lingo, we want "online" clustering algorithms. Most available clustering code is written to run in batch mode.

One source of a few decent online clustering algorithms is the MOA toolkit

http://moa.cms.waikato.ac.nz/details/stream-clustering/

This contains online versions of oldie-but-goodie algorithms like K-means (partitioning) and Cobweb (hierarchical), plus some more modern methods, accessible in Java via a nice API.

Connecting MOA to do clustering on an embedding space reflecting an Atomspace, where the embedding space is updated in quasi-real-time as the Atomspace changes, would be a good start toward getting Atomspace clustering work.

We might end up wanting to implement clustering directly on Atoms at some point, rather than relying on a vector-based approach. But for initial experimentation it seems nicer to try a vector-based approach, as it lets us rely on existing well-optimized code and associated visualization widgets.

A more interesting clustering algorithm is this one

http://webuser.unicas.it/fontanella/papers/WSC04.pdf

which uses a combination of genetic programming (to learn rules characterizing clusters) and centroid-based clustering. It could be implemented using MOSES to work with the AtomSpace. But that would be a significant undertaking.

Implementation Sketch

We can create

  • ClusterNode as a subclass of ConceptNode.
  • ClusteringNode as a new Node type
  • A MemberLink from a ClusterNode to the ClusteringNode representing the clustering that it's part of

Then we can create a MindAgent performing clustering. The MindAgent will create a ClusteringNode and retain a reference to that ClusteringNode, and then iteratively update the clustering associated with that ClusteringNode.

There should also be an EmbeddingAgent, that periodically updates the dimensional embedding pivot Atoms, thus changing the embedding based on which the ClusteringAgent does its work.

Actually there may be many EmbeddingAgents and ClusteringAgents operative in a given OpenCog at a given time. When creating an agent of one of these types, one should specify what Atom types it is trying to embed or cluster, and based on what type of links. If an EmbeddingAgent is asked to embed based on ConceptNodes and their SimilarityLinks, then it should create and maintain and embedding space structure labeled with the tag "ConceptNode; SimilarityLink" (note that it could also be something like "ConceptNode, WordNode; SimilarityLink, InheritanceLink", or whatever....). If a ClusteringAgent is asked to cluster based on ConceptNodes and their SimilarityLinks, then it will immediately look for an embedding space structure with the tag "ConceptNode; SimilarityLink" , and throw an error if it doesn't find one.