Dimensional Embedding
This chapter discusses a particular mathematical mechanism that is useful for several purposes: the embedding of Atoms from the OpenCog AtomSpace into n-dimensional space.
In general SMEPH terms, this provides a tool for transforming SMEPH hypergraphs into sets of points in dimensional space. For most purposes the hypergraph representation is more insightful, but a dimensional representation sometimes has advantages, owing to the massive body of sophisticated mathematical and computational machinery associated with numerical vectors.
In practical OCP terms, this embedding has applications to PLN inference control, and to the guidance of instance generation in PEL learning of Combo trees. It is also, in itself, a valuable and interesting heuristic for sculpting the link topology of an OCP AtomSpace. The basic dimensional embedding algorithm described here is not original to OCP, but it has not previously been applied in this context.
The role of dimensional embedding in OCP is interesting conceptually as well as technically. The human brain's neural net is embedded in 3D space in a way that OCP's is not. In many ways, the unconstrained geometry of OCP's Atom network is superior. But there are some cases (e.g. PLN control, and PEL guidance) where dimensional geometry provides a useful heuristic for constraining a huge search space, by providing a compact way of storing a large amount of information. Dimensionally embedding Atoms lets OCP be dimensional like the brain when it needs to be, yet retain the freedom of nondimensionality the rest of the time. This dual strategy may be of value for AGI designs generally, beyond OCP.
There is an obvious way to project OCP or SMEPH Atoms into n-dimensional space: assign each Atom a numerical vector based on the weights of its links. But this is not terribly practical, because the vectors obtained in this way will live, potentially, in a space of millions or billions of dimensions. What we are doing here is a bit different. We are defining more specific embeddings, each one based on a particular link type or set of link types, and we are embedding into a space whose dimensionality is high but not too high, e.g. n=50.
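To make the contrast concrete, here is a minimal Python sketch of that obvious projection. The get_strength accessor is hypothetical, standing in for whatever mechanism retrieves the weight of the link joining two Atoms (returning 0 when no such link exists); the point is that the dimensionality of the output equals the number of Atoms.

    def naive_embedding(atoms, get_strength):
        # Each Atom maps to the vector of its link strengths to every
        # other Atom. The vector length is len(atoms) - 1, which for a
        # realistic AtomSpace could be millions of coordinates.
        return {a: [get_strength(a, b) for b in atoms if b is not a]
                for a in atoms}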
The philosophy underlying the ideas proposed here is similar to that underlying Principal Components Analysis (PCA) in statistics (Jolliffe, 2002). The n-dimensional spaces we define here, like those used in PCA or Latent Semantic Indexing (LSI), are defined by sets of orthogonal concepts extracted from the original space of concepts. The difference is that PCA and LSI work on spaces of entities defined by feature vectors, whereas the methods described here work for entities defined as nodes in weighted graphs. There is no precise notion of orthogonality for nodes in a weighted graph, but one can introduce a reasonable proxy.
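As an illustration of what such a proxy might look like (an assumption for this sketch, not necessarily the construction developed later in the chapter), one can treat two nodes as nearly orthogonal when their similarity is low, and greedily select pivot nodes that are maximally dissimilar from one another. Here sim is a hypothetical function returning a similarity strength in [0, 1].

    def choose_pivots(atoms, sim, n):
        atoms = list(atoms)
        # Seed the pivot set with a well-connected atom (one arbitrary
        # choice among several reasonable ones).
        pivots = [max(atoms, key=lambda a: sum(sim(a, b) for b in atoms))]
        while len(pivots) < n:
            candidates = [a for a in atoms if a not in pivots]
            # Add the candidate whose maximum similarity to any existing
            # pivot is smallest -- the most "orthogonal" remaining atom.
            pivots.append(min(candidates,
                              key=lambda a: max(sim(a, p) for p in pivots)))
        return pivots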
Link-Based Dimensional Embedding
In this section we define the type of dimensional embedding we will be talking about. For concreteness we will speak in terms of OCP nodes and links, but the discussion applies equally well to SMEPH vertices and edges, and in fact even more generally than that.
A link-based dimensional embedding is defined as a mapping from a set of OCP Atoms into points in an n-dimensional real space, formed by:
- mapping link strength into coordinate values in an embedding space, and
- representing nodes as points in this embedding space, using the coordinate values defined by the strengths of their links.
In the usual case, a dimensional embedding is formed from links of a single type, or from links whose types are very closely related (e.g. from all symmetric logical links).
Mapping all the strengths of the links of a given type into coordinate values in a dimensional space is a simple but not very effective strategy. The approach described here is instead based on strategically choosing a subset of particular links and forming coordinate values from them. The choice of links is guided by the desire for a correspondence between the metric structure of the embedding space and the metric structure implicit in the weights of the links of the type being embedded.
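Continuing the earlier sketch, one simple way to realize this strategy (under the same hypothetical sim function and pivot atoms as above) is to use an Atom's similarity to each of n chosen pivot atoms as its n coordinates; Atoms that relate to the pivots in similar ways then land near each other in the embedding space.

    def embed(atoms, pivots, sim):
        # proj(A) = ( sim(A, p1), ..., sim(A, pn) ), one coordinate
        # per pivot atom.
        return {a: [sim(a, p) for p in pivots] for a in atoms}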
More formally, let proj(A) denote the point in R^n corresponding to the Atom A. Then if, for example, we are doing an embedding based on SimilarityLinks, we want there to be a strong inverse correlation between:
(SimilarityLink A B).tv.s
and
d_E( proj(A), proj(B) )
where d_E denotes the Euclidean distance in the embedding space. This is a simple case because SimilarityLink is symmetric. Dealing with asymmetric links like InheritanceLinks is a little subtler, and will be addressed below in the context of inference control.
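The desired correspondence can be checked empirically. The following sketch (again using the hypothetical sim function and the proj mapping from the sketches above) computes the Pearson correlation between similarity strengths and embedding distances over a sample of Atom pairs; for a good embedding this value should be strongly negative, since high similarity should coincide with small distance.

    import math

    def euclidean(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

    def embedding_quality(pairs, proj, sim):
        # Pearson correlation between (SimilarityLink A B).tv.s and
        # d_E(proj(A), proj(B)) over the sampled pairs.
        sims = [sim(a, b) for a, b in pairs]
        dists = [euclidean(proj[a], proj[b]) for a, b in pairs]
        n = len(pairs)
        mean_s, mean_d = sum(sims) / n, sum(dists) / n
        cov = sum((s - mean_s) * (d - mean_d) for s, d in zip(sims, dists))
        norm_s = math.sqrt(sum((s - mean_s) ** 2 for s in sims))
        norm_d = math.sqrt(sum((d - mean_d) ** 2 for d in dists))
        return cov / (norm_s * norm_d)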
A higher-dimensional embedding space generally allows greater correlation, but at the cost of greater complexity. If one chooses the dimensionality equal to the number of nodes in the graph, there is really no point in doing the embedding. On the other hand, if one tries to project a huge and complex graph into 1 or 2 dimensions, one is bound to lose a lot of important structure. The most useful embedding will be into a space whose dimensionality is large but not too large.
For internal OCP inference purposes, we should generally use a moderately high-dimensional embedding space, say n=50 or n=100.