Feature selection

From OpenCog

In regression analysis, one has, in general, a table of values, or features, and the goal is to predict one feature, the target (located in one column of the table), based on the values of the other features (in other columns). Commonly, only some of the features are useful predictors of the target, while the others are mostly just noise. Regression analysis becomes difficult if there are too many input variables, and so regression can be improved simply by filtering out the useless features. Doing this is generically called feature selection or dimensional reduction. OpenCog includes a program, feature-selection, that does this, using one of several different algorithms.

In general, feature selection is done by looking for a correlation between the target feature, and other features. This correlation can be computed in a number of different ways; the feature-selection program uses mutual information. A strong correlation between the featureset and the target variable suggests that the featureset is the appropriate one for performing regression analysis.

It can happen that two (or more) features are strongly correlated with each other, and that only one of these is needed as a predictor. Adding any one of these features to the featureset will improve the correlation considerably, but adding any of the others will have little or no effect. Such sets of features are called redundant. A good selection algorithm will select only one of the features of the redundant set, omitting the others.

For OpenCog, feature selection is used as a front-end to the MOSES machine learning algorithm. It has proved to be convenient to maintain a feature selection utility tailored to the needs of MOSES. Certainly, it should be possible to use other open-source dimensional-reduction systems, such as InfoSel++ or mRMR.


Three distinct algorithms are implemented. Two are loosely similar to the "Minimum Redundancy Maximum Relevance" algorithm; a third uses MOSES itself.

Mutual Information

An important ingredient in all of the algorithms is the mutual information between the features and the target variable.

Given a target variable Y and the input features X1, ..., Xn, the mutual information MI(Y;Xi) between a single feature Xi and Y is defined as

MI(Y;Xi) = H(Y) + H(Xi) - H(Y, Xi)

where H(Y, Xi) is the joint entropy of Y and Xi.

For the purposes of these algorithms, the joint mutual information between Y and a subset X1, ..., Xk is given by

MI(Y;X1, ..., Xk) = H(Y) + H(X1,...,Xk ) - H(Y, X1, ..., Xk)
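These definitions can be sketched directly from counts over discrete (e.g. boolean) columns. The helper functions and data below are a hypothetical illustration of the formulas above, not the actual feature-selection code:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def mi(y, x):
    """MI(Y;X) = H(Y) + H(X) - H(Y,X)."""
    return entropy(y) + entropy(x) - entropy(list(zip(y, x)))

def joint_mi(y, columns):
    """MI(Y;X1,...,Xk) = H(Y) + H(X1,...,Xk) - H(Y,X1,...,Xk)."""
    rows = list(zip(*columns))  # each row is the tuple (x1,...,xk)
    return entropy(y) + entropy(rows) - entropy(list(zip(y, rows)))

# Hypothetical boolean columns: x1 mirrors the target, x2 is noise.
y  = [0, 0, 1, 1, 1, 0, 1, 0]
x1 = [0, 0, 1, 1, 1, 0, 1, 0]
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
print(mi(y, x1))              # 1.0 bit: x1 fully determines y
print(mi(y, x2))              # small (~0.19): mostly noise
print(joint_mi(y, [x1, x2]))  # 1.0: x2 adds nothing once x1 is in
```

Note that the joint entropy is just the entropy of the rows of the sub-table, so the same entropy helper handles both formulas.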

Incremental algo

The so-called "incremental" algorithm, as implemented in feature-selection, computes the mutual information MI(Y;Xi) between the target Y and single features Xi, adding them, one at a time, to the featureset when the MI value is above a given threshold. It then removes redundant features, one at a time. Here, a redundant feature is one whose removal from the set X1, ..., Xk causes the joint mutual information MI(Y;X1, ..., Xk) to decrease by less than fraction*threshold. That is, its failure to appreciably reduce the joint mutual information is taken to be a sign of its redundancy.
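The add-then-prune loop above can be sketched as follows. The parameter names (threshold, fraction) mirror the text; the entropy and MI helpers are illustrative reimplementations, not the actual feature-selection code:

```python
from collections import Counter
from math import log2

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def joint_mi(y, columns):
    """MI(Y;X1,...,Xk) = H(Y) + H(X1,...,Xk) - H(Y,X1,...,Xk)."""
    rows = list(zip(*columns))
    return entropy(y) + entropy(rows) - entropy(list(zip(y, rows)))

def incremental_select(y, features, threshold, fraction=0.5):
    # Step 1: add each feature whose individual MI clears the threshold.
    selected = {name: col for name, col in features.items()
                if joint_mi(y, [col]) > threshold}
    # Step 2: remove redundant features one at a time. A feature is
    # redundant if dropping it lowers the joint MI by less than
    # fraction * threshold.
    for name in list(selected):
        rest = [col for other, col in selected.items() if other != name]
        if not rest:
            continue
        drop = joint_mi(y, list(selected.values())) - joint_mi(y, rest)
        if drop < fraction * threshold:
            del selected[name]
    return set(selected)

y = [0, 0, 1, 1, 1, 0, 1, 0]
cols = {"x1": list(y), "x2": list(y),        # redundant copies of y
        "x3": [0, 1, 0, 1, 0, 1, 0, 1]}      # mostly noise
print(incremental_select(y, cols, threshold=0.5))  # one of x1/x2 survives
```

In the example, x1 and x2 are exact copies, so the prune step keeps exactly one of them, while x3 never clears the threshold in the first place.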

The incremental algorithm can be instructed to consider pairs MI(Y;Xi, Xj), or even higher orders, when adding to the output feature set. However, this rapidly becomes computationally expensive, as there is a combinatorial explosion of these. Computation time goes as O(M^K) for M input features, at order K. More precisely, there are (M choose K) sets, and one may approximate (M choose K) = O(M^K) for small K.
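To make this growth concrete, the subset counts for a hypothetical M = 100 input features can be tabulated with the standard library:

```python
# Number of size-K feature subsets for M = 100 inputs: (M choose K)
# grows roughly as M^K for small K.
from math import comb

M = 100
for K in (1, 2, 3):
    print(K, comb(M, K))
# 1 100
# 2 4950
# 3 161700
```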

The incremental algo will typically find a feature set somewhere near to the "optimal" maximum-MI feature set. One can do better.

Maximum MI algo

The so-called "maximum MI" algorithm makes the assumption that the overall landscape of joint mutual information MI(Y;X1, ..., Xk) is not "deceptive" as a function of the subset X1, ..., Xk, and that any local maxima are not very tall, as compared to adding a handful of Xi to the feature set. If these assumptions are satisfied (and they seem to be, for most common datasets), then a simple, straightforward hill-climbing-style algorithm can be employed.

In this algorithm, the mutual information MI(Y;Xi) is first computed for each of the M different features Xi in the input. Of these, the N highest scorers are kept; the rest are discarded. Next, to each of the kept sets, a second feature from the input set is added, and the resulting M×N scores MI(Y;Xi, Xj) are evaluated. Of these, the N highest are kept; the rest are discarded. The process is repeated for the M×N triples MI(Y;Xi, Xj, Xk), and so on, until either the MI fails to increase (by a step size of threshold) or the desired number of features has been reached.

By default, N=100. In essence, the hill-climbing assumption holds if local maxima can be avoided by considering any of 100 different high-scoring paths around them. This seems to be a reasonable assumption.
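The keep-N-and-extend search above amounts to a beam search over feature subsets. The following is a hedged sketch of that idea, with n_keep playing the role of N; the names and the MI helpers are illustrative, not the actual feature-selection code:

```python
from collections import Counter
from math import log2

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def joint_mi(y, columns):
    """MI(Y;X1,...,Xk) = H(Y) + H(X1,...,Xk) - H(Y,X1,...,Xk)."""
    rows = list(zip(*columns))
    return entropy(y) + entropy(rows) - entropy(list(zip(y, rows)))

def max_mi_select(y, features, n_keep=100, threshold=1e-6):
    """Beam search: keep the n_keep best subsets at each size."""
    beam = [(0.0, frozenset())]
    best_score, best_set = 0.0, frozenset()
    while True:
        # Extend every kept subset by each feature not already in it.
        candidates = {}
        for _, fset in beam:
            for name in features:
                if name not in fset:
                    new = fset | {name}
                    if new not in candidates:
                        candidates[new] = joint_mi(
                            y, [features[f] for f in new])
        if not candidates:
            break                      # every feature already used
        ranked = sorted(candidates.items(), key=lambda kv: -kv[1])
        beam = [(score, fset) for fset, score in ranked[:n_keep]]
        top_score, top_set = beam[0]
        if top_score - best_score < threshold:
            break                      # MI stopped increasing
        best_score, best_set = top_score, top_set
    return best_set, best_score

y = [0, 0, 1, 1, 1, 0, 1, 0]
feats = {"x1": list(y),                    # mirrors the target
         "x2": [0, 1, 0, 1, 0, 1, 0, 1]}   # noise
best, score = max_mi_select(y, feats, n_keep=10)
print(sorted(best), score)                 # ['x1'] 1.0
```

The search stops as soon as adding another feature fails to raise the best joint MI by the step size, which is why the noise column x2 is never admitted here.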

Note that this algo will find the maximal-MI feature set (given the assumptions). It avoids the combinatorial explosion of the "incremental" algorithm; it has a somewhat longer running time than the incremental algorithm at K=1, but is typically much faster than the incremental algorithm for K > 2.

MOSES-based algorithm

MOSES-based feature selection keeps the Xi's that maximize

ScoreY(Xi)

or, for tuples,

ScoreY(Xi1, ..., Xim)

The algo starts with individual variables and incrementally tries pairs, triplets, etc., among the variables that have not yet contributed to passing the threshold (or that fall below a certain threshold, in the case of the entropy-based selection).

Note: this algo is currently non-functional.

Source, Binaries, Documentation

Source code, a man page, and a diary of testing/results can be found in the directory


A standalone executable is built at


A copy of the man page is on the wiki: feature-selection man page.