# Feature selection

In regression analysis, one has, in general, a table of values, or **features**, and the goal is to predict one feature, the **target** (located in one column of the table), based on the values of the other features (in other columns). It is common that only some of the features are useful predictors of the target, while the others are mostly just noise. Regression analysis becomes difficult if there are too many input variables, and so regression can be improved simply by filtering out the useless features. Doing this is generically called **feature selection** or **dimensional reduction**. OpenCog includes a program, *feature-selection*, that does this, using one of several different algorithms.

In general, feature selection is done by looking for a correlation between the target feature and the other features. This correlation can be computed in a number of different ways; the *feature-selection* program uses mutual information. A strong correlation between the featureset and the target variable suggests that the featureset is an appropriate one for performing regression analysis.

It can happen that two (or more) features are strongly correlated with each other, and that only one of these is needed as a predictor. Adding any one of these features to the featureset will improve the correlation considerably, but adding any of the others will have little or no effect. Such sets of features are called **redundant**. A good selection algorithm will select only one of the features of the redundant set, omitting the others.

For OpenCog, feature selection is used as a front-end to the MOSES machine learning algorithm. It has proved to be convenient to maintain a feature selection utility tailored to the needs of MOSES. Certainly, it should be possible to use other open-source dimensional-reduction systems, such as InfoSel++ or mRMR.

## Algorithms

Three distinct algorithms are implemented. Two are loosely similar to the "Minimum Redundancy Maximum Relevance" algorithm; the third uses MOSES itself.

### Mutual Information

An important ingredient in all of the algorithms is the mutual information between features and the target variable.

Given a target variable *Y* and the input features *X*_{1}, ..., *X*_{n}, the mutual information *MI*(*Y*;*X*_{i}) between a single feature *X*_{i} and *Y* is defined as

*MI*(*Y*;*X*_{i}) = *H*(*Y*) + *H*(*X*_{i}) - *H*(*Y*, *X*_{i})

where *H*(*Y*, *X*_{i}) is the joint entropy of *Y* and *X*_{i}.

For the purposes of this algorithm, the joint mutual information between *Y* and a subset *X*_{1}, ..., *X*_{k} is given by

*MI*(*Y*;*X*_{1}, ..., *X*_{k}) = *H*(*Y*) + *H*(*X*_{1}, ..., *X*_{k}) - *H*(*Y*, *X*_{1}, ..., *X*_{k})
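The two formulas above can be computed directly from empirical frequencies. The following is a minimal sketch in plain Python (not the *feature-selection* code itself; the function names are illustrative), treating each feature as a column of discrete values:

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Shannon entropy (bits) of a sequence of hashable outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def mutual_information(y, xs):
    """MI(Y; X_1, ..., X_k) = H(Y) + H(X_1..X_k) - H(Y, X_1..X_k).

    `y` is a list of target values; `xs` is a list of feature
    columns, each aligned element-by-element with `y`.
    """
    joint_x = list(zip(*xs))        # rows (x1_i, ..., xk_i)
    joint_yx = list(zip(y, *xs))    # rows (y_i, x1_i, ..., xk_i)
    return entropy(y) + entropy(joint_x) - entropy(joint_yx)

# Example: x1 perfectly predicts y, x2 is pure noise.
y  = [0, 0, 1, 1]
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
print(mutual_information(y, [x1]))   # 1.0 bit
print(mutual_information(y, [x2]))   # 0.0 bits
```

For *k* = 1 this reduces to the single-feature formula above; passing several columns in `xs` gives the joint version.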

#### Incremental algo

The so-called "incremental" algorithm, as implemented in *feature-selection*, computes the mutual information *MI*(*Y*;*X*_{i}) between the target *Y* and single features *X*_{i}, adding them, one at a time, to the featureset when the MI value is above a given threshold. It then removes redundant features, one at a time. Here, a redundant feature is one whose removal from the set
*X*_{1}, ..., *X*_{k} causes the joint mutual information *MI*(*Y*;*X*_{1}, ..., *X*_{k}) to decrease by less than *fraction* × *threshold*. That is, the fact that removing it barely reduces the joint mutual information is taken to be a sign of its redundancy.
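A minimal sketch of this two-pass scheme, assuming discrete feature columns (the helper names, the default *fraction*, and the pruning order are illustrative simplifications, not the actual *feature-selection* implementation):

```python
from collections import Counter
from math import log2

def H(cols):
    """Joint Shannon entropy (bits) of one or more aligned columns."""
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())

def mi(y, xs):
    """MI(Y; X_1..X_k) = H(Y) + H(X_1..X_k) - H(Y, X_1..X_k)."""
    return H([y]) + H(xs) - H([y] + xs)

def incremental_select(y, features, threshold, fraction=0.5):
    # Pass 1: keep each feature individually informative about Y.
    selected = [i for i, x in enumerate(features) if mi(y, [x]) > threshold]
    # Pass 2: drop features whose removal barely lowers the joint MI.
    for i in list(selected):
        rest = [features[j] for j in selected if j != i]
        full = [features[j] for j in selected]
        if rest and mi(y, full) - mi(y, rest) < fraction * threshold:
            selected.remove(i)
    return selected

# Two identical informative features and one noise feature:
y  = [0, 0, 1, 1]
x1 = [0, 0, 1, 1]
x2 = [0, 0, 1, 1]   # exact copy of x1 -- redundant
x3 = [0, 1, 0, 1]   # noise
print(incremental_select(y, [x1, x2, x3], threshold=0.5))   # [1]
```

Only one copy of the redundant pair survives the pruning pass; the noise feature never enters the set at all.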

The incremental algorithm can be instructed to consider pairs *MI*(*Y*;*X*_{i}, *X*_{j}), or even higher orders, when adding to the output feature set. However, this rapidly becomes computationally expensive, as the number of such tuples explodes combinatorially. Computation time goes as *O*(*M*^{K}) for *M* input features at order *K*. More precisely, there are (*M* choose *K*) sets, and one may approximate (*M* choose *K*) = *O*(*M*^{K}) for small *K*.
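As a quick check on this growth, one can count the subsets directly with the standard library:

```python
from math import comb

# Number of K-feature subsets that must be scored for M input
# features: (M choose K), which is O(M^K) for small K.
M = 1000
for K in (1, 2, 3):
    print(K, comb(M, K))
# prints:
# 1 1000
# 2 499500
# 3 166167000
```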

The incremental algo will typically find a feature set somewhere near to the "optimal" maximum-MI feature set. One can do better.

#### Maximum MI algo

The so-called "maximum MI" algorithm makes the assumption that the overall landscape of joint mutual information *MI*(*Y*;*X*_{1}, ..., *X*_{k}) is not "deceptive" as a function of the subset *X*_{1}, ..., *X*_{k}, and that any local maxima are not very tall, as compared to adding a handful of *X*_{i} to the feature set. If these assumptions are satisfied (and they seem to be, for most common datasets), then a simple, straight-forward hill-climbing-style algorithm can be employed.

In this algorithm, the mutual information *MI*(*Y*;*X*_{i}) is computed for each of the *M* different features *X*_{i} in the input. Of these, the highest *N* scorers are kept; the rest are discarded. Next, to each of these, a second feature from the input set is added. The resulting *M*×*N* scores *MI*(*Y*;*X*_{i}, *X*_{j}) are evaluated. Of these, the highest *N* are kept; the rest are discarded. The process is repeated for the *M*×*N* triples *MI*(*Y*;*X*_{i}, *X*_{j}, *X*_{k}), and so on, until either the MI fails to increase (by a step size of *threshold*) or the desired number of features has been reached.

By default, *N*=100. In essence, the hill-climbing assumption holds if local maxima can be avoided by considering any of 100 different high-scoring paths around them. This seems to be a reasonable assumption.
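Under these assumptions, the search can be sketched as a beam search over feature subsets (a simplified stand-in for the actual *feature-selection* implementation; the function names and defaults are illustrative):

```python
from collections import Counter
from math import log2

def H(cols):
    """Joint Shannon entropy (bits) of one or more aligned columns."""
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())

def mi(y, idxs, features):
    """Joint MI(Y; X_i for i in idxs), per the formula above."""
    xs = [features[i] for i in idxs]
    return H([y]) + H(xs) - H([y] + xs)

def max_mi_select(y, features, n_keep=100, threshold=0.0, max_size=None):
    """Beam search: keep the n_keep best subsets at each size, grow
    each by one feature, and stop when the best joint MI no longer
    improves by more than `threshold` (or max_size is reached)."""
    m = len(features)
    # Size-1 subsets, ranked by MI(Y; X_i); keep the top n_keep.
    beam = sorted((frozenset([i]) for i in range(m)),
                  key=lambda s: mi(y, s, features), reverse=True)[:n_keep]
    best = beam[0]
    best_score = mi(y, best, features)
    while max_size is None or len(best) < max_size:
        # Grow every kept subset by one as-yet-unused feature.
        grown = {s | {j} for s in beam for j in range(m) if j not in s}
        if not grown:
            break
        beam = sorted(grown, key=lambda s: mi(y, s, features),
                      reverse=True)[:n_keep]
        top_score = mi(y, beam[0], features)
        if top_score - best_score <= threshold:
            break  # MI failed to increase by the step size
        best, best_score = beam[0], top_score
    return sorted(best)

# XOR example: neither feature is individually informative,
# but the pair {x1, x2} determines y exactly.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
x3 = [0, 0, 0, 1]   # weakly correlated noise
y  = [a ^ b for a, b in zip(x1, x2)]
print(max_mi_select(y, [x1, x2, x3], n_keep=10))   # [0, 1]
```

The XOR pair is recovered even though each member scores zero on its own, because all singletons survive into the second round when *N* is large enough; this is exactly the kind of local trap the wide beam is meant to step around.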

Note that this algo will find the maximal-MI feature set (given the assumptions). It avoids the combinatorial explosion of the "incremental" algorithm; it will have a somewhat longer running time than the *K*=1 incremental algorithm, but will typically be much faster than the incremental algorithm with *K* > 2.

### MOSES-based algorithm

MOSES-based feature selection keeps the *X*_{i}'s that maximize

*Score*_{Y}(*X*_{i})

or, for tuples,

*Score*_{Y}(*X*_{i1}, ..., *X*_{im})

The algo starts with individual variables and incrementally tries pairs, triplets, etc., among the variables that have not contributed to passing the threshold (or that fall below a certain threshold, in the case of the entropy-based selection).

Note: this algo is currently non-functional.

## Source, Binaries, Documentation

Source code, a man page, and a diary of testing/results can be found in the directory

opencog/learning/feature-selection

A standalone executable is built at

opencog/learning/feature-selection/main

A copy of the man page is on the wiki: feature-selection man page.