FEATURE-SELECTION(1)          OpenCog Learning          FEATURE-SELECTION(1)

NAME
       feature-selection - dimensional reduction of datasets

SYNOPSIS
       feature-selection -h|--help

       feature-selection --version

       feature-selection -i filename [-a algo] [-C num_features]
       [-T threshold] [-r random_seed] [-u target_feature] [-j jobs]
       [-F log_file] [-L] [-l loglevel] [-o output_file] [-D fraction]
       [-E tolerance] [-U num_terms]

DESCRIPTION
       feature-selection is a command-line utility for performing
       dimensional reduction of large datasets, so as to prepare them for a
       later regression analysis.  It is intended to be used to filter data
       before it is fit with the moses machine-learning program, but it can
       be used in more general settings.

EXAMPLES
       feature-selection -iraw_data.csv -ammi -C10 -oreduced.csv -Ffs.log
              Use the file raw_data.csv as the input dataset.  The input
              file must be a tab-separated or comma-separated table of
              values.  The dependent feature is assumed to be in the first
              column.  The mmi algorithm will be used to select the 10
              features that are most strongly correlated with the dependent
              feature.  The output will be written to the file reduced.csv,
              consisting of the dependent feature in the first column,
              followed by the 10 most predictive columns.  A log of
              algorithm progress will be written to fs.log.

       feature-selection -iraw_data.csv -ammi -C10 -oreduced.csv -uQ3
              As above, but use the column labelled Q3 as the target
              feature.  This assumes that the input dataset contains column
              labels as the first row of the dataset.  The selected column
              labels are reproduced in the output file.

OVERVIEW
       In regression analysis, one has, in general, a table of values or
       'features', and the goal of predicting one feature, the 'target'
       (located in one column of the table), based on the values of the
       other features (in other columns).  It is common that only some of
       the features are useful predictors of the target, while the others
       are mostly just noise.
       Regression analysis becomes difficult if there are too many input
       variables, and so regression can be improved simply by filtering out
       the useless features.  Doing this is generically called 'feature
       selection' or 'dimensional reduction'.  This program,
       feature-selection, does this, using one of several different
       algorithms.

       In general, feature selection is done by looking for a correlation
       between the target feature and the other features.  This correlation
       can be computed in a number of different ways; this program uses
       mutual information.  A strong correlation between a featureset and
       the target variable suggests that the featureset is an appropriate
       one for performing regression analysis.

       It can happen that two (or more) features are strongly correlated
       with each other, and that only one of them is needed as a predictor.
       Adding any one of these features to the featureset will improve the
       correlation considerably, but adding any of the others will have
       little or no effect.  Such sets of features are called 'redundant'.
       A good selection algorithm will select only one of the features of a
       redundant set, omitting the others.

       This program accepts inputs in a fairly generic whitespace- or
       comma-delimited table format: whitespace may consist of repeated
       tabs or blanks; values may be comma-separated and padded with
       blanks.  The input file may contain comment lines; these must start
       with a hash mark (#), exclamation mark (!) or semicolon (;) as the
       first character.  Boolean values may be indicated with any of the
       characters 0,f,F and 1,t,T.  Column values must be either boolean or
       numeric; at this time, non-numeric data, including dates or columns
       containing currency symbols, is not supported.

OPTIONS
       Options fall into three classes: options for controlling input and
       output, options for controlling algorithm behavior, and general
       options.

   General options
       -h, --help
              Print command options and exit.
       -j num, --jobs=num
              Allocate num threads for feature selection.  Using multiple
              threads on multi-core processors will speed the feature
              selection search.

       --version
              Print the program version and exit.

   Input and output control options
       These options control how input and output are managed by the
       program.

       -F filename, --log-file=filename
              Write debug log traces to filename.  If not specified, traces
              are written to feature-selection.log.  If the -L option is
              specified, this will be used as the base of the filename.

       -i filename, --input-file=filename
              The filename specifies the input data file.  The input table
              must be in 'delimiter-separated value' (DSV) format.  Valid
              separators are comma (CSV, or comma-separated values), blanks
              and tabs (whitespace).  Columns correspond to features; there
              is one sample per (non-blank) row.  Comment characters are
              hash, bang and semicolon (#, !, ;); lines starting with a
              comment character are ignored.  The -i flag may be specified
              multiple times, to indicate multiple input files.  All files
              must have the same number of columns.

       -L, --log-file-dep-opt
              Write debug log traces to a filename constructed from the
              log file specified with the -F option, used as a prefix, and
              the other option flags and their values.  The filename will
              be truncated to a maximum of 255 characters.

       -l loglevel, --log-level=loglevel
              Specify the level of detail for debug logging.  Possible
              values for loglevel are NONE, ERROR, WARN, INFO, DEBUG and
              FINE.  Case does not matter.  Caution: excessive logging
              detail can slow the program down.

       -o filename, --output-file=filename
              Write results to filename.  If not specified, results are
              written to stdout.

       -u label, --target-feature=label
              Use the column labelled label as the target feature to fit.
              If none is specified, then the first column is used.  The
              very first row of the input file, if it contains non-numeric,
              non-boolean values, is interpreted as column labels, as is
              common practice for CSV/DSV file formats.
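       The mutual-information score that the OVERVIEW and the algorithm
       descriptions refer to, and the greedy "add one feature at a time"
       selection built on it, can be sketched as follows.  This is an
       illustrative Python sketch only, not the program's actual (C++)
       implementation; all names in it are hypothetical.

```python
# Illustrative sketch: greedy forward feature selection driven by
# mutual information (MI), in the spirit of the selection this page
# describes.  Not the program's implementation; names are hypothetical.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits, from paired samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_features(table, target, max_features, threshold=0.0):
    """Repeatedly add the column whose addition most improves the MI of
    the chosen featureset with the target; stop when the improvement no
    longer exceeds the threshold, or when max_features is reached."""
    chosen, best_mi = [], 0.0
    remaining = set(table)
    while remaining and len(chosen) < max_features:
        def joint_mi(col):
            # MI between the joint (chosen + candidate) tuple and target.
            rows = list(zip(*(table[c] for c in chosen + [col])))
            return mutual_information(rows, target)
        best = max(remaining, key=joint_mi)
        gain = joint_mi(best) - best_mi
        if gain <= threshold:
            break          # MI score stopped improving enough
        chosen.append(best)
        best_mi += gain
        remaining.discard(best)
    return chosen

target = [1, 1, 0, 0, 1, 0, 1, 0]
table = {
    "good":  [1, 1, 0, 0, 1, 0, 1, 0],   # perfectly predicts the target
    "dup":   [1, 1, 0, 0, 1, 0, 1, 0],   # redundant copy of "good"
    "noise": [0, 1, 0, 1, 1, 0, 0, 1],   # uninformative
}
# Only one of the two redundant columns is kept; "noise" is dropped.
print(select_features(table, target, max_features=2, threshold=0.01))
```

       Note how the redundant pair behaves exactly as described in the
       OVERVIEW: adding either of "good" or "dup" raises the MI to its
       maximum, so adding the second yields no further gain and the
       threshold test rejects it.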
   Algorithm control options
       These options provide overall control over the algorithm execution.
       The most important of these, for controlling behavior, are the -a,
       -C and -T flags.

       -a algorithm, --algo=algorithm
              Select the algorithm to use for feature selection.  Available
              algorithms include:

              mmi    Maximal Mutual Information.

                     This algorithm searches for the featureset with the
                     highest mutual information (MI) with regard to the
                     target variable.  It does so by adding one feature at
                     a time to the featureset, computing the MI between the
                     target variable and this featureset, ranking the
                     result, and keeping only the highest-ranked results.
                     It can be thought of as a kind of hill-climbing in the
                     space of mutual information.  This process is repeated
                     until the desired number of features is found, or
                     until the MI score stops improving.

                     The maximum number of desired features must be
                     specified with the -C option.  The -T option can be
                     used to specify the minimum desired improvement in the
                     MI score; that is, the algorithm keeps adding features
                     to the featureset as long as the improvement in the MI
                     score exceeds this threshold.

                     Features are added in random order, so that if there
                     are redundant features, which one is added depends on
                     the random seed given with the -r option.  Two
                     features are considered redundant if they are highly
                     correlated, so that adding either one of the two may
                     improve the MI a lot, but adding the second will not.
                     Thus, only one is really needed; using the -T option
                     helps eliminate redundant features.

              inc    Incremental, Non-Redundant Mutual Information.

                     Builds a featureset by incrementally adding features
                     with the highest mutual information with regard to the
                     target.  Features are accepted only if their mutual
                     information is above a specified threshold.  Features
                     are rejected if they appear to be redundant with
                     others: that is, if, by their presence, they fail to
                     change the total mutual information by more than a
                     minimum amount.
                     One may specify either the number of features to be
                     selected, or a general "pressure" to automatically
                     modulate the number of features found.  That is, one
                     must specify either the -C or the -T flag, as
                     otherwise, all features will be selected.

              hc     MOSES Hill-climbing.  Currently unsupported.

       -C num_features, --target-size=num_features
              Attempt to select at most num_features out of the input set.
              This option is ignored unless the chosen algorithm is mmi or
              inc.  When the selected algorithm is inc, then the -T option
              is ignored.

       -T threshold, --threshold=threshold
              Apply a floating-point threshold for selecting a feature.  A
              positive value prevents low-scoring features from being
              selected; a value of zero or less will accept all features.
              This option is ignored unless the chosen algorithm is mmi or
              inc.

       -r seed, --random-seed=seed
              Use seed as the seed value for the pseudo-random number
              generator.  The various algorithms use the random number
              generator in different ways.  The mmi algorithm explicitly
              shuffles features, so that if the dataset contains multiple
              redundant features, one will be chosen randomly.

   Incremental algorithm options
       These options apply only to the -ainc algorithm.

       -D fraction, --inc-redundant-intensity=fraction
              Threshold fraction used to reject redundant features.  If a
              feature contributes less than fraction * threshold to the
              total score, it will be rejected from the final featureset.
              That is, if two features are strongly correlated, one should
              be considered redundant; which one is de-selected will depend
              on the random-number generator, i.e. on the random seed
              specified with the -r option.

       -E tolerance, --inc-target-size-epsilon=tolerance
              To be used only with the -C option.  The incremental
              algorithm is not able to directly select a fixed number of
              features; rather, it dynamically adjusts the threshold until
              the desired number of features results.  This option controls
              the smallest adjustment made.
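       The threshold adjustment described under -E is not specified in
       detail on this page.  The following Python sketch assumes a simple
       halving-step search; the adjustment scheme and the helper
       count_features (which stands in for running the incremental
       selection at a given threshold and counting the surviving features)
       are hypothetical.

```python
# Hypothetical sketch of dynamic threshold adjustment: shrink the step
# by half each round, and never step by less than epsilon (cf. -E).
# The real program's scheme may differ; this only illustrates the idea.
def adjust_threshold(count_features, target_size, epsilon, t=1.0):
    """Raise or lower threshold t until the selection at t yields
    target_size features, stopping once the step falls below epsilon."""
    step = t / 2
    while step >= epsilon:
        n = count_features(t)
        if n == target_size:
            break
        # Too many features selected: raise the bar; too few: lower it.
        t = t + step if n > target_size else t - step
        step /= 2
    return t

# Toy stand-in: pretend feature i scores 1/(i+1) for i = 0..9, and the
# count at threshold t is simply the number of scores exceeding t.
scores = [1 / (i + 1) for i in range(10)]
def count(t):
    return sum(s > t for s in scores)

t = adjust_threshold(count, target_size=4, epsilon=1e-6)
print(count(t))   # 4
```

       A coarser epsilon makes the search terminate sooner but may stop
       before exactly target_size features are reached, which is why -E is
       described as controlling the smallest adjustment made.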
       -U num_terms, --inc-interaction-terms=num_terms
              The number of variables used in computing the joint entropy.
              Normally, this algorithm never computes the joint entropy of
              multiple features; it only considers the effect of a single
              feature at a time on the target (that is, it only computes
              the mutual information between one feature and the target).
              Specifying a number greater than one will consider the mutual
              information between multiple features and the target.  Note
              that this calculation is combinatorially more expensive, as
              all possible choices are considered.

TODO
       Document the MOSES algorithm and the options that it takes: -A -c -f
       -m -O -s.  These are not documented because the hill-climbing
       algorithm is currently not supported.

SEE ALSO
       More information is available at
       http://wiki.opencog.org/w/Feature_selection

AUTHORS
       feature-selection was written by Nil Geisweiller and modified by
       Linas Vepstas.

       This manual page was written by Linas Vepstas.

3.0.12                           May 2, 2012            FEATURE-SELECTION(1)