Feature-selection man page

Man page for feature selection (raw dump).

FEATURE-SELECTION(1)           OpenCog Learning           FEATURE-SELECTION(1)



NAME
       feature-selection - dimensional reduction of datasets

SYNOPSIS
       feature-selection -h|--help
       feature-selection --version
       feature-selection  -i  filename [-a algo] [-C num_features] [-T thresh‐
       old] [-r random_seed] [-u target_feature] [-j jobs] [-F log_file]  [-L]
       [-l  loglevel]  [-o  output_file]  [-D  fraction]  [-E  tolerance]  [-U
       num_terms]

DESCRIPTION
       feature-selection is a command-line utility for performing  dimensional
       reduction  of large datasets, so as to prepare them for a later regres‐
       sion analysis.  It is intended to be used to filter data before  it  is
       fit with the moses machine-learning program, but it can be used in more
       general settings.


EXAMPLES
       feature-selection -iraw_data.csv -ammi -C10 -oreduced.csv -Ffs.log

              Use the input file raw_data.csv as the input dataset.  The input
              file must be a tab-separated or comma-separated table of values.
              The dependent feature is assumed to be in the first column.  The
              mmi  algorithm will be used to select 10 different features that
              are the most strongly correlated  with  the  dependent  feature.
              The  output  will be written to the file reduced.csv, consisting
              of the dependent feature, in the first column, followed  by  the
              10  most  predictive  columns.   A logfile of algorithm progress
              will be written to fs.log.


       feature-selection -iraw_data.csv -ammi -C10 -oreduced.csv -uQ3

              As above, but use the column labelled Q3 as the target  feature.
              This  assumes  that the input dataset contains column  labels as
              the first row in the dataset.  The selected  column  labels  are
              reproduced in the output file.


OVERVIEW
       In regression analysis, one has, in general, a table of values or
       'features', and the goal is to predict one feature, the 'target'
       (located in one column of the table), based on the values of the
       other features (in other columns).  It is common that only some of
       the features are useful predictors of the target, while the others
       are mostly just noise.  Regression analysis becomes difficult if
       there are too many input variables, and so regression can be
       improved simply by filtering out the useless features.  Doing this
       is generically called 'feature selection' or 'dimensional
       reduction'.  This program, feature-selection, does this, using one
       of several different algorithms.

       In general, feature selection is done  by  looking  for  a  correlation
       between  the  target feature, and other features.  This correlation can
       be computed in a number of different ways;  this  program  uses  mutual
       information.   A strong correlation between the featureset and the tar‐
       get variable suggests that the featureset is the  appropriate  one  for
       performing regression analysis.
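
       As a rough illustration of the quantity involved, the following
       Python sketch estimates the mutual information between two columns
       of discrete values from their empirical frequencies.  It is not
       taken from the feature-selection source; the function name
       mutual_information and the sample columns are invented for this
       example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Estimate I(X;Y), in bits, from two equal-length discrete columns."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical sample data: one perfect predictor and one unrelated column.
target = [0, 0, 1, 1, 0, 1, 0, 1]
predictor = list(target)
noise = [0, 1, 0, 1, 1, 0, 0, 1]

mi_predictor = mutual_information(predictor, target)
mi_noise = mutual_information(noise, target)
```

       With this sample data, the predictor column carries one full bit of
       information about the target, while the noise column carries none.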

       It can happen that two (or more) features are strongly correlated
       with each other, and that only one of these is needed as a
       predictor.  Adding any one of these features to the featureset will
       improve the correlation considerably, but adding any of the others
       will have little or no effect.  Such sets of features are called
       'redundant'.  A good selection algorithm will select only one of the
       features of the redundant set, omitting the others.

       This  program  accepts  inputs  in a fairly generic whitespace or comma
       delimited table format: whitespace may  consist  of  repeated  tabs  or
       blanks;  values  may  be  comma-separated  but padded with blanks.  The
       input file may contain comment lines; these lines  must  start  with  a
       hash-mark  (#),  exclamation  mark  (!)  or semi-colon (;) as the first
       character.

       Boolean values may be indicated with any of the characters 0, f, F
       and 1, t, T.  Column values must be either boolean or numeric; at
       this time, non-numeric data, including dates and columns containing
       currency symbols, is not supported.
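
       The input conventions described above can be sketched in Python as
       follows.  This is only an approximation written for illustration,
       not the parser used by feature-selection: the names split_row,
       parse_value and parse_table are hypothetical, and the sketch decides
       the type of each token separately, whereas a real reader would
       presumably decide per column.

```python
import re

COMMENT_CHARS = "#!;"
TRUE_TOKENS = {"1", "t", "T"}
FALSE_TOKENS = {"0", "f", "F"}


def split_row(line):
    """Split one row on commas (blank padding allowed) or on runs of blanks/tabs."""
    line = line.strip()
    if "," in line:
        return [tok.strip() for tok in line.split(",")]
    return re.split(r"[ \t]+", line)


def parse_value(tok):
    """Map a token to a bool or a float (assumption: 0 and 1 are read as booleans)."""
    if tok in TRUE_TOKENS:
        return True
    if tok in FALSE_TOKENS:
        return False
    return float(tok)  # raises ValueError on unsupported data such as dates


def parse_table(text):
    """Return the data rows of a delimited table, skipping blanks and comments."""
    rows = []
    for line in text.splitlines():
        # A comment line has #, ! or ; as its very first character.
        if not line.strip() or line[0] in COMMENT_CHARS:
            continue
        rows.append([parse_value(tok) for tok in split_row(line)])
    return rows


sample = "# a comment\n! another\n1, 0.5 , T\n\n0\t\t2.5  f\n; last comment\n"
rows = parse_table(sample)
```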


OPTIONS
       Options fall into three classes: options for controlling input and out‐
       put, options  for  controlling  the  algorithm  behavior,  and  general
       options.


   General options
       -h, --help
              Print command options and exit.

       -j num, --jobs=num
              Allocate  num  threads  for  feature  selection.  Using multiple
              threads on multi-core processors will speed the  feature  selec‐
              tion search.


       --version
              Print program version, and exit.


   Input and output control options
       These options control how the input and output are managed by the
       program.


       -F filename, --log-file=filename
              Write debug log traces to filename. If not specified, traces are
              written  to  feature-selection.log.   If the -L option is speci‐
              fied, this will be used as the base of the filename.

       -i filename, --input-file=filename
              The filename specifies the input data file. The input table must
              be  in  'delimiter-separated value' (DSV) format.  Valid separa‐
              tors are comma (CSV, or comma-separated values), blanks and tabs
              (whitespace).  Columns correspond to features; there is one
              sample per (non-blank) row.  Comment characters are hash, bang
              and semicolon (#, ! and ;); lines starting with one of these
              characters are ignored.  The -i flag may be specified multiple
              times to indicate multiple input files.  All files must have
              the same number of columns.

       -L, --log-file-dep-opt
              Write debug log traces to a filename constructed from the
              logfile specified with the -F option, used as a prefix, and
              the other option flags and their values.  The filename will be
              truncated to a maximum of 255 characters.

       -l loglevel, --log-level=loglevel
              Specify the level of detail for debug logging.  Possible  values
              for  loglevel are NONE, ERROR, WARN, INFO, DEBUG, and FINE. Case
              does not matter.  Caution: excessive logging detail can lead  to
              program slowdown.

       -o filename, --output-file=filename
              Write results to filename. If not specified, results are written
              to stdout.

       -u label, --target-feature=label
              The label is used as the target feature  to  fit.   If  none  is
              specified, then the first column is used.  The very first row of
              the input file, if it contains non-numeric, non-boolean  values,
              is  interpreted  as column labels, as is the common practice for
              CSV/DSV file formats.
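
              The header rule can be pictured with the Python sketch below.
              It is an assumption-laden illustration rather than the
              program's own logic; the helper names is_value, split_header
              and target_index are invented here, and rows are assumed to be
              lists of already-split string tokens.

```python
BOOLEAN_TOKENS = {"0", "1", "t", "T", "f", "F"}


def is_value(tok):
    """True if the token looks like a boolean or a number."""
    if tok in BOOLEAN_TOKENS:
        return True
    try:
        float(tok)
        return True
    except ValueError:
        return False


def split_header(rows):
    """Return (labels, data_rows); labels is None if the first row looks like data."""
    first = rows[0]
    if any(not is_value(tok) for tok in first):
        return first, rows[1:]
    return None, rows


def target_index(labels, target_label=None):
    """Column index of the target: the first column unless a label is given."""
    if target_label is None:
        return 0
    if labels is None:
        raise ValueError("a target label was given, but the input has no header row")
    return labels.index(target_label)


table = [["out", "Q1", "Q3"], ["1", "0.2", "7"], ["0", "0.9", "3"]]
labels, data = split_header(table)
q3_column = target_index(labels, "Q3")
default_column = target_index(labels)
```

              One consequence of such a rule is that a label consisting only
              of a character like t or f would be mistaken for a boolean
              value, so descriptive labels are the safer choice.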

   Algorithm control options
       These options provide overall control  over  the  algorithm  execution.
       The  most  important of these, for controlling behavior, are the -a, -C
       and -T flags.


       -a algorithm, --algo=algorithm
              Select the algorithm to use for  feature  selection.   Available
              algorithms include:

              mmi   Maximal Mutual Information.

                    This  algorithm searches for the featureset with the high‐
                    est mutual information (MI)  with  regard  to  the  target
                    variable.   It  does so by adding one feature at a time to
                    the featureset, computing the MI between the target  vari‐
                    able  and this featureset, ranking the result, and keeping
                     only the highest-ranked results.  It can be thought of as
                     a kind of hill climbing in the space of mutual
                     information.  This process is repeated until the desired
                     number of features is found, or until the MI score stops
                     improving.

                    The maximum number of desired features must  be  specified
                    with  the -C option.  The -T option can be used to specify
                    the minimum desired improvement in the MI score.  That is,
                    the  algorithm  keeps  adding  features to the feature set
                    until the improvement in the MI score does not exceed this
                    threshold.  Features are added in random order, so that if
                    there are redundant features,  only  one  will  be  added,
                    depending on the random seed given with the -r option.

                    Two  features  are considered redundant if they are highly
                    correlated, so that adding  either  one  of  the  two  may
                    improve  MI  a lot, but adding the second will not.  Thus,
                    only one is really needed; using the -T option helps elim‐
                    inate redundant features.

              inc   Incremental, Non-Redundant Mutual Information.

                    Builds  a featureset by incrementally adding features with
                    the highest mutual information with regard to the  target.
                    Features  are  accepted  only if the mutual information is
                    above a specified threshold. Features are rejected if they
                    appear  to be redundant with others: that is, if, by their
                    presence, they fail to change the total mutual information
                    by more than a minimum amount.

                    One  may  specify  either  the  number  of  features to be
                    selected, or one may specify a general "pressure" to auto‐
                    matically modulate the number of features found.  That is,
                    one must specify either the -C or the -T flag,  as  other‐
                    wise, all features will be selected.


              hc    MOSES Hillclimbing. Currently unsupported.
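
              The greedy search described for mmi can be sketched in Python
              as shown below.  This is a simplified reading of the
              description, not the actual implementation: it keeps only the
              single best candidate at each step, and the names greedy_mmi
              and joint_column, as well as the sample data, are
              hypothetical.  The max_features and threshold arguments play
              the roles of -C and -T, and seed that of -r.

```python
import random
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information, in bits, between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def joint_column(columns, names):
    """Combine the named feature columns into one column of tuples."""
    return list(zip(*(columns[name] for name in names)))


def greedy_mmi(columns, target, max_features, threshold=0.0, seed=0):
    rng = random.Random(seed)
    candidates = list(columns)
    rng.shuffle(candidates)  # random order decides which redundant feature wins
    selected, best_mi = [], 0.0
    while candidates and len(selected) < max_features:
        scored = [(mutual_information(joint_column(columns, selected + [c]), target), c)
                  for c in candidates]
        # max() returns the first maximal entry, so ties follow the shuffled order
        mi, choice = max(scored, key=lambda pair: pair[0])
        if mi - best_mi <= threshold:
            break  # improvement does not exceed the threshold: stop
        selected.append(choice)
        candidates.remove(choice)
        best_mi = mi
    return selected, best_mi


# Hypothetical data: "a" and "b" are redundant copies of the target.
target = [0, 0, 1, 1, 0, 1, 0, 1]
columns = {"noise": [0, 1, 0, 1, 1, 0, 0, 1], "a": list(target), "b": list(target)}
selected, score = greedy_mmi(columns, target, max_features=2, threshold=0.01)
```

              In this example only one of the two redundant features is
              selected, because adding the second one does not improve the
              score by more than the threshold.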

       -C num_features, --target-size=num_features
              Attempt to select at most num_features out of the input set.

              This  option  is  ignored  unless the chosen algorithm is mmi or
              inc.  When the selected algorithm is inc, then the -T option  is
              ignored.


       -T threshold, --threshold=threshold
              Apply  a  floating-point  threshold  for selecting a feature.  A
              positive  value  prevents  low-scoring   features   from   being
              selected; a value of zero or less will accept all features.

              This  option  is  ignored  unless the chosen algorithm is mmi or
              inc.


       -r seed, --random-seed=seed
              Use seed as the seed value for the pseudo-random number  genera‐
              tor.   The various algorithms use the random number generator in
              different ways.  The mmi algorithm explicitly shuffles features,
              so that if the dataset contains multiple redundant features, one
              will be chosen randomly.


   Incremental algorithm options
       These options only apply to the -ainc algorithm.


       -D fraction, --inc-redundant-intensity=fraction
              Threshold fraction used to reject redundant features. If a  fea‐
              ture  contributes  less  than  fraction * threshold to the total
              score, it will be rejected from the final feature set.  That is,
              if two features are strongly correlated, one of them is treated
              as redundant; which one is de-selected depends on the
              random-number generator, i.e. on the random seed specified with
              the -r option.
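
              One possible reading of how the threshold and the redundancy
              fraction interact is sketched below in Python.  The two-pass
              structure, the function names incremental_select and joint_mi,
              and the sample data are assumptions made for illustration;
              the actual inc algorithm may differ in its details.

```python
import random
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information, in bits, between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def joint_mi(columns, names, target):
    """MI between the target and the named features taken jointly."""
    if not names:
        return 0.0
    joint = list(zip(*(columns[name] for name in names)))
    return mutual_information(joint, target)


def incremental_select(columns, target, threshold, fraction, seed=0):
    rng = random.Random(seed)
    names = list(columns)
    rng.shuffle(names)  # the seed decides which redundant feature is dropped
    # Pass 1: keep features whose own MI with the target exceeds the threshold.
    selected = [name for name in names
                if mutual_information(columns[name], target) > threshold]
    # Pass 2: drop a feature if removing it costs less than fraction * threshold.
    for name in list(selected):
        rest = [s for s in selected if s != name]
        if not rest:
            break
        contribution = (joint_mi(columns, selected, target)
                        - joint_mi(columns, rest, target))
        if contribution < fraction * threshold:
            selected = rest
    return selected


# Hypothetical data: "a" and "b" are redundant copies of the target.
target = [0, 0, 1, 1, 0, 1, 0, 1]
columns = {"a": list(target), "b": list(target), "noise": [0, 1, 0, 1, 1, 0, 0, 1]}
kept = incremental_select(columns, target, threshold=0.1, fraction=0.5)
```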


       -E tolerance, --inc-target-size-epsilon=tolerance
              To be used only with the -C option.  The  incremental  algorithm
              is  not  able  to  directly  select  a fixed number of features;
              rather, it dynamically adjusts the threshold until  the  desired
              number  of  features  results. This option controls the smallest
              adjustment made.
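
              The idea of adjusting a threshold until the desired number of
              features results can be illustrated with a bisection search.
              The man page does not say which search strategy the program
              uses, so bisection is an assumption here, and the function
              find_threshold and the score values are invented for this
              example.  The tolerance argument stands in for the value
              given with -E.

```python
def find_threshold(scores, target_size, tolerance):
    """Bisect a threshold so that target_size scores exceed it, if possible."""
    low, high = 0.0, max(scores)

    def count_above(threshold):
        return sum(1 for s in scores if s > threshold)

    # Stop once a further adjustment would be smaller than the tolerance.
    while high - low > tolerance:
        mid = (low + high) / 2
        n = count_above(mid)
        if n == target_size:
            return mid
        if n > target_size:
            low = mid   # too many features pass: raise the threshold
        else:
            high = mid  # too few features pass: lower the threshold
    return (low + high) / 2


# Hypothetical per-feature scores.
scores = [0.9, 0.7, 0.4, 0.35, 0.1, 0.05]
th = find_threshold(scores, target_size=3, tolerance=0.001)
n_selected = sum(1 for s in scores if s > th)
```

              When several features have identical scores, no threshold may
              yield exactly the requested count; in this sketch the
              tolerance then bounds how long the search continues.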


       -U num_terms, --inc-interaction-terms=num_terms
              The number of variables used in  computing  the  joint  entropy.
              Normally,  this  algorithm  never  computes the joint entropy of
              multiple features; it only considers the effect of a single fea‐
              ture  at  a  time  on  the target (that is, it only computes the
              mutual information between one feature and the target).
              Specifying a number greater than one will consider the mutual
              information between multiple features and the target.  Note
              that the cost of this calculation grows combinatorially with
              num_terms, as all possible choices are considered.
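
              To give a feel for this growth, the short Python sketch below
              counts the feature subsets of a given size.  It assumes that
              subsets of exactly num_terms features are examined, which the
              man page does not state precisely; the feature names are
              placeholders.

```python
from itertools import combinations
from math import comb

# Hypothetical dataset with 100 candidate features.
features = ["f%d" % i for i in range(100)]

# Number of subsets to evaluate for 1, 2 and 3 interaction terms.
subset_counts = {k: comb(len(features), k) for k in (1, 2, 3)}

# The subsets themselves can be enumerated lazily, here for four features.
pairs = list(combinations(features[:4], 2))
```

              For 100 features this gives 100 single features, 4950 pairs
              and 161700 triples.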

TODO
       Document the MOSES-algorithm and the options that it takes: -A -c -f -m
       -O -s.  These are not documented because the hill-climbing algo is cur‐
       rently not supported.


SEE ALSO
       More          information           is           available           at
       http://wiki.opencog.org/w/Feature_selection

AUTHORS
       feature-selection was written by Nil Geisweiller and modified by
       Linas Vepstas.

       This manual page is being written by Linas Vepstas.



3.0.12                            May 2, 2012             FEATURE-SELECTION(1)