MOSES man page

Man page for MOSES (raw dump).

MOSES(1)                         OpenCog Learning                         MOSES(1)

NAME
       moses - meta-optimizing semantic evolutionary search solver

SYNOPSIS
       moses -h|--help
       moses --version
       moses   [-H   problem_type]   [-i   input_filename]   [-u   target_feature]
       [--enable-fs=1] [-Y ignore_feature] [-m max_evals] [--max-time seconds] [-f
       logfile]  [-l loglevel] [-r random_seed] [--mpi=1] [-n ignore_operator] [-N
       use_only_operator] [-o output_file] [--boost] [-A max_score] [-a algo]  [-B
       knob_effort]  [-b nsamples] [-C1] [-c result_count] [-D max_dist] [-d1] [-E
       reduct_effort] [-e exemplar] [-F] [-g max_gens] [-I0] [-j jobs]  [-k  prob‐
       lem_size]   [-L1]   [-P   pop_size_ratio]   [-p  probability]  [--python=1]
        [--python3=1] [-Q hardness] [-q min] [-R threshold] [-S0] [-s cache_size]
       [--score-weight  column]  [-T1] [-t1] [-V1] [-v temperature] [-W1] [-w max]
       [--well-enough=1] [-x1] [-y combo_program] [-Z1] [-z complexity_ratio]

DESCRIPTION
       moses is the command-line wrapper to the MOSES program learning library. It
       may  be  used to solve learning tasks specified with file inputs, or to run
       various demonstration problems.  Given an input table of  values,  it  will
       perform a regression, generating, as output, a combo program that best fits
       the data. moses is multi-threaded, and can be distributed  across  multiple
       machines to improve run-time performance.

EXAMPLE
       As an example, the input file of comma-separated values:

        a,b,out
        0,0,0
        0,1,0
        1,0,0
        1,1,1

       when  run  with  the  flags -Hit -u out -W1 will generate the combo program
       and($a $b). This program indicates that the third column can be modelled as
       the  boolean-and  of the first two.  The -u option specifies that it is the
       third column that should be modelled, and the -W1 flag indicates that  col‐
       umn labels, rather than column numbers, should be printed on output.
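
       For reference, the complete command for this example might look like
       the following, assuming the table above has been saved in a file named
       input.csv (the file name is illustrative):

            moses -Hit -i input.csv -u out -W1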

OVERVIEW
       moses  implements  program learning by using a meta-optimization algorithm.
       That is, it uses two optimization algorithms, one wrapped inside the other,
       to find solutions.  The outer optimization algorithm selects candidate pro‐
       grams (called exemplars), and then creates similar programs  by  taking  an
       exemplar  and inserting variables (called knobs) in selected locations. The
       inner optimization algorithm then searches for the best possible 'knob set‐
       tings',  returning  a set of programs (a deme) that provide the best fit to
       the data. The outer optimization algorithm then selects new candidates, and
       performs  a simplification step, reducing these to a normal form, to create
       new exemplars.  The process is then  repeated  until  a  perfect  score  is
       reached, a maximum number of cycles is performed, or forward progress is no
       longer being made.

       Program reduction is performed at two distinct stages: after inserting  new
       knobs,  and  when  candidates are selected.  Reduction takes a program, and
       transforms it into a new form, the so-called elegant  normal  form,  which,
       roughly  speaking, expresses the program in the smallest, most compact form
       possible.  Reduction is an important step of the algorithm, as it  prevents
       the  evolved programs from turning into 'spaghetti code', avoiding needless
       complexity that can cost during program evaluation.

       The operation of moses can be modified by using a large variety of  options
       to  control  different  parts of the above meta-algorithm.  The inner opti‐
       mization step can be  done  using  one  of  several  different  algorithms,
       including  hillclimbing,  simulated annealing, or univariate Bayesian opti‐
       mization.  The amount of effort spent in the inner loop can be limited,  as
       compared  to  how  frequently new demes and exemplars are chosen. The total
       number of demes or exemplars maintained  in  the  population  may  be  con‐
       trolled.   The  effort expended on program normalization at each of the two
       steps can be controlled.

       In addition to controlling the operation of the algorithm, many options are
       used  to describe the format of the input data, and to control the printing
       of the output.

       moses also has a 'demo' mode: it has a number of built-in scoring functions
       for  various  'well-known'  optimization problems, such as parity, disjunc‐
       tion, and multiplex. Each of these typically presents a  different  set  of
       challenges to the optimization algorithm, and are used to illustrate 'hard'
       optimization problems.  These demo problems are useful  both  for  learning
       and understanding the use of moses, and also for developers and maintainers
       of moses itself, when testing new features and enhancements.

COMBO PROGRAMS
       This section provides a brief overview of the combo  programming  language.
       Combo is not meant so much for general use by humans for practical program‐
       ming, as it is to fit the needs of program learning.  As such, it is fairly
       minimal, while still being expressive enough so that all common programming
       constructs are simply represented in combo.  It is not difficult to convert
       combo programs into other languages, or to evaluate them directly.

       In  combo,  all  programs  are  represented as program trees. That is, each
       internal node in the tree is an operation and leaves are  either  constants
       or variables. The set of all of the variables in a program are taken as the
       arguments to the program. The program is evaluated  by  supplying  explicit
       values  for the variables, and then applying operations until the root node
       is reached.

       By convention, variables are denoted by a dollar-sign followed by a number,
       although  they  may also be named. Variables are not explicitly typed; they
       are only implicitly typed during program evaluation. Thus, for example,  or
       ($1 $2) is a combo program that represents the boolean-or of two inputs, $1
       and $2.   Built-in constants include true and false; numbers may be written
       with or without a decimal point.

       Logical operators include and, or and not.

        Arithmetic operators include +, -, * and / for addition, subtraction, multi‐
       plication and division. Additional arithmetic operators include  sin,  log,
       exp  and  rand.  The rand operator returns a value in the range from 0.0 to
       1.0.  The predicate operator 0< (greater than zero) takes a single continu‐
       ous-valued expression, and returns a boolean. The operator impulse does the
       opposite: it takes a single boolean-valued expression, and returns  0.0  or
       1.0, depending on whether the argument evaluates to false or true.

       The  cond  operator takes an alternating sequence of predicates and values,
       terminated by an 'else' value.  It evaluates as  a  chain  of  if-then-else
       statements,  so  that it returns the value following the first predicate to
       evaluate to true.
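
        For example, the expression cond( 0<($x) 1 0<($y) 2 3) evaluates to 1
        if $x is greater than zero, else to 2 if $y is greater than zero, and
        to 3 otherwise.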

       Thus, as an example, the expression and( 0<( +($x 3))  0<($y)) evaluates to
       true if $x is greater than -3 and $y is greater than zero.

       The  above  operators are built-in; combo may also be extended with custom-
       defined operators, although C++ programming and recompilation  is  required
       for this.

OPTIONS
       Options  fall  into  four classes: those for specifying inputs, controlling
       operation, printing output, and running moses in demo mode.

   General options
       -e exemplar
              Specify an initial exemplar to use.  Each  exemplar  is  written  in
              combo. May be specified multiple times.

       -h, --help
              Print command options, and exit.

       -j num, --jobs=num
              Allocate  num threads for deme optimization.  Using multiple threads
              will speed the search on a multi-core machine.  This flag  may  also
              be  used  to  specify  remote execution; see the section Distributed
              processing below.
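
               For example, to run the search with eight worker threads (the
               file and column names are illustrative):

                    moses --jobs=8 -Hit -i data.csv -u out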

       --version
              Print program version, and exit.

   Problem-type options
       moses is able to handle a variety of different  'problem  types',  such  as
       regression,  categorization  and  clustering,  as  well as a number of demo
       problems, such as parity and factorization.  The -H option is used to spec‐
       ify the problem type; the demo problem types are listed in a later section.
       This section lists the various "table" problems, where the goal is to train
       moses on an input table of values.

       -H type, --problem-type=type
              The type of problem may be one of:

              it       Regression  on  an  input  table.  That is, the input table
                       consists of a  set  of  columns,  all  but  one  considered
                       'inputs',  and  one  is  considered an output.  The goal of
                       regression is to learn a combo program that most accurately
                       predicts  the  output.   For  boolean-valued and enumerated
                       outputs, the scoring function simply counts the  number  of
                       incorrect  answers, and tries to minimize this score.  That
                       is, this scorer  attempts  to  maximize  accuracy  (defined
                       below).  For  contin-valued outputs, the mean-square varia‐
                       tion is minimized.

              pre      Regression on an input table, maximizing precision, instead
                       of  accuracy (that is, minimizing the number of false posi‐
                       tives, at the risk of sometimes failing  to  identify  true
                       positives).   Maximization is done while holding activation
                       constant.   This  scorer  is  ideal  for  learning  "domain
                       experts", which provide a result only when they are certain
                       that the result is correct.  So, for a  binary  classifier,
                       the  output is meant to be interpreted as "yes, certain" or
                       "no, don't know".  An ensemble of such combo expressions is
                        commonly called a "mixture of experts".

              prerec   Regression  on  an input table, maximizing precision, while
                       attempting to maintain the recall (sensitivity) at or above
                       a given level.

              recall   Regression  on an input table, maximizing recall (sensitiv‐
                       ity) while attempting to maintain the precision at or above
                       a  given  level.  This scorer is most commonly used when it
                       is important to guarantee a  certain  level  of  precision,
                       even  if  it  means  rejecting most events. In medicine and
                       physics/radio applications,  recall  is  exactly  the  same
                       thing  as  sensitivity:  this  option searches for the most
                       sensitive test while holding to a minimum level  of  preci‐
                       sion.

              bep      Regression  on  an  input  table, maximizing the arithmetic
                       mean of the precision and recall, also known as the "break-
                       even point" or BEP.

              f_one    Regression  on an input table, maximizing the harmonic mean
                       of the precision and recall, that is, the F_1 score.

              select   Regression on an input table, selecting  a  range  of  rows
                       from  a  continuous-valued distribution.  The range of rows
                       selected are expressed as  percentiles.  That  is,  if  the
                       input table has N rows, then rows that are selected will be
                       rows N*lower_percentile through  N*upper_percentile,  where
                       the  rows  are ordered in ascending rank of the output col‐
                       umn.  That is,  this  scorer  will  rank  (sort)  the  rows
                       according  to the output column, and then select those only
                       in the indicated range.

                        The lower and upper percentile can be specified with the -q
                       and    the    -w    options   (the   --min-rand-input   and
                       --max-rand-input options), and should lie between  0.0  and
                       1.0.

              ip       Discovery of "interesting predicates" that select rows from
                       the input table. The data table is assumed to consist of  a
                       number of boolean-valued input columns, and a contin-valued
                       (floating point) target column. moses will learn predicates
                       that  select  the  most "interesting" subset of the rows in
                       the table.   The  values  in  the  output  columns  of  the
                       selected  rows  form a probability distribution (PDF); this
                       PDF is considered to be "interesting"  if  it  maximizes  a
                        linear combination of several different measures of the
                       PDF: the Kullback-Leibler divergence, the skewness, and the
                       standardized Mann-Whitney U statistic.

              kl       Regression  on  an input table, by maximizing the Kullback-
                       Leibler divergence between the distribution of the outputs.
                       That  is,  the  output must still be well-scored, but it is
                        assumed that there are many possible maxima.

              ann-it   Regression on an  input  table,  using  a  neural  network.

       For boolean-valued data tables, the scorers make use of the following  gen‐
        erally-accepted, standard definitions:

       TP     True-positive:  the  sum  of all rows for which the combo model pre‐
              dicts T for a data-table row marked T.

       FP     False-positive: the sum of all rows for which the combo  model  pre‐
              dicts T for a data-table row marked F.

       FN     False-negative:  the  sum of all rows for which the combo model pre‐
              dicts F for a data-table row marked T.

       TN     True-negative: the sum of all rows for which the  combo  model  pre‐
              dicts F for a data-table row marked F.

       accuracy
              Defined as the formula (TP + TN) / (TP + TN + FP + FN)

       activation
              Defined as the formula (TP + FP) / (TP + TN + FP + FN)

       precision
              Defined as the formula TP / (TP + FP)

       recall Also known as sensitivity, this is defined as the formula TP / (TP +
              FN)

       F_1    Harmonic mean of precision and recall, defined as the formula  (2  *
              precision * recall) / (precision + recall) = 2TP / (2TP + FP + FN)

       bep    Break-even  point,  the  arithmetic  mean  of  precision and recall,
              defined as the formula (precision + recall) / 2
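
        As a worked example, suppose a model yields TP=40, FP=10, FN=20 and
        TN=30 on a 100-row table.  Then accuracy = 70/100 = 0.70, activation =
        50/100 = 0.50, precision = 40/50 = 0.80, recall = 40/60 = 0.67, F_1 =
        80/110 = 0.73, and bep = (0.80 + 0.67)/2 = 0.73 (the last three values
        rounded).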

   Input specification options
       These options control how input data is specified and interpreted.  In  its
        primary mode of operation, moses performs regression on a table of input
       data. One column is designated as the target,  the  remaining  columns  are
       taken as predictors.  The output of regression is a combo program that is a
       function of the predictors, reproducing the target.

       Input files should consist of ASCII data, separated  by  commas  or  white‐
       space.   The  appearance  of # ; or ! in the first column denotes a comment
       line; this line will be ignored. The first non-comment row, if it  is  also
       non-numeric, is taken to hold column labels. The target column may be spec‐
       ified using the -u option with a column name. The printing of column  names
       on output is controlled with the -W1 flag.

       -i filename, --input-file=filename
              The  filename specifies the input data file. The input table must be
              in 'delimiter-separated value' (DSV) format.  Valid  separators  are
              comma  (CSV,  or  comma-separated  values),  blanks and tabs (white‐
              space). Columns correspond to features;  there  is  one  sample  per
               (non-blank) row.  Comment characters are hash, bang and semicolon
               (#, !, ;); lines starting with a comment character are ignored.
               The -i flag may be
              specified  multiple  times,  to  indicate  multiple input files. All
              files must have the same number of columns.

       -u column, --target-feature=column
              The column is used as the target feature to fit.  If  no  column  is
              specified,  then  the  first  column  is  used.   The  column may be
              numeric, or it may be a column label.  If it is numeric, it is taken
              to  be  the number of the column, with column 1 being the left-most.
              If column begins with an alphabetic character, it is taken to  be  a
              column  label.  In this case, the first non-comment row of the input
              file must contain column labels.

       -Y column, --ignore-feature=column
              The column should be ignored, and not used as  input.   Columns  are
              specified  as  above.   This  option  may be used multiple times, to
              ignore multiple columns.
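
               For example, to fit a column labelled target while ignoring a
               bookkeeping column labelled rowid (the file and column names
               are illustrative):

                    moses -Hit -i data.csv -u target -Y rowid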

       --score-weight=column
              The column is used to weight the score for each row. If this  option
              is  not  used, then each row in the table contributes equally to the
              evaluation of the accuracy of the learned model.  However,  if  some
              rows are more important than others to get right, this option can be
              used to indicate those rows. The  accuracy  of  the  model  will  be
              weighted  by  this  number,  when evaluating the score.  A weight of
              zero effectively causes the row to be  ignored.  A  negative  weight
               encourages the system to learn models that get the row incorrect.
              For boolean problems, this is the same as flipping the output value.
              This option can only be used once, and, if used, it should specify a
              column containing an integer or floating-point value.

       -b num, --nsamples=num
              The number of samples to be taken from the input file. Valid  values
              run  between 1 and the number of rows in the data file; other values
              are ignored. If this option is absent, then all data rows are  used.
              If  this option is present, then the input table is sampled randomly
              to reach this size.

       -G1, --weighted-accuracy=1
               This option is only used with the discretize_contin_bscore scorer
               (that is, when --discretize-threshold is used).  If enabled, the
               score corresponds to a weighted accuracy, which is useful for
               unbalanced data.

       --balance=1
              If the table has discrete output type (like bool or  enum),  balance
              the resulting ctable so all classes have the same weight.

   Algorithm control options
       These  options  provide  overall control over the algorithm execution.  The
       most important of these, for controlling behavior,  are  the  -A,  -a,  -m,
       --max-time, -r, -v and -z flags.

       -a algorithm, --algo=algorithm
              Select  the  algorithm to apply to a single deme.  This is the algo‐
              rithm used in the 'inner loop': given a  single  exemplar  decorated
              with  tunable  knobs,  this algorithm searches for the best possible
              knob settings.  Once these are found (or a timeout, or other  termi‐
              nation condition is reached), control is returned to the outer opti‐
              mization loop.  Available algorithms include:

              hc   Hill-climbing. There are two primary modes of  operation;  each
                   has  strengths  and weaknesses for different problem types.  In
                   the default mode, one begins with an initial collection of knob
                    settings, called an instance. The setting of each knob is then
                   varied, in turn, until one setting is found that most  improves
                   the  score. This setting then becomes the new instance, and the
                   process is repeated, until no further improvement is seen.  The
                   resulting  instance  is  a local maximum; it is returned to the
                   outer loop.

                   The alternate mode of operation is triggered by using  the  -L1
                   flag  (usually with the -T1 flag). In this case, as before, all
                   knob settings are explored, one knob at a time.  After  finding
                   the  one  knob  that most improves the score, the algo is done,
                   and the resulting instance is returned to the outer loop. If no
                   knob settings improved the score, then all possible settings of
                   two knobs are explored, and then three, etc. until  improvement
                   is  found  (or  the allotted iterations are exceeded).  In this
                   alternate mode, the local hill  is  not  climbed  to  the  top;
                   instead,  any  improvement  is  immediately  handed back to the
                   outer loop, for another round of exemplar selection  and  knob-
                   building.   For  certain types of problems, including maximally
                   misleading problems, this can arrive at better solutions,  more
                   quickly, than the traditional hill-climbing algorithm described
                   above.

              sa   Simulated annealing.  (Deprecated). The -D  flag  controls  the
                   size  of  the  neighborhood  that is searched during the early,
                   "high-temperature" phase.  It has a significant effect  on  the
                   run-time  performance  of  the  algorithm.  Using -D2 or -D3 is
                   likely to provide the best performance.

                   The current  implementation  of  this  algorithm  has  numerous
                   faults, making it unlikely to work well for most problems.

              un   Univariate Bayesian dependency.

       -A score, --max-score=score
              Specifies  the ideal score for a desired solution; used to terminate
              search.  If the maximum number of evaluations has  not  yet  elapsed
              (set with the -m option), and a candidate solution is found that has
              at least this score, then search is terminated.

       -m num, --max-evals=num
              Perform no more  than  num  evaluations  of  the  scoring  function.
              Default value is 10000.

        --max-time=secs
              Run the optimizer for no longer than secs seconds.  Note that timing
              is polled only in a small number of points in the  algorithm;  thus,
              actual  execution  time might exceed this limit by a few seconds, or
              even many minutes, depending on the problem  type.   In  particular,
              knob-building  time  is not accounted for right away, and thus prob‐
              lems with a long knob-building time  will  exceed  this  limit.   If
              using  this  option, be sure to set the -m option to some very large
              value.  Default value is 42 years.
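
               For example, to stop after roughly one hour, with -m set large
               enough that the time limit, rather than the evaluation limit,
               terminates the run (the file name is illustrative):

                    moses -Hit -i data.csv -u out -m 100000000 --max-time=3600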

       -n oper, --ignore-operator=oper
              Exclude the operator oper from the program  solution.   This  option
              may  be used several times.  Currently, oper may be one of div, sin,
              exp, log, impulse or a variable #n.  You may need to  put  variables
               under double quotes.  This option takes priority over the -N
               option.  That is, if an operator is both included and ignored,
               then it is ignored.  This option does not work with ANN.

       --linear-regression=1
              When attempting to fit continuous-valued features, restrict searches
              to linear expressions only; that is, do not use polynomials  in  the
              fit.   Specifying this option also automatically disables the use of
              div, sin, exp and log.  Note that polynomial regression  results  in
              search spaces that grow combinatorially large in the number of input
               features; that is, for N features, a quadratic search will entail
              O(N^2)  possibilities, a cubic search will explore O(N^3) possibili‐
              ties, and so on.  Thus, for any problem with more than dozens  or  a
              hundred features, linear regression is recommended.

       -r seed, --random-seed=seed
              Use seed as the seed value for the pseudo-random number generator.

       -v temperature, --complexity-temperature=temperature
              Set  the  "temperature"  of  the Boltzmann-like distribution used to
              select the next exemplar out of the  metapopulation.  A  temperature
              that  is too high or too low will make it likely that poor exemplars
              will be chosen for exploration, thus resulting in  excessively  long
              search  times.   The recommended temperature depends strongly on the
              type of problem being solved.  If it is known that the  problem  has
              false  maxima, and that the distance from the top of the false maxi‐
              mum to the saddle separating the false and true maximum is  H,  then
               the recommended temperature is 30*H.  Here, H is the 'height' or
               difference in score from false peak to saddle, the saddle being the
               highest mountain pass between the false and true maxima. Varying
              the temperature by a factor of 2 or 3 from this value  won't  affect
              results  much.   Too  small a temperature will typically lead to the
              system getting trapped at a local maximum.

              The demo parity problem works well with a temperature of  5  whereas
               the demo Ant trail problem requires a temperature of 2000.

       -z ratio, --complexity-ratio=ratio
              Fix  the ratio of score to complexity, to be used as a penalty, when
              ranking the metapopulation for fitness. This ratio is  meant  to  be
              used  to  limit the size of the search space, and, when used with an
               appropriate temperature, to avoid getting trapped in local maxima.

              Roughly speaking, the size of the search  space  increases  exponen‐
              tially  with  the complexity of the combo trees being explored: more
              complex trees means more of them need to be examined.  However, more
              complex  trees typically result in higher scores.  If an increase of
              N bits in the complexity typically leads to an increase of s  points
              of  the score, then the complexity ratio should be set to about N/s.
               In this way, the exploration of more complex trees is penalized by an
              amount  roughly comparable to the chance that such complicated trees
              actually provide a better solution.

              The complexity ratio is used to calculate  a  scoring  penalty;  the
              penalty  lowers  the score in proportion to the solution complexity;
              specifically, the penalty is set to the complexity  divided  by  the
              complexity ratio.

              Setting  the  ratio  too low causes the algorithm to ignore the more
              complex solutions, ranking them in a way so that they are  not  much
              explored.  Thus,  the  algorithm  may get trapped examining only the
              simplest solutions, which are probably inappropriate.

              Setting this ratio too high will prevent a good solution from  being
              found.  In such cases, the algorithm will spend too much time evalu‐
               ating overly-complex solutions, blithely leaving simpler, better
              solutions unexplored.

              The  relationship between the score change and the complexity change
              is very strongly data-dependent, and must  (currently)  be  manually
              determined  (although  it  might be possible to measure it automati‐
              cally).  Input data tables with lots of  almost-duplicate  data  may
              have  very  low  ratios;  complex problems with sparse data may have
              very high ratios.  Initial recommended values would be in the  range
              from 1 to 5; with 3.5 as the default.  The parity demo problem works
              well with the 3.5 default, the Ant trail  demo  problem  works  well
              with 0.16.

       -Z1, --hc-crossover=1
              Controls  hill-climbing algorithm behavior.  If false (the default),
              then the entire local neighborhood of the current center instance is
              explored.  The  highest-scoring  instance  is then chosen as the new
              center instance, and the process is repeated.   For  many  datasets,
              however, the highest-scoring instances tend to cluster together, and
              so an exhaustive search may not be required.  When  this  option  is
              specified,  a  handful of the highest-scoring instances are crossed-
              over (in the genetic sense of cross-over) to create  new  instances.
              Only  these are evaluated for fitness; the exhaustive search step is
              skipped.  For many problem types, especially those with large neigh‐
              borhoods (i.e. those with high program complexity), this can lead to
              an order-of-magnitude speedup, or more.  For  other  problem  types,
              especially  those  with  deceptive  scoring functions, this can hurt
              performance.

       --hc-crossover-min-neighbors=num
              If crossover is enabled, and an instance is smaller than this  size,
              then  an  exhaustive  search  of the neighborhood will be performed.
              Otherwise, the search will be limited to a cross-over of the highest
              scoring  instances.   Exhaustive searches are more accurate, but can
              take much longer, when instances are large.  Recommended values  for
              this option are 100 to 500. Default is 400.

       --hc-crossover-pop-size=num
               If crossover is enabled, this controls the number of cross-over
               instances that are created.  A few hundred is normally sufficient
               to locate the actual maximum. Default value is 120.

       --boost=1
              Enables  the use of boosting. Currently, only the AdaBoost algorithm
              is implemented.  If boosting is enabled, then the system focuses  on
              learning  combo  programs  that correctly classify those values that
              were mis-classified in the previous round of learning.   The  output
              is a set of weighted combo trees.

              At this time, boosting can only be used to learn binary classifiers:
              that is, to learn problems in which the scorer can provide a  yes/no
              answer.

   Boosted dynamic feature selection
       Problems  with  a  large  number  of input features (typically, hundreds or
       more) can lead to excessively long run-times, and overwhelming  amounts  of
       memory  usage.  Such problems can be tamed in two different ways: by static
       feature pre-selection (i.e. by giving moses fewer features to work with) or
       by  reducing  the  number  of  features considered during the knob-building
       stage.

       Recall that, as moses runs, it iterates over two loops: an outer loop where
       an  exemplar is selected and decorated with knobs, and an inner loop, where
       the knobs are 'turned', searching for the best  knob-settings.   Each  knob
       corresponds  to  one feature added to a particular location in the exemplar
       tree.  A single, given feature will then be used to build a number of  dif‐
       ferent  knobs, each in a different place in the tree.  As trees get larger,
       there are more places in each tree that can be decorated with knobs;  thus,
       the  number  of knobs, and the search space, increases over time.  If there
       are M places in the tree that can be decorated, and there are  N  features,
       then  M*N  knobs  will  be created.  The search space is exponential in the
       number of knobs; so, e.g. for boolean problems,  three  knob  settings  are
       explored:  present,  absent  and  negated,  leading  to  a  search space of
       3^(M*N).  Worse, the resulting large trees also take longer to simplify and
       evaluate.   Clearly,  limiting  the  number  of knobs created can be a good
       strategy.

       This can be done with dynamic, boosted feature selection.  When enabled,  a
       feature-selection  step  is performed, to find those features that are most
        likely to improve on the exemplar.  These are the features that will be
       strongly  (anti-)correlated  with  incorrect  answers  from  the currently-
       selected exemplar.  By using only these features during knob building,  the
        total number of knobs can be sharply decreased.  The resulting size of the
       decorated tree is smaller, and the total search space is much smaller.

       Dynamic, boosted feature selection differs from static  feature  pre-selec‐
       tion  in  many  important  ways.  In dynamic selection, the total number of
       features available to moses is not reduced: just because a feature was  not
       used  in  a  given  round of knob-decoration does not mean that it won't be
       used next time.  Dynamic feature selection also resembles boosting, in that
       it  focuses  on  fixing the wrong answers of the current exemplar.  By con‐
       trast, static pre-selection  permanently  discards  features,  making  them
       unavailable  to  moses; it is also unable to predict which features are the
       most likely to be useful during iteration.

        Boosted, dynamic feature selection is enabled with the --enable-fs=1
       option.   The  number  of features to use during knob building is specified
       using the --fs-target-size option.  A number of  additional  flags  control
       the  behaviour  of  the  feature  selection  algorithm; these are best left
       alone; the defaults should be adequate for almost all  problems.   The  man
       page for the feature-selection command describes these in greater detail.

       --enable-fs=1
              Enable  integrated feature selection.  Feature selection is disabled
              by default.

       --fs-target-size=num
              Select num features for use.  This argument is mandatory if  feature
              selection is enabled.
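
               For example, to build knobs from only the twenty most promising
               features on each deme expansion (the file and column names are
               illustrative):

                    moses -Hit -i data.csv -u out --enable-fs=1 --fs-target-size=20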

       --fs-algo
              Choose  the  feature-selection algorithm.  Possible choices are sim‐
              ple, inc, smd, hc and random.  The default value is simple, which is
              the  fastest and best algo.  The simple algorithm computes the pair-
              wise mutual information (MI) between the target  and  each  feature,
              and  then  selects the list of num features with the highest MI.  It
              is strongly recommended that this algo be used.  The  smd  algorithm
              implements  "stochastic  mutual  dependency",  by  computing  the MI
              between a candidate featureset and the  target.   Starting  with  an
              initially  empty featureset, features are randomly added one-by-one,
              with the MI then computed.  Only the featureset with the highest  MI
              is  kept;  the process is then repeated until the featureset has the
              desired number of features in it, or the MI has stopped  increasing.
              Note  that the smd algorithm effectively prevents redundant features
              from being added to the featureset.  Note that smd  runs  orders  of
              magnitude  more  slowly  than  simple, and probably does not provide
               better results.  The inc algorithm is similar to the smd algo, except that it
              adds  many  features,  all  at  once,  to  the  featureset, and then
              attempts to find the redundant features in the featureset,  removing
              those. This is iteratively repeated until the desired number of fea‐
              tures is obtained.  Note that inc  runs  orders  of  magnitude  more
              slowly  than  simple,  and probably does not provide better results.
              The hc algorithm runs moses hill-climbing to discover those features
              most  likely  to appear.  The random algorithm selects features ran‐
               domly.  It is useful only for limiting the number of knobs created,
               but not otherwise slanting the choice of features.

       --fs-threshold=num
              Set the minimum threshold for selecting a feature.

       --fs-inc-redundant-intensity=fraction
              When  using the inc algorithm, set the threshold to reject redundant
              features.

       --fs-inc-target-size-epsilon=tolerance
              When using the inc algorithm, set the smallest step size used.

       --fs-inc-interaction-terms=num_terms
              When using the inc algorithm, set the number of terms used when com‐
              puting the joint entropy.

       --fs-hc-max-score
              TODO write description

       --fs-hc-confidence-penalty-intensity
              TODO write description

       --fs-hc-max-evals
              TODO write description

       --fs-hc-fraction-of-remaining
              TODO write description

   Large problem parameters
       Problems with a large number of features (100 and above) often evolve exem‐
       plars with a complexity of 100 or more, which in turn  may  have  instances
        with hundreds of thousands of knobs, and thus, hundreds of thousands of
       nearest neighbors.  Exploring one nearest neighbor requires one  evaluation
       of the scoring function, and so an exhaustive search can be prohibitive.  A
       partial search can often work quite well,  especially  when  cross-over  is
       enabled.  The following flags control such partial searches.

       Note,  however, that usually one can obtain better results by using dynamic
       feature selection, instead of using the options below to limit the  search.
       The  reason  for this is that the options below cause the search to be lim‐
       ited in a random fashion: the knobs to turn are selected randomly,  with  a
        uniform distribution, without any guidance as to whether they might make a
       difference.  By contrast, dynamic feature selection also limits the  search
       space, by creating knobs only for those features most likely to make a dif‐
        ference.  Unless the scoring function is particularly deceitful, limiting
       the  search to the likely directions should perform better than limiting it
       to a random subset.

       --hc-max-nn-evals=num
              Controls hill-climbing algorithm behavior.  When exploring the near‐
              est neighborhood of an instance, num specifies the maximum number of
              nearest neighbors to explore.  An exhaustive search of  the  nearest
              neighborhood  is  performed  when the number of nearest neighbors is
              less than this value.

       --hc-fraction-of-nn=frac
              Controls hill-climbing  algorithm  behavior.    When  exploring  the
              nearest neighborhood of an instance,  frac specifies the fraction of
              nearest neighborhood to explore.  As currently implemented, only  an
              estimate  of  the  nearest-neighborhood  size  is used, not the true
              size.  However, this estimate is accurate to within a factor  of  2.
              Thus, to obtain an exhaustive search of the entire neighborhood, set
              this to 2.0 or larger.

   Algorithm tuning options
       These options allow the operation of the algorithm  to  be  fine-tuned  for
       specific  applications.   These are "advanced" options; changing these from
       the default is likely to worsen algorithm behavior in all but certain  spe‐
       cial cases.

       -B effort, --reduct-knob-building-effort=effort
              Effort  allocated  for  reduction  during  the  knob-building stage.
              Valid values are in the range  0-3,  with  0  standing  for  minimum
              effort,  and  3  for  maximum effort. Larger efforts result in demes
              with fewer knobs, thus lowering the overall dimension of  the  prob‐
              lem.  This  can improve performance by effectively reducing the size
              of the problem.  The default effort is 2.

        -D dist, --max-dist=dist
              The maximum radius  of  the  neighborhood  around  the  exemplar  to
              explore.  The default value is 4.

       -d1, --reduce-all=1
              Reduce  candidates  before scoring evaluation. Otherwise, only domi‐
              nating candidates are  reduced,  just  before  being  added  to  the
              metapopulation.  This flag may be useful if scoring function evalua‐
               tion expense depends strongly on the structure of the candidate. It
              is  particularly  important to specify this flag when memoization is
              enabled (with -s1).

       -E effort, --reduct-candidate-effort=effort
              Effort allocated for reduction of candidates. Valid  values  are  in
              the range 0-3, with 0 standing for minimum effort, and 3 for maximum
              effort. For certain very symmetric problems, such  as  the  disjunct
              problem,  greater  reduction  can lead to significantly faster solu‐
              tion-finding.  The default effort is 2.

       -g num, --max-gens=num
              Create and optimize no more than num demes.   Negative  numbers  are
              interpreted  as  "unlimited".  By  default,  the  number of demes is
              unlimited.

       --discard-dominated=1
              If specified, the "dominated" members of the metapopulation will  be
              discarded.   A member of the metapopulation is "dominated" when some
              existing member of the metapopulation scores better on *every*  sam‐
              ple  in  the scoring dataset. Naively, one might think that an indi‐
              vidual that does worse, in every possible way, is useless,  and  can
              be  safely thrown away.  It turns out that this is a bad assumption;
              dominated individuals, when selected for deme expansion, often  have
              far  fitter off-spring than the off-spring of the top-scoring (domi‐
              nating) members of the metapopulation. Thus, the  "weak",  dominated
              members  of the metapopulation are important for ensuring the vital‐
              ity of the metapopulation as a whole, and are discarded only at con‐
              siderable risk to the future adaptability of the overall population.
              Put another way: specifying this flag makes it more likely that  the
              metapopulation will get trapped in a non-optimal local maximum.

              Note  that  the algorithms to compute domination are quite slow, and
              so keeping dominated individuals has a double benefit: not  only  is
              the  metapopulation  healthier,  but  metapopulation management runs
              faster.

       -L1, --hc-single-step=1
              Single-step, instead of hill-climbing to the top of a hill. That is,
              a  single  uphill  step  is  taken, and the resulting best demes are
              folded back into the  metapopulation.   Solving  then  continues  as
              usual. By default, the hill-climbing algorithm does not single-step;
              it instead continues to the top of the local  hill,  before  folding
              the  resulting  demes  back  into the metapopulation.  If using this
              flag, consider using the -T1 flag to allow the search to be widened,
              so  that  if  the  initial exemplar is already at the top of a local
              hill, a search is made for a different (taller) hill.

       -N oper, --include-only-operator=oper
              Include the operator oper, but  exclude  others,  in  the  solution.
              This option may be used several times to specify multiple operators.
              Currently, oper may be one of  plus,  times,  div,  sin,  exp,  log,
              impulse  or  a  variable  #n.  Note that variables and operators are
              treated separately, so that including only some operators will still
               include all variables, and including only some variables will still
               include all operators.  You may need to put variables under double
              quotes.  This option does not work with ANN.

       -P num, --pop-size-ratio=num
               Controls the amount of time spent on a deme. Default value is 20.

       -p fraction, --noise=fraction
              This  option provides an alternative means of setting the complexity
              ratio.  If specified, it over-rides the  -z  option.   For  discrete
              problems, fraction can be interpreted as being the fraction of score
              values that are incorrect (e.g. due to noisy data).  As  such,  only
              values  in  the  range  0 < fraction < 0.5 are meaningful (i.e. less
              than half of the data values are  incorrect).   Typical  recommended
              values  are  in  the  range of 0.001 to 0.05.  For continuous-valued
              problems, it can be interpreted  as  the  standard  deviation  of  a
              Gaussian noise in the dependent variable.

              For  the  discrete  problem,  the complexity ratio is related to the
              fraction  p by the explicit formula:

                  complexity_ratio = - log(p/(1-p)) / log |A|

              where |A| is the (problem-dependent) alphabet size.  See below for a
              detailed explanation.
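
               For example, with p = 0.01 and an alphabet size of |A| = 4 (the
               alphabet size here is purely illustrative), the formula gives
               complexity_ratio = -log(0.01/0.99) / log(4), roughly 4.6 / 1.4,
               or about 3.3, which is close to the 3.5 default.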

       -T1, --hc-widen-search=1
              Controls  hill-climbing algorithm behavior.  If false (the default),
              then deme search terminates when a local hilltop is found. If  true,
              then  the search radius is progressively widened, until another ter‐
              mination condition is met.  Consider using the -D flag  to  set  the
              maximum  search  radius.   Note  that  the  size of the search space
              increases exponentially with the search  radius,  so  this  kind  of
               search becomes very rapidly intractable.

       --well-enough=1
              For  problems  with  an  enumerated  ('nominal') output, the learned
              combo program is always  of  the  form  cond(pred_1  value_1  pred_2
              value_2  ...  pred_n value_n else_val)  where pred_1 is a predicate,
              which, if true, causes the output to be  value_1.   If  false,  then
              pred_2  is tested, and so on.  If none of the predicates evaluate to
              true, then the value of the cond expression is  the  else_val.   The
              well-enough algorithm attempts to find predicates that maximize pre‐
              cision, the point being that if a perfectly precise  pred_1  can  be
              found,  then  it can be left alone ('leave well-enough alone'), thus
              simplifying the remainder of the search  problem.   Performing  this
              evaluation is costly, and may lead to a slow-down, without improving
              overall accuracy.

   Controlling RAM usage
       Deploying moses on large problems can cause all available RAM  on  a  given
       machine  to  be used up.  Actual RAM usage depends strongly on the dataset,
        and on various tuning parameters. Typical tasks require systems with dozens
       or hundreds of gigabytes of RAM.

       There  are  three primary consumers of RAM within moses: the metapopulation
       (the set of combo trees being evolved), the  deme  (the  set  of  instances
        describing the knob settings, when a single combo tree is being explored),
        and the score cache.  The following options can be used to limit the size
        of each of these to a more manageable size.

       -s num, --cache-size=num
              Enable  memoization  of candidate scores.  This allows the number of
              scoring function evaluations to be reduced, by maintaining  a  cache
              of  recently  scored  candidates. If a new candidate is found in the
              cache, that score is used, instead of a scoring function evaluation.
              The  effectiveness of memoization is greatly increased by also using
              the -d1 flag.  The maximum number of cached scores is  set  at  num,
              which  defaults to 3000. Careful: scores can be quite large, and use
              a lot of memory.

       --cap-coef=num
              A coefficient used to control the size of  the  metapopulation.  The
              metapopulation is trimmed, by removing a random subset of the lowest
              scorers, so that the maximum size is given by the formula

                  pop-size = cap_coef * (N + 250) * (1 + 2 * exp(-N/500))

              where N is the number of metapop  iterations  (tree  expansions)  so
              far.  The default value for this option is 50.
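
               For example, with the default cap_coef of 50, the cap works out
               to 50 * 250 * 3 = 37500 at the first expansion (N = 0), and to
               roughly 65000 after 500 expansions.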

               Note that the preferred way of keeping the metapopulation small is to
              set a low temperature (the --complexity-temperature option).  For  a
              given temperature, trees which have such a poor score that they have
              a probability of less than one in a billion of  being  selected  are
              automatically  discarded  from  the  metapopulation.  In most cases,
              this is enough to keep the metapopulation under control. However, if
              there are a vast number of trees with almost exactly the same score,
              the temperature alone will not be enough to control the  metapopula‐
              tion size.

              If the metapopulation is growing out of control, then do verify that
               the temperature is set to an appropriate value, before making use
              of this flag.

       --hc-resize-to-fit-ram=1
              If  set  to true, this will cause moses to try to limit memory usage
              so that it will fit in available RAM (and thus avoid an  OOM  kill).
              The  RAM usage is controlled by automatically resizing the deme dur‐
              ing the hillclimbing search. (The deme contains  instances  of  knob
              settings  that were explored).  Note that the use of this option may
              introduce  indeterminacy,  when  comparing   runs   from   different
              machines, having different amounts of installed RAM.

   Output control options
        These options control the displayed output.  Note that, by default, the ten
       best solutions are printed. These  are  printed  in  penalized-score-sorted
       order,  but  the  actual printed score is the raw, un-penalized score. This
       can lead to the printed list seeming to be in random order, which can occur
       if  the penalties are high.  The penalties can be too high, if the complex‐
       ity ratio is set too low, or the temperature is set too low.

       -C1, --output-only-best=1
              Print only the highest-scoring candidates.

       -c count, --result-count=count
              The number of results to return, ordered according to score. If neg‐
              ative, then all results are returned.

       -f filename, --log-file=filename
               Write debug log traces to filename. If not specified, traces are writ‐
              ten to moses.log.

       -F, --log-file-dep-opt
              Write debug log traces to a filename  constructed  from  the  passed
              option flags and values. The filename will be truncated to a maximum
              of 255 characters.

       -l loglevel, --log-level=loglevel
              Specify the level of detail for debug logging. Possible  values  for
              loglevel are NONE, ERROR, WARN, INFO, DEBUG, and FINE. Case does not
              matter.  Caution: excessive logging detail can lead  to  significant
              program slowdown.  The NONE option disables log file creation.  This
              may make error debugging difficult.

       -o filename, --output-file=filename
              Write results to filename. If not specified, results are written  to
              stdout.

       --python=1, --python3=1
              Output  the  highest-scoring programs as python snippets, instead of
              combo.

       -S0, --output-score=0
              Prevent printing of the score.

       -t1, --output-bscore=1
              Print the behavioral score.

       -V1, --output-eval-number=1
              Print the number of evaluations performed.

       -W1, --output-with-labels=1
              Use named labels instead of  position  place-holders  when  printing
              candidates.  For  example,  *("$temperature"  "$entropy") instead of
              *($3 $4). This option is effective only when the data file  contains
              labels in its header.

       -x1, --output-complexity=1
              Print the complexity measure of the model, and the scoring penalty.

   Precision, recall, BEP, F_1 and prerec problem types
       The  prerec,  recall,  bep,  f_one  and precision problem types are used to
       solve binary classification problems:  problems where the goal is  to  sort
       inputs  into  one  of  two sets, while maximizing either the precision, the
       sensitivity, or some other figure of merit of the test.

       In moses, precision and recall (sensitivity) are defined as usual.   Preci‐
       sion  is  defined as the number of true positives, divided by the number of
       true positives plus the number of false positives.  Classifiers with a high
       precision  make very few mistakes identifying positives: they have very few
       or no false positives.  However, precise classifiers may completely fail to
       identify  many,  if  not most positives; they just don't make mistakes when
       they do identify them.

       Recall, also known as sensitivity, is defined as the number of  true  posi‐
       tives  divided  by  the sum of the number of true positives and false nega‐
       tives.  Classifiers with high recall will identify most, or maybe even  all
       positives;  however,  they  may  also identify many negatives, thus ruining
       precision.

       A trivial way to maximize precision is to have a very low recall rate,  and
       conversely,  one  can  very  easily have a good recall rate if one does not
       mind a poor precision.  Thus a common goal is to maximize one, while  hold‐
        ing the other to a minimum standard.  One common problem is to find a classi‐
       fier with the highest possible recall, while holding precision to  a  fixed
       minimum  level;  this may be accomplished with the -Hrecall option.  Alter‐
       nately, one may desire to maximize precision, while maintaining  a  minimum
       sensitivity; this may be accomplished with the -Hprerec option.  Note that,
        although these two procedures seem superficially similar, they can often
       lead  to  dramatically different models of the input data.  This is in part
       because, during early stages, moses will choose exemplars that maximize one
       or  the  other,  thus  causing dramatically different parts of the solution
       space to be searched.

       A common alternative to maximizing one or the other is to maximize
       either the arithmetic or the harmonic mean of the two.  The arithmetic
       mean is sometimes called the "break-even point" or BEP; it is maximized
       when the -Hbep option is specified.  The harmonic mean is known as the
       F_1 score; it is maximized when the -Hf_one option is specified.

       moses also provides a second way of maximizing precision, using the
       -Hpre option.  This option searches for the test with the highest
       precision, while holding the 'activation' in a bounded range.  The
       definition of 'activation' is idiosyncratic to moses; it is the number
       of true positives plus false positives, divided by the total number of
       rows: that is, it is the fraction of rows for which the trial combo
       program returned a positive answer, regardless of whether this was the
       right answer.  Activation ranges from 0.0 to 1.0.  It is never
       desirable to maximize activation; rather, most commonly, one wants to
       peg activation at exactly the fraction of positives in the training
       set.

       The minimum level to which a fixed component should be held may  be  speci‐
       fied  with the -q or --min-rand-input option.  Thus, for the -Hrecall prob‐
       lem, the -q flag is used to specify the minimum desired  precision.   Simi‐
       larly, for the -Hprerec problem, the -q flag is used to specify the minimum
       desired recall.  For the -Hpre problem, the  -w or --max-rand-input  option
       should be used to make sure the activation does not get too high.

       The -q and -w options also set lower and upper bounds for the BEP
       problem.  When maximizing BEP, the system attempts to keep the absolute
       value of the difference between precision and recall less than 0.5.
       This maximum difference can be overridden with the -w option.

       Adherence to the bounds is done by means of a scoring penalty;  combo  pro‐
       grams that fail to lie within bounds are penalized.  The harshness or hard‐
       ness of the penalty may be specified by means of the -Q or --alpha  option.
       Values much greater than one enforce a hard boundary; values much less than
       one make for a very soft boundary.  Negative values are invalid.
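
       As an illustrative example (the file and column names are
       hypothetical), a search for a classifier that maximizes recall, while
       holding precision to at least 0.9 with a fairly hard penalty for
       violating that bound, might be run as

           moses -Hrecall -i dataset.csv -u out -q 0.9 -Q 5

       Similarly, -Hprerec -q 0.5 would ask for maximum precision while
       holding recall to at least 0.5.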

   Contin options
       Options that affect the  usage  of  continuously-valued  variables.   These
       options  specify  values  that  are  used  in  a variety of different ways,
       depending on the chosen problem type.  See appropriate  sections  for  more
       details.

       -Q hardness, --alpha=hardness
               The harshness or hardness of a limit that must be adhered to.
              Default 0.0 (limits disabled).

       -q num, --min-rand-input=num
              Minimum value for continuous variables. Default 0.0.

       -w num, --max-rand-input=num
              Maximum value for continuous variables.  Default 1.0.

       -R num, --discretize-threshold=num
               Split a continuous domain into two pieces. This option may be used
              multiple  times  to  split a continuous domain into multiple pieces:
              that is, n uses of this option will create n+1 domains.
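
       As an illustrative example (the input file and target column are
       hypothetical), the invocation

           moses -i dataset.csv -u out -R 0.5 -R 1.5

       splits the continuous domain into three pieces: values below 0.5,
       values between 0.5 and 1.5, and values above 1.5.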

   Demo options
       These options pertain to the various  built-in  demo  and  example  problem
       modes.   Such demo problems are commonly used to evaluate different machine
       learning algorithms, and are thus included here to facilitate such compari‐
       son, as well as to simplify moses regression and performance testing.

       -H type, --problem-type=type
              A  number of demonstration problems are supported. In each case, the
              top results are printed to stdout, as a score, followed by  a  combo
              program.  type may be one of:

              cp    Combo program regression. The scoring function is based on the
                    combo program specified with the -y flag. That is, the goal of
                    the run is to deduce and learn the specified combo program.

                    When  specifying  combo  programs with continuous variables in
                    them, be sure to use the -q, -w and  -b  flags  to  specify  a
                    range of input values to be sampled. In order to determine the
                    fitness of any candidate, it must be compared to the specified
                    combo program.  The comparison is done at a variety of differ‐
                    ent input values. If the range  of  sampled  input  values  is
                    inappropriate, or if there are not enough sampled values, then
                    the fitness function may select unexpected,  undesired  candi‐
                    dates.

              dj    Disjunction problem. The scoring function awards a result that
                    is a boolean disjunction (or) of N  boolean-valued  variables.
                    The resulting combo program should be or($1 $2 ...).  The size
                    of the problem may be specified with the -k option.

              mux   Multiplex problem. The scoring function models a boolean digi‐
                    tal  multiplexer,  that  is,  an  electronic  circuit where an
                    "address" of n bits selects one and only one line, out of  2^n
                    possible  lines.  Thus,  for example, a single address bit can
                     select one of two possible lines: the first, if it is
                     false, and the second, if it is true. The -k option may
                     be used to specify
                    the value of n.  The actual size of the problem,  measured  in
                    bits, is n+2^n and so increases exponentially fast.

              pa    Even parity problem.  The resulting combo program computes the
                    parity of k bits, evaluating to true if the  parity  is  even,
                    else  evaluating  to  false.   The  size of the problem may be
                    specified with the -k option.

              sr    Polynomial   regression   problem.   Given   the    polynomial
                     p(x)=x+x^2+x^3+...+x^k, this searches for the shortest program
                    consisting of nested arithmetic  operators  to  compute  p(x),
                    given  x as a free variable. The arithmetic operators would be
                    addition, subtraction, multiplication and division; exponenti‐
                     ation is not allowed in the solution.  So, for example,
                     using the -k2 option to specify the order-2 polynomial
                     x+x^2, the shortest combo program is *(+(1 $1) $1) (that
                     is, the
                    solution is p(x)=x(x+1) in the usual arithmetical notation).

       -k size, --problem-size=size
              Specify the size of the problem.  The interpretation of size depends
              on the particular problem type.

       -y prog, --combo-program=prog
              Specify  the  combo  program to be learned, when used in combination
              with the -H cp option.  Thus, for example, -H cp -y  "and(\$1  \$2)"
              specifies  that the two-input conjunction is to be learned.  Keep in
              mind that $ is a reserved character in many shells, and thus must be
              escaped with a backslash in order to be passed to moses.
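
       For example (these invocations are illustrative), the even-parity
       problem over three bits may be run with

           moses -Hpa -k3

       while the order-2 polynomial regression problem, and the combo-program
       regression of a two-input conjunction, may be run with

           moses -Hsr -k2
           moses -Hcp -y "and(\$1 \$2)"

       Each of these prints the best-scoring candidates to stdout, as a score
       followed by a combo program.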

COMPLEXITY PENALTY
       The speed with which the search algorithm can find a reasonable solution is
       significantly affected by the complexity ratio specified with the -z or  -p
       options. This section provides the theoretical underpinning for the meaning
       of these flags, and how they affect the algorithm.  The complexity
       penalty  has  two  slightly different interpretations, depending on whether
       one is considering learning a discretely-valued problem (i.e.  boolean-val‐
       ued)  or a continuously-valued problem.  The general structure of the argu‐
       ment is broadly similar for both cases; they are presented below.   Similar
       arguments apply for classification problems (learning to classify data into
       one of N categories), and for precision maximization.

   Discrete case
       Let M be the model to be learned (the combo program).  Let D be  the  data,
       assumed  to  be  a table of n inputs i_k and one output o, with each row in
       the form:

           i_1 ... i_n o

       Here, i_k is the k'th input and o the output.  In the below, we write  o  =
       D(x) where x=(i_1, ..., i_n) is an input data row.

       We  want to assess the probability P(M|D) of the model M conditioned on the
       data D.  In particular, we wish to maximize this, as it provides  the  fit‐
       ness function for the model.  According to Bayes' theorem,

           P(M|D) = P(D|M) * P(M) / P(D)

       Consider  the log likelihood LL(M) of M knowing D.  Since D is constant, we
       can ignore P(D), so:

           LL(M) = log(P(D|M)) + log(P(M))

       Assume each output of M on row x has probability p of being wrong.  So,

           P(D|M) = Prod_{x\in D} [p*(M(x) != D(x)) + (1-p)*(M(x) == D(x))]

       where D(x) is the observed result given input x.  Then,

           log P(D|M) = Sum_{x\in D} log[p*(M(x) != D(x)) + (1-p)*(M(x) == D(x))]

       Let D = D_eq \cup D_ne  where D_eq and D_ne are the sets

           D_eq = {x \in D | M(x) == D(x) }
           D_ne = {x \in D | M(x) != D(x) }

       Then

           log P(D|M) = Sum_{x\in D_ne} log(p) + Sum_{x\in D_eq} log(1-p)
                      = |D_ne| log(p) + |D_eq| log(1-p)
                      = |D_ne| log(p) + |D| log(1-p) - |D_ne| log(1-p)
                      = |D_ne| log(p/(1-p)) + |D| log(1-p)

       Here, |D| is simply the size of set D, etc.  Assuming that p is small, i.e.
       much less than one, then, to second order in p:

          log(1-p) = -p - p^2/2 + O(p^3)

       So:

          log P(D|M) = |D_ne| log(p) - p (|D| - |D_ne|) + O(p^2)

       Next,  assume  P(M) is distributed according to Solomonoff's Universal Dis‐
       tribution, approximated by (for now)

           P(M) = |A|^-|M|
                = exp(-|M|*log(|A|))

       where A is the alphabet of the model, |A| is the alphabet size, and |M|  is
       the  complexity  of the model.  Note that this distribution is identical to
       the Boltzmann distribution, for an inverse temperature of log(|A|). Putting
       it all together, the log-likelihood of M is:

           LL(M) = -|M|*log(|A|) + |D_ne| log(p/(1-p)) + |D| log(1-p)

       To  get  an  expression  usable  for a scoring function, just bring out the
       |D_ne| by dividing by -log(p/(1-p)), to get

           score(M) = - [ LL(M) - |D| log(1-p) ] / log(p/(1-p))
                    = -|D_ne| + |M|*log|A| / log(p/(1-p))
                    = -|D_ne| - |M| |C_coef|

       Note that, since p is assumed to be small (much less than one half),
       log(p/(1-p)) is negative, and so the second term is negative.  It can
       be understood as a complexity penalty.  That is, we define the
       complexity penalty as

          complexity_penalty = |M| |C_coef|

       The complexity ratio, as set by the -z option, is given by

          complexity_ratio = 1 / |C_coef|

       By contrast, the -p option may be used to set p directly, as given  in  the
       formulas  above.  The value of |A| is computed internally, depending on the
       specific problem type (discrete vs. continuous, the number of included
       or excluded operators, etc.).  The complexity of each solution is also
       computed, using an ad-hoc complexity measure.
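
       As a purely illustrative numerical example, suppose p=0.01 and |A|=4.
       Then

           |C_coef| = log(4) / log(0.99/0.01) = 1.386 / 4.595 = 0.302
           complexity_ratio = 1 / 0.302 = 3.31

       so each additional misclassified row costs one point of score, while
       each additional unit of program complexity costs about 0.3 points.
       Equivalently, a candidate may grow by roughly 3.3 complexity units for
       every extra row it classifies correctly, before the penalty outweighs
       the gain.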

   Continuous case
       A similar argument to the above holds for the case of a continuously-valued
       observable.

       Let  dP(..)  be  the  notation  for a probability density (or measure).  As
       before, start with Bayes' theorem:

           dP(M|D) = dP(D|M) * P(M) / P(D)

       Since D is constant, one may ignore P(D), and write the log likelihood
       of M knowing D as:

           LL(M) = log(dP(D|M)) + log(P(M))

       Assume that the output of the model M on input x has a Gaussian
       distribution, of mean M(x) and variance V, so that dP(D|M), the
       probability density of the data D given the model M, is:

           dP(D|M) = Prod_{x\in D} (2*Pi*V)^(-1/2) exp(-(M(x)-D(x))^2/(2*V))

       As before, assume a model distribution of

           P(M) = |A|^-|M|

       where |A| is the alphabet size and |M| the complexity of the model.
       Taking the logarithm, multiplying through by 2*V, and dropping a
       constant term that depends on neither the model nor its predictions
       (the dataset size is a constant), one can then deduce a scoring
       function:

           score(M) = -|M|*log(|A|)*2*V - Sum_{x\in D} (M(x)-D(x))^2

       As before, |M|*log(|A|)*2*V  can  be  interpreted  as  a  scoring  penalty.
       Alternatively, one may interpret each row x as a feature; then the penalty
       term |M|*log(|A|)*2*V can be interpreted as an additional feature that must
       be fit.
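
       As a purely illustrative numerical example, with |A|=4 and a variance
       of V=0.125, the penalty per unit of model complexity is
       2*0.125*log(4) = 0.347, so a candidate may grow by one complexity unit
       whenever doing so reduces the total squared error by more than about
       0.35.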

DISTRIBUTED PROCESSING
       moses  provides  two different styles of distributed processing for cluster
       computing systems.  One style is to use MPI (as implemented  in  the  Open‐
       MPI/MPICH2  systems),  the  second  is  to use SSH. The first style is best
       suited for local area networks (LANs)  and  compute  clusters.  The  second
       style  allows operation over the open Internet, but is more problematic, as
       it may require manual cleanup of log files and failed jobs.  Because MPI is
       easy  to  install and manage, it is the recommended method for distributing
       moses operation across many machines.

       When moses is run in distributed fashion, one single node, the  root  node,
       maintains control over a pool of workers that execute on remote nodes.  The
       root node maintains the set of candidate solutions, and  assigns  these  to
       the  workers  for  additional exploration, as the workers become free.  The
       results are automatically collected by  the  root,  and  are  automatically
       merged  into  the candidate population.  When termination criteria are met,
       processing will terminate on all nodes, and the root node will  report  the
       merged, best results.

   MPI
       Using MPI requires a moses binary with MPI support compiled in, with either
       the OpenMPI or the MPICH2 implementations.  MPI support in moses is enabled
       with the --mpi=1 command-line flag.  When this flag is specified, moses may
       be run as usual in an MPI environment.  Details will vary from  one  system
       configuration to another, but a typical usage might resemble the following:

       mpirun -n 15 --hostfile mpd.hosts moses --mpi=1 -j 12 <other moses params>

       The  above specifies that the run should be distributed over fifteen nodes,
       with the node names specified in the  mpd.hosts  file.   The  --mpi=1  flag
       indicates  to  moses  that it is being run in an MPI environment.  The -j12
       flag tells moses to use up to twelve threads on each node, whenever
       possible; this example assumes that each node has twelve CPU cores.

       To maximize CPU utilization, it seems best to specify two MPI instances per
       node.  This is because not all parts of moses are  parallelized,  and  some
       parts are subject to lock contention.  Thus, running multiple instances per
       node seems to be an effective way to utilize all available compute power on
       that node.  It is almost always the case that moses RAM usage is relatively
       small, and so RAM availability is rarely a problem.  The  network  utiliza‐
       tion  by moses is also very modest: the only network traffic is the report‐
       ing of candidate solutions, and so the network demands are typically in the
       range of 1 Megabit per second, and are thus easily supported on an Ethernet
       connection.  The moses workload distributes in an 'embarrassingly parallel'
       fashion,  and so there is no practical limit to scaling on small and medium
       compute clusters.
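
       For example, with OpenMPI, two instances per node might be requested
       with something like the following (the exact flags vary between MPI
       implementations and versions):

           mpirun -npernode 2 --hostfile mpd.hosts moses --mpi=1 -j 6 <other moses params>

       This places two moses processes on each listed host, each using up to
       six threads.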

       When performing input data regression, be sure that the input data file  is
       available on all nodes.   This is most easily achieved by placing the input
       data file on a shared filesystem.  Each instance of moses will write a  log
       file.   In  order  to avoid name collision on the log files, the process id
       (PID) will automatically be incorporated into the log-file  name  when  the
       --mpi=1  option  is  specified.  Log file creation can be disabled with the
       -lNONE option; however, this is not recommended, as it makes debugging  and
       progress monitoring difficult.

   SSH
       The  SSH  style of distributed processing uses the ssh command to establish
       communications and to control the pool of workers.  Although moses provides
       job  control  when  the system is running normally, it does not provide any
       mechanism for cleaning up after hung or failed jobs;  this is  outside  the
       scope of the ssh implementation.  The use of a job manager, such as LSF, is
       recommended.

       Remote machines are specified using the -j option, using  the  notation  -j
       N:REMOTE_HOST.   Here,  N  is  the  number of threads to use on the machine
       REMOTE_HOST.    For   instance,   one   can   enter   the    options    -j4
       -j16:my_server.org  (or  -j16:user@my_server.org  if  one wishes to run the
       remote job under a different user name), meaning that 4 threads  are  allo‐
       cated  on  the local machine and 16 threads are allocated on my_server.org.
       Password prompts will appear unless ssh-agent is  being  used.   The  moses
       executable  must  be  on  the remote machine(s), and located in a directory
       included in the PATH environment variable.  Beware that a lot of log  files
       are going to be generated on the remote machines.
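
       As an illustrative example (the file, column and host names are
       hypothetical), the invocation

           moses -i dataset.csv -u out -j4 -j16:user@my_server.org -m 100000

       runs the regression with four local threads and sixteen threads on
       my_server.org, stopping after 100000 evaluations.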

DIVERSITY
       moses can enforce diversity in the metapopulation, by taking into account a
       diversity pressure during ranking, effectively  pushing  down  semantically
       redundant candidates to the bottom of the metapopulation. This has two
       effects:

       1. Selected exemplars for new demes are more likely to be diverse.

       2. Top candidates output to the user at the end of learning are more likely
       to be diverse.

       The  diversity  pressure  operates  over the semantics of the candidates by
       measuring the distance between the behavioral scores of the current  candi‐
       date to be ranked and the top candidates already ranked. The exact
       algorithm is described in Section 1.3 of
       https://github.com/opencog/moses/blob/master/moses/moses/documentation/diversity_report_v0.3_disclosable.pdf.

   Diversity options
       User options can be used to determine how much pressure to apply and
       how the distance between behavioral scores should be calculated. If you
       do not wish to study these options in detail, but still want to enforce
       diversity, simply set a positive diversity pressure (typically around
       0.01) with the --diversity-pressure option, and enable autoscaling with
       --diversity-autoscale=1, as in the example below.
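
       As an illustrative example (the input file and target column are
       hypothetical):

           moses -i dataset.csv -u out --diversity-pressure=0.01 --diversity-autoscale=1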

       --diversity-pressure=num
               Set a diversity pressure on the metapopulation. Programs that
               behave similarly to others are penalized more heavily; this
               value sets the strength of that penalty. It ranges from 0 to
               +inf. If --diversity-autoscale is enabled (recommended), then
               it typically ranges from 0 to 1.  Default 0.0.

       --diversity-autoscale=bool
              Automatically  rescale  the --diversity-pressure parameter so that a
              pressure may typically range from 0 to 1, regardless of the  fitness
              function.  Although  this option is disabled by default for backward
              compatibility, it is highly recommended.  Default 0.

       --diversity-exponent=num
               Set the exponent of the generalized mean (or sum, if
               --diversity-normalize is set to 0) that aggregates the
               penalties between a candidate and the set of all candidates
               better than itself (taking diversity into account). As the
               value tends towards 0, the aggregation tends to the geometric
               mean; towards +inf, it tends to the max function. If the value
               is negative or null, the max function is used.  Default -1.0.

       --diversity-normalize=bool
               If set to 1, then the aggregating function is a generalized
               mean.  Otherwise it is a generalized sum (the generalized mean
               times the number of elements).  If --diversity-exponent is set
               to a negative value, then this option has no effect, as the
               aggregating function is the max anyway.  Default 1.

       --diversity-dst=distance
               Set the distance used to compare behavioral scores, which is
               then used to determine the uniformity penalty. Three distances
               are available: p_norm, tanimoto and angular.  Default p_norm.

       --diversity-p-norm=num
               Set the parameter of the p_norm distance. A value of 1.0
               corresponds to the Manhattan distance, and a value of 2.0 to
               the Euclidean distance.  A value of 0.0 or less corresponds to
               the component-wise max. Any other value corresponds to the
               general case.  Default 2.0.

       --diversity-dst2dp=type
               Set the type of function used to convert distance into penalty.
               Four options are available: auto, inverse, complement and
               power.  When auto is selected, the function is chosen based on
               the distance: if the distance is p_norm, then inverse is
               selected; otherwise complement is selected.  Default auto.

TODO
       Finish documenting these algo flags: --fs-scorer -M

       -R discretize target var

       These input flags: -G

       Interesting patterns flags: -J -K -U -X

SEE ALSO
       The eval-table man page.

       More information is available at http://wiki.opencog.org/w/MOSES

AUTHORS
       moses was written by Moshe Looks, Nil Geisweiller, and many others.

       This manual page is being written by Linas Vepstas. It is INCOMPLETE.

3.7.0                              March 8, 2019                          MOSES(1)