MOSES man page


Man page for MOSES (raw dump).

 MOSES(1)                       OpenCog Learning                       MOSES(1)



NAME
       moses - meta-optimizing semantic evolutionary search solver

SYNOPSIS
       moses -h|--help
       moses --version
       moses   [-H   problem_type]  [-i  input_filename]  [-u  target_feature]
       [--enable-fs=1] [-Y ignore_feature] [-m max_evals] [--max-time seconds]
       [-f logfile] [-l loglevel] [-r random_seed] [--mpi=1] [-n ignore_opera‐
       tor] [-N use_only_operator] [-o output_file] [--boost]  [-A  max_score]
       [-a  algo]  [-B  knob_effort] [-b nsamples] [-C1] [-c result_count] [-D
       max_dist] [-d1] [-E reduct_effort] [-e  exemplar]  [-F]  [-g  max_gens]
       [-I0]  [-j jobs] [-k problem_size] [-L1] [-P pop_size_ratio] [-p proba‐
       bility] [--python=1] [-Q hardness] [-q min] [-R threshold] [-S0]  [-s1]
       [--score-weight  column]  [-T1]  [-t1] [-V1] [-v temperature] [-W1] [-w
       max] [--well-enough=1] [-x1]  [-y  combo_program]  [-Z1]  [-z  complex‐
       ity_ratio]

DESCRIPTION
       moses  is  the  command-line  wrapper  to  the  MOSES  program learning
       library. It may be used to solve learning  tasks  specified  with  file
       inputs, or to run various demonstration problems.  Given an input table
       of values, it will perform a regression, generating, as output, a combo
       program  that  best  fits the data. moses is multi-threaded, and can be
       distributed across multiple machines to improve run-time performance.

EXAMPLE
       As an example, the input file of comma-separated values:

        a,b,out
        0,0,0
        0,1,0
        1,0,0
        1,1,1

       when run with the flags -Hit -u out -W1 will generate the combo program
       and($a $b).  This  program  indicates that the third column can be mod‐
       elled as the boolean-and of the first two.   The  -u  option  specifies
       that  it  is the third column that should be modelled, and the -W1 flag
       indicates that column labels, rather than  column  numbers,  should  be
       printed on output.
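
       For instance, assuming the table above has been saved to a file
       named input.csv (a hypothetical name), the complete invocation
       would be:

            moses -Hit -i input.csv -u out -W1

       moses prints each result as a score followed by a combo program,
       so a perfect fit might be reported as:

            0 and($a $b)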


OVERVIEW
       moses  implements  program  learning by using a meta-optimization algo‐
       rithm. That is, it uses two optimization algorithms, one wrapped inside
       the other, to find solutions.  The outer optimization algorithm selects
       candidate programs (called exemplars), and then  creates  similar  pro‐
       grams  by  taking an exemplar and inserting variables (called knobs) in
       selected locations. The inner optimization algorithm then searches  for
       the best possible 'knob settings', returning a set of programs (a deme)
       that provide the best fit to the data. The outer optimization algorithm
       then selects new candidates, and performs a simplification step, reduc‐
       ing these to a normal form, to create new exemplars.   The  process  is
       then  repeated  until  a  perfect score is reached, a maximum number of
       cycles is performed, or forward progress is no longer being made.
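
       In outline, the meta-algorithm described above proceeds roughly as
       follows (a paraphrase of the preceding paragraph):

            repeat:
                select an exemplar from the metapopulation    (outer loop)
                decorate the exemplar with knobs
                search for the best knob settings             (inner loop)
                reduce, score and merge the resulting deme
            until a perfect score, the cycle limit, or no more progress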

       Program reduction is performed at two distinct stages: after  inserting
       new  knobs,  and  when candidates are selected.  Reduction takes a pro‐
       gram, and transforms it into a new form, the so-called  elegant  normal
       form,  which,  roughly speaking, expresses the program in the smallest,
       most compact form possible.  Reduction is  an  important  step  of  the
       algorithm,  as  it  prevents  the  evolved  programs  from turning into
       'spaghetti code', avoiding needless complexity  that  can  cost  during
       program evaluation.

       The  operation  of  moses  can  be modified by using a large variety of
       options to control different parts of the  above  meta-algorithm.   The
       inner  optimization  step  can  be  done using one of several different
       algorithms, including hillclimbing, simulated annealing, or  univariate
       Bayesian  optimization.   The  amount of effort spent in the inner loop
       can be limited, as compared to how frequently new demes  and  exemplars
       are  chosen.  The  total number of demes or exemplars maintained in the
       population may be controlled.  The effort expended on  program  normal‐
       ization at each of the two steps can be controlled.

       In addition to controlling the operation of the algorithm, many options
       are used to describe the format of the input data, and to  control  the
       printing of the output.

       moses also has a 'demo' mode: it has a number of built-in scoring func‐
       tions for various 'well-known' optimization problems, such  as  parity,
       disjunction,  and multiplex. Each of these typically presents a differ‐
       ent set of challenges to the optimization algorithm, and  are  used  to
       illustrate  'hard' optimization problems.  These demo problems are use‐
       ful both for learning and understanding the use of moses, and also  for
       developers  and  maintainers of moses itself, when testing new features
       and enhancements.


COMBO PROGRAMS
       This section provides a brief overview of the  combo  programming  lan‐
       guage.  Combo is not meant so much for practical programming by
       humans as it is to fit the needs of program  learning.   As
       such, it is fairly minimal, while still being expressive enough so that
       all common programming constructs are simply represented in combo.   It
       is  not difficult to convert combo programs into other languages, or to
       evaluate them directly.

       In combo, all programs are represented as program trees. That is,  each
       internal  node  in  the tree is an operation and leaves are either con‐
       stants or variables. The set of all of the variables in a  program  are
       taken as the arguments to the program. The program is evaluated by sup‐
       plying explicit values for the variables, and then applying  operations
       until the root node is reached.

       By  convention,  variables  are  denoted by a dollar-sign followed by a
       number, although they may also be named. Variables are  not  explicitly
       typed;  they are only implicitly typed during program evaluation. Thus,
       for example, or($1 $2) is a combo program that represents the boolean-
       or  of  two  inputs,  $1  and $2.   Built-in constants include true and
       false; numbers may be written with or without a decimal point.

       Logical operators include and, or and not.

       Arithmetic operators include +, -, * and /  for  addition,  subtraction,
       multiplication  and  division.  Additional arithmetic operators include
       sin, log, exp and rand. The rand operator returns a value in the  range
       from 0.0 to 1.0.  The predicate operator 0< (greater than zero) takes a
       single continuous-valued expression, and returns a boolean. The  opera‐
       tor impulse does the opposite: it takes a single boolean-valued expres‐
       sion, and returns 0.0 or 1.0, depending on whether the argument  evalu‐
       ates to false or true.

       The  cond operator takes an alternating sequence of predicates and val‐
       ues, terminated by an 'else' value.  It evaluates as  a  chain  of  if-
       then-else  statements, so that it returns the value following the first
       predicate to evaluate to true.
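
       For example, a hypothetical three-way condition may be written as:

            cond( 0<($x) 1  0<($y) 2  3 )

       This evaluates to 1 if $x is positive; otherwise to 2 if $y is
       positive; otherwise to the 'else' value 3.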

       Thus, as an example, the expression and( 0<( +($x 3))   0<($y))  evalu‐
       ates to true if $x is greater than -3 and $y is greater than zero.

       The  above operators are built-in; combo may also be extended with cus‐
       tom-defined operators, although C++ programming  and  recompilation  is
       required for this.

OPTIONS
       Options  fall  into four classes: those for specifying inputs, control‐
       ling operation, printing output, and running moses in demo mode.


   General options
       -e exemplar
              Specify an initial exemplar to use. Each exemplar is written  in
              combo. May be specified multiple times.

       -h, --help
              Print command options, and exit.

       -j num, --jobs=num
              Allocate  num  threads  for  deme  optimization.  Using multiple
              threads will speed the search on  a  multi-core  machine.   This
              flag  may also be used to specify remote execution; see the sec‐
              tion Distributed processing below.

       --version
              Print program version, and exit.

   Problem-type options
       MOSES is able to handle a variety of different 'problem types', such as
       regression,  categorization and clustering, as well as a number of demo
       problems, such as parity and factorization.  The -H option is  used  to
       specify  the problem type; the demo problem types are listed in a later
       section. This section lists the various  "table"  problems,  where  the
       goal is to train moses on an input table of values.


       -H type, --problem-type=type
              The type of problem may be one of:

              it       Regression on an input table.  That is, the input table
                       consists of a set of columns, all  but  one  considered
                       'inputs', and one is considered an output.  The goal of
                       regression is to learn a combo program that most  accu‐
                       rately  predicts  the  output.   For boolean-valued and
                       enumerated outputs, the scoring function simply  counts
                       the  number of incorrect answers, and tries to minimize
                       this score.  That is, this scorer attempts to  maximize
                       accuracy  (defined  below).  For contin-valued outputs,
                       the mean-square variation is minimized.

              pre      Regression on an  input  table,  maximizing  precision,
                       instead  of accuracy (that is, minimizing the number of
                       false positives, at the risk of  sometimes  failing  to
                       identify  true  positives).  Maximization is done while
                       holding activation constant.  This scorer is ideal  for
                       learning  "domain experts", which provide a result only
                       when they are certain that the result is correct.   So,
                       for  a  binary  classifier,  the  output is meant to be
                       interpreted as "yes, certain" or "no, don't know".   An
                        ensemble of such combo expressions is commonly called
                       a "mixture of experts".


              prerec   Regression on an  input  table,  maximizing  precision,
                       while  attempting  to maintain the recall (sensitivity)
                       at or above a given level.

              recall   Regression on an input table, maximizing recall (sensi‐
                       tivity)  while  attempting to maintain the precision at
                       or above a given level.  This scorer is  most  commonly
                       used  when it is important to guarantee a certain level
                       of precision, even if it means rejecting  most  events.
                       In  medicine  and physics/radio applications, recall is
                       exactly the same  thing  as  sensitivity:  this  option
                       searches for the most sensitive test while holding to a
                       minimum level of precision.

              bep      Regression on an input table, maximizing the arithmetic
                       mean  of  the  precision  and recall, also known as the
                       "break-even point" or BEP.

              f_one    Regression on an input table, maximizing  the  harmonic
                       mean  of  the  precision  and  recall, that is, the F_1
                       score.

              ip       Discovery of "interesting predicates" that select  rows
                       from the input table. The data table is assumed to con‐
                       sist of a number of boolean-valued input columns, and a
                       contin-valued  (floating  point)  target  column. moses
                       will learn predicates that select the  most  "interest‐
                       ing"  subset  of  the rows in the table.  The values in
                       the output columns of the selected rows form  a  proba‐
                       bility distribution (PDF); this PDF is considered to be
                       "interesting" if it maximizes a linear  combination  of
                        several different measures of the PDF: the Kull‐
                       back-Leibler divergence, the skewness,  and  the  stan‐
                       dardized Mann-Whitney U statistic.

               kl       Regression on an input table, maximizing the Kull‐
                        back-Leibler divergence between the distributions
                        of the outputs.  That is, the output must still be
                        well-scored, but it is assumed that there are many
                        possible maxima.

               ann-it   Regression on an input table, using a neural network.


       For  boolean-valued  data tables, the scorers make use of the following
        generally-accepted, standard definitions:

       TP     True-positive: the sum of all rows for  which  the  combo  model
              predicts T for a data-table row marked T.

       FP     False-positive:  the  sum  of all rows for which the combo model
              predicts T for a data-table row marked F.

       FN     False-negative: the sum of all rows for which  the  combo  model
              predicts F for a data-table row marked T.

       TN     True-negative:  the  sum  of  all rows for which the combo model
              predicts F for a data-table row marked F.

       accuracy
              Defined as the formula (TP + TN) / (TP + TN + FP + FN)

       activation
              Defined as the formula (TP + FP) / (TP + TN + FP + FN)

       precision
              Defined as the formula TP / (TP + FP)

       recall Also known as sensitivity, this is defined as the formula  TP  /
              (TP + FN)

       F_1    Harmonic mean of precision and recall, defined as the formula (2
              * precision * recall) / (precision + recall) = 2TP / (2TP + FP +
              FN)

       bep    Break-even  point,  the arithmetic mean of precision and recall,
              defined as the formula (precision + recall) / 2
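
        As a worked example (with made-up counts), consider a model scored
        on 100 rows, with TP = 40, TN = 45, FP = 5 and FN = 10.  Then
        accuracy = (40+45)/100 = 0.85, activation = (40+5)/100 = 0.45,
        precision = 40/45 ≈ 0.89, recall = 40/50 = 0.80, F_1 = 80/95 ≈ 0.84
        and bep = (0.89+0.80)/2 ≈ 0.84.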




   Input specification options
       These options control how input data is specified and interpreted.   In
        its primary mode of operation, moses performs regression on a table
       of input data. One column is designated as the  target,  the  remaining
       columns  are  taken as predictors.  The output of regression is a combo
       program that is a function of the predictors, reproducing the target.

       Input files should consist of ASCII data, separated by commas or white‐
       space.   The  appearance of # ; or ! in the first column denotes a com‐
       ment line; this line will be ignored. The first non-comment row, if  it
       is  also non-numeric, is taken to hold column labels. The target column
       may be specified using the -u option with a column name.  The  printing
       of column names on output is controlled with the -W1 flag.

       -i filename, --input-file=filename
              The filename specifies the input data file. The input table must
              be in 'delimiter-separated value' (DSV) format.   Valid  separa‐
              tors are comma (CSV, or comma-separated values), blanks and tabs
              (whitespace). Columns correspond to features; there is one  sam‐
               ple per (non-blank) row. Comment characters are hash, bang and
               semicolon (# ! ;); lines starting with a comment character are
               ignored.  The
              -i  flag  may  be specified multiple times, to indicate multiple
              input files. All files must have the same number of columns.

       -u column, --target-feature=column
              The column is used as the target feature to fit.  If  no  column
              is  specified, then the first column is used.  The column may be
              numeric, or it may be a column label.  If it is numeric,  it  is
              taken  to  be  the number of the column, with column 1 being the
              left-most.  If column begins with an alphabetic character, it is
              taken to be a column label.  In this case, the first non-comment
              row of the input file must contain column labels.

       -Y column, --ignore-feature=column
              The column should be ignored, and not used  as  input.   Columns
              are specified as above.  This option may be used multiple times,
              to ignore multiple columns.

       --score-weight=column
              The column is used to weight the score for  each  row.  If  this
              option  is  not  used,  then  each  row in the table contributes
              equally to the evaluation of the accuracy of the learned  model.
              However,  if  some  rows  are  more important than others to get
              right, this option can be used to indicate those rows. The accu‐
              racy of the model will be weighted by this number, when evaluat‐
              ing the score.  A weight of zero effectively causes the  row  to
               be ignored. A negative weight encourages the system to learn mod‐
              els that get the row incorrect.  For boolean problems,  this  is
              the  same as flipping the output value.  This option can only be
              used once, and, if used, it should specify a  column  containing
              an integer or floating-point value.

       -b num, --nsamples=num
              The  number  of  samples  to be taken from the input file. Valid
              values run between 1 and the number of rows in  the  data  file;
              other  values  are  ignored.  If this option is absent, then all
              data rows are used.  If this option is present, then  the  input
              table is sampled randomly to reach this size.

       -G1, --weighted-accuracy=1
               This option is only used with the discretize_contin_bscore
               scorer (i.e. when --discretize-threshold is used). If enabled,
               the score corresponds to weighted accuracy, which is useful
               for unbalanced data.


       --balance=1
              If  the table has discrete output type (like bool or enum), bal‐
              ance the resulting ctable so all classes have the same weight.
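
        As an example, a hypothetical invocation using several of the
        options above, regressing the column labelled price, ignoring the
        column labelled date, and randomly sampling 1000 rows:

            moses -Hit -i data.csv -u price -Y date -b 1000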


   Algorithm control options
       These options provide overall control  over  the  algorithm  execution.
       The  most important of these, for controlling behavior, are the -A, -a,
       -m, --max-time, -r, -v and -z flags.

       -a algorithm, --algo=algorithm
              Select the algorithm to apply to a single  deme.   This  is  the
              algorithm used in the 'inner loop': given a single exemplar dec‐
              orated with tunable knobs, this algorithm searches for the  best
              possible  knob settings.  Once these are found (or a timeout, or
              other termination condition is reached), control is returned  to
              the outer optimization loop.  Available algorithms include:

              hc   Hill-climbing.  There  are  two primary modes of operation;
                   each has strengths and  weaknesses  for  different  problem
                   types.   In  the  default  mode, one begins with an initial
                   collection of knob settings, called an instance.  The  set‐
                    ting of each knob is then varied, in turn, until one set‐
                   ting is found that most improves the  score.  This  setting
                   then becomes the new instance, and the process is repeated,
                   until  no  further  improvement  is  seen.  The   resulting
                   instance  is  a  local maximum; it is returned to the outer
                   loop.

                   The alternate mode of operation is triggered by  using  the
                   -L1  flag  (usually  with  the  -T1 flag). In this case, as
                   before, all knob settings are explored, one knob at a time.
                   After  finding  the  one knob that most improves the score,
                   the algo is done, and the resulting instance is returned to
                   the  outer  loop.  If  no knob settings improved the score,
                   then all possible settings of two knobs are  explored,  and
                   then  three, etc. until improvement is found (or the allot‐
                   ted iterations are exceeded).  In this alternate mode,  the
                   local hill is not climbed to the top; instead, any improve‐
                   ment is immediately handed back  to  the  outer  loop,  for
                   another round of exemplar selection and knob-building.  For
                   certain types of problems, including  maximally  misleading
                   problems,   this  can  arrive  at  better  solutions,  more
                   quickly,  than  the  traditional  hill-climbing   algorithm
                   described above.

              sa   Simulated  annealing.   (Deprecated).  The -D flag controls
                   the size of the neighborhood that is  searched  during  the
                   early,  "high-temperature"  phase.   It  has  a significant
                   effect on the run-time performance of the algorithm.  Using
                   -D2 or -D3 is likely to provide the best performance.

                   The  current  implementation of this algorithm has numerous
                   faults, making it unlikely to work well for most problems.

              un   Univariate Bayesian dependency.


       -A score, --max-score=score
              Specifies the ideal score for a desired solution; used to termi‐
              nate  search.   If the maximum number of evaluations has not yet
              elapsed (set with the -m option), and a  candidate  solution  is
              found that has at least this score, then search is terminated.

       -m num, --max-evals=num
              Perform  no  more  than num evaluations of the scoring function.
              Default value is 10000.

        --max-time=secs
              Run the optimizer for no longer than secs  seconds.   Note  that
              timing  is  polled only in a small number of points in the algo‐
              rithm; thus, actual execution time might exceed this limit by  a
              few  seconds,  or  even  many  minutes, depending on the problem
              type.  In particular, knob-building time is  not  accounted  for
              right  away,  and  thus  problems with a long knob-building time
              will exceed this limit.  If using this option, be  sure  to  set
              the  -m  option  to  some very large value.  Default value is 42
              years.
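
               For example, a hypothetical one-hour hill-climbing run on
               four threads, with the evaluation limit raised so that the
               time limit governs:

                    moses -Hit -i data.csv -a hc -j4 --max-time=3600 -m 100000000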

       -n oper, --ignore-operator=oper
              Exclude the operator  oper  from  the  program  solution.   This
              option may be used several times.  Currently, oper may be one of
              div, sin, exp, log, impulse or a variable #n.  You may  need  to
               put variables under double quotes.  This option has priority
               over the -N option.  That is, if an operator is both included
               and ignored, then it is ignored.  This option does not work with
              ANN.

       --linear-regression=1
              When attempting  to  fit  continuous-valued  features,  restrict
              searches to linear expressions only; that is, do not use polyno‐
              mials in the fit.  Specifying  this  option  also  automatically
              disables the use of div, sin, exp and log.  Note that polynomial
              regression results in search spaces  that  grow  combinatorially
               large in the number of input features; that is, for N features,
              a quadratic search will entail  O(N^2)  possibilities,  a  cubic
              search  will explore O(N^3) possibilities, and so on.  Thus, for
               any problem with more than a few dozen or a hundred features,  linear
              regression is recommended.

       -r seed, --random-seed=seed
              Use  seed as the seed value for the pseudo-random number genera‐
              tor.

       -s1, --enable-cache=1
              Enable memoization of candidate scores.  This allows the  number
              of  scoring function evaluations to be reduced, by maintaining a
              cache of recently scored candidates. If a new candidate is found
              in  the cache, that score is used, instead of a scoring function
              evaluation.   The  effectiveness  of  memoization   is   greatly
              increased by also using the -d1 flag.

       -v temperature, --complexity-temperature=temperature
              Set the "temperature" of the Boltzmann-like distribution used to
              select the next exemplar out of the metapopulation.  A  tempera‐
              ture  that  is too high or too low will make it likely that poor
              exemplars will be chosen  for  exploration,  thus  resulting  in
              excessively  long  search  times.   The  recommended temperature
              depends strongly on the type of problem being solved.  If it  is
              known  that  the problem has false maxima, and that the distance
              from the top of the false maximum to the saddle  separating  the
               false and true maxima is H, then the recommended temperature is
               30*H.  Here, H is the 'height', or difference in score, from
               the false peak to the saddle, the saddle being the highest
               mountain pass between the false and true maxima. Varying the temperature by a
              factor of 2 or 3 from this value won't affect results much.  Too
              small a temperature will typically lead to  the  system  getting
              trapped at a local maximum.

              The  demo  parity  problem  works  well  with a temperature of 5
               whereas the demo Ant trail problem requires a temperature of
              2000.


       -z ratio, --complexity-ratio=ratio
              Fix  the  ratio of score to complexity, to be used as a penalty,
              when ranking the metapopulation for fitness. This ratio is meant
              to be used to limit the size of the search space, and, when used
               with an appropriate temperature, to avoid getting trapped in
              local maxima.

              Roughly  speaking,  the size of the search space increases expo‐
              nentially with the complexity of the combo trees being explored:
              more complex trees means more of them need to be examined.  How‐
              ever, more complex trees typically result in higher scores.   If
              an  increase  of  N bits in the complexity typically leads to an
              increase of s points of the score,  then  the  complexity  ratio
              should  be  set  to  about N/s.  In this way, the exploration of
               more complex trees is penalized by an amount roughly comparable
              to  the  chance  that  such complicated trees actually provide a
              better solution.
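
               As a worked example: if, on a given dataset, an increase of
               4 bits in complexity typically buys about 2 points of score,
               then N/s = 4/2 = 2, and -z2 would be a reasonable starting
               point.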

              The complexity ratio is used to calculate a scoring penalty; the
              penalty  lowers the score in proportion to the solution complex‐
              ity; specifically, the penalty is set to the complexity  divided
              by the complexity ratio.

              Setting  the  ratio  too  low causes the algorithm to ignore the
              more complex solutions, ranking them in a way so that  they  are
              not much explored. Thus, the algorithm may get trapped examining
              only the simplest solutions, which are probably inappropriate.

              Setting this ratio too high will prevent a  good  solution  from
              being  found.   In such cases, the algorithm will spend too much
               time evaluating overly-complex solutions, blithely leaving sim‐
              pler, better solutions unexplored.

              The  relationship  between  the  score change and the complexity
              change is very strongly data-dependent, and must (currently)  be
              manually determined (although it might be possible to measure it
              automatically).  Input data tables with lots of almost-duplicate
              data may have very low ratios; complex problems with sparse data
              may have very high ratios.  Initial recommended values would  be
              in  the  range from 1 to 5; with 3.5 as the default.  The parity
              demo problem works well with the 3.5 default, the Ant trail demo
              problem works well with 0.16.


       -Z1, --hc-crossover=1
              Controls   hill-climbing  algorithm  behavior.   If  false  (the
              default), then the entire local neighborhood of the current cen‐
              ter  instance  is explored. The highest-scoring instance is then
              chosen as the new center instance, and the process is  repeated.
              For  many  datasets, however, the highest-scoring instances tend
              to cluster together, and so an  exhaustive  search  may  not  be
              required.  When this option is specified, a handful of the high‐
              est-scoring instances are crossed-over (in the genetic sense  of
              cross-over)  to  create new instances.  Only these are evaluated
              for fitness; the exhaustive search step is  skipped.   For  many
              problem  types,  especially those with large neighborhoods (i.e.
              those with high program complexity), this can lead to an  order-
              of-magnitude  speedup,  or more.  For other problem types, espe‐
              cially those with deceptive scoring  functions,  this  can  hurt
              performance.


       --boost=1
              Enables  the use of boosting. Currently, only the AdaBoost algo‐
              rithm is implemented.  If boosting is enabled, then  the  system
              focuses on learning combo programs that correctly classify those
              values that were mis-classified in the previous round of  learn‐
              ing.  The output is a set of weighted combo trees.

              At  this time, boosting can only be used to learn binary classi‐
              fiers: that is, to learn problems in which the scorer  can  pro‐
              vide a yes/no answer.


   Boosted dynamic feature selection
       Problems  with a large number of input features (typically, hundreds or
       more) can lead to excessively long run-times, and overwhelming  amounts
       of  memory usage.  Such problems can be tamed in two different ways: by
       static feature pre-selection (i.e. by giving moses  fewer  features  to
       work  with) or by reducing the number of features considered during the
       knob-building stage.

       Recall that, as moses runs, it iterates over two loops: an  outer  loop
       where  an  exemplar  is selected and decorated with knobs, and an inner
       loop, where the knobs are 'turned', searching for  the  best  knob-set‐
       tings.   Each  knob  corresponds  to  one feature added to a particular
       location in the exemplar tree.  A single, given feature  will  then  be
       used to build a number of different knobs, each in a different place in
       the tree.  As trees get larger, there are more places in each tree that
       can  be decorated with knobs; thus, the number of knobs, and the search
       space, increases over time.  If there are M places in the tree that can
       be decorated, and there are N features, then M*N knobs will be created.
       The search space is exponential in the number of knobs;  so,  e.g.  for
       boolean problems, three knob settings are explored: present, absent and
       negated, leading to a search space of 3^(M*N).   Worse,  the  resulting
       large trees also take longer to simplify and evaluate.  Clearly, limit‐
       ing the number of knobs created can be a good strategy.

       This can  be  done  with  dynamic,  boosted  feature  selection.   When
       enabled,  a feature-selection step is performed, to find those features
        that are most likely to improve on the exemplar.  These are the fea‐
       tures  that  will  be strongly (anti-)correlated with incorrect answers
       from the currently-selected exemplar.  By  using  only  these  features
       during  knob  building,  the  total  number  of  knobs  can  be sharply
        decreased.  The resulting size of the decorated tree is smaller, and the
       total search space is much smaller.

       Dynamic,  boosted  feature  selection  differs from static feature pre-
       selection in many important ways.  In dynamic selection, the total num‐
       ber  of features available to moses is not reduced: just because a fea‐
       ture was not used in a given round of  knob-decoration  does  not  mean
       that it won't be used next time.  Dynamic feature selection also resem‐
       bles boosting, in that it focuses on fixing the wrong  answers  of  the
       current  exemplar.   By contrast, static pre-selection permanently dis‐
       cards features, making them unavailable to moses; it is also unable  to
       predict  which  features are the most likely to be useful during itera‐
       tion.

       Boosted, dynamic, feature selection is enabled with  the  --enable-fs=1
       option.   The  number of features to use during knob building is speci‐
       fied using the --fs-target-size option.  A number of  additional  flags
       control  the  behaviour  of  the feature selection algorithm; these are
       best left alone; the defaults should be adequate for almost  all  prob‐
       lems.   The  man page for the feature-selection command describes these
       in greater detail.
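
        For example, a hypothetical invocation enabling boosted, dynamic
        feature selection, with knob building limited to the 20 most
        promising features:

            moses -Hit -i data.csv --enable-fs=1 --fs-target-size=20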


       --enable-fs=1
              Enable integrated feature selection.  Feature selection is  dis‐
              abled by default.

       --fs-target-size=num
              Select num features for use.  This argument is mandatory if fea‐
              ture selection is enabled.

       --fs-algo
              Choose the feature-selection algorithm.   Possible  choices  are
              simple,  inc,  smd, hc and random.  The default value is simple,
              which is the fastest and best algo.  The simple  algorithm  com‐
              putes  the  pair-wise mutual information (MI) between the target
              and each feature, and then selects the list of num features with
              the  highest  MI.   It is strongly recommended that this algo be
              used.  The smd algorithm implements  "stochastic  mutual  depen‐
              dency",  by  computing the MI between a candidate featureset and
              the target.  Starting with an initially empty  featureset,  fea‐
              tures  are randomly added one-by-one, with the MI then computed.
              Only the featureset with the highest MI is kept; the process  is
              then  repeated  until  the  featureset has the desired number of
              features in it, or the MI has stopped increasing.  Note that the
              smd algorithm effectively prevents redundant features from being
              added to the featureset.  Note that smd runs orders of magnitude
              more  slowly  than  simple, and probably does not provide better
               results.  The inc algorithm is similar to the smd algo, except that it
              adds  many  features,  all  at once, to the featureset, and then
              attempts to find  the  redundant  features  in  the  featureset,
              removing  those.  This is iteratively repeated until the desired
              number of features is obtained.  Note that inc  runs  orders  of
              magnitude more slowly than simple, and probably does not provide
              better results.  The hc algorithm runs  moses  hill-climbing  to
              discover those features most likely to appear.  The random algo‐
              rithm selects features randomly.  It is useful only for limiting
               the number of knobs created, but not otherwise slanting the
              choice of features.

       --fs-threshold=num
              Set the minimum threshold for selecting a feature.

       --fs-inc-redundant-intensity=fraction
              When using the inc algorithm, set the threshold to reject redun‐
              dant features.

       --fs-inc-target-size-epsilon=tolerance
              When using the inc algorithm, set the smallest step size used.

       --fs-inc-interaction-terms=num_terms
              When  using the inc algorithm, set the number of terms used when
              computing the joint entropy.

       --fs-hc-max-score
              TODO write description

       --fs-hc-confidence-penalty-intensity
              TODO write description

       --fs-hc-max-evals
              TODO write description

       --fs-hc-fraction-of-remaining
              TODO write description


   Large problem parameters
       Problems with a large number of features (100 and above)  often  evolve
       exemplars  with  a  complexity  of  100 or more, which in turn may have
       instances with hundreds of thousands of nearest  neighbors.   Exploring
       one  nearest  neighbor requires one evaluation of the scoring function,
       and so an exhaustive search can be prohibitive.  A partial  search  can
       often work quite well, especially when cross-over is enabled.  The fol‐
       lowing flags control such partial searches.

       --hc-max-nn-evals=num
              Controls hill-climbing algorithm behavior.  When  exploring  the
              nearest  neighborhood  of an instance, num specifies the maximum
              number of nearest neighbors to explore.  An exhaustive search of
              the nearest neighborhood is performed when the number of nearest
              neighbors is less than this value.

       --hc-fraction-of-nn=frac
              Controls hill-climbing algorithm behavior.   When exploring  the
              nearest  neighborhood  of an instance,  frac specifies the frac‐
               tion of the nearest neighborhood to explore.  As currently imple‐
              mented,  only  an  estimate  of the nearest-neighborhood size is
              used, not the true size.  However, this estimate is accurate  to
              within  a  factor of 2.  Thus, to obtain an exhaustive search of
              the entire neighborhood, set this to 2.0 or larger.


   Algorithm tuning options
       These options allow the operation of the algorithm to be fine-tuned for
       specific  applications.   These  are "advanced" options; changing these
       from the default is likely to worsen algorithm behavior in all but cer‐
       tain special cases.

       -B effort, --reduct-knob-building-effort=effort
              Effort  allocated  for reduction during the knob-building stage.
              Valid values are in the range 0-3, with 0 standing  for  minimum
              effort, and 3 for maximum effort. Larger efforts result in demes
              with fewer knobs, thus lowering the  overall  dimension  of  the
              problem.  This  can  improve performance by effectively reducing
              the size of the problem.  The default effort is 2.

        -D dist, --max-dist=dist
              The maximum radius of the neighborhood around  the  exemplar  to
              explore.  The default value is 4.

       -d1, --reduce-all=1
              Reduce  candidates  before  scoring  evaluation. Otherwise, only
              dominating candidates are reduced, just before  being  added  to
              the metapopulation.  This flag may be useful if scoring function
               evaluation expense depends strongly on the structure of the
              candidate.  It  is  particularly  important to specify this flag
              when memoization is enabled (with -s1).

       -E effort, --reduct-candidate-effort=effort
              Effort allocated for reduction of candidates. Valid  values  are
              in  the range 0-3, with 0 standing for minimum effort, and 3 for
              maximum effort. For certain very symmetric problems, such as the
              disjunct  problem,  greater  reduction can lead to significantly
              faster solution-finding.  The default effort is 2.

       -g num, --max-gens=num
              Create and optimize no more than num  demes.   Negative  numbers
              are  interpreted as "unlimited". By default, the number of demes
              is unlimited.

       --discard-dominated=1
              If specified, the "dominated" members of the metapopulation will
              be  discarded.   A  member  of the metapopulation is "dominated"
              when some existing member of the metapopulation scores better on
              *every*  sample in the scoring dataset. Naively, one might think
              that an individual that does worse, in every  possible  way,  is
              useless,  and can be safely thrown away.  It turns out that this
              is a bad assumption; dominated individuals,  when  selected  for
              deme  expansion,  often have far fitter off-spring than the off-
              spring of the top-scoring (dominating) members of the  metapopu‐
              lation.  Thus,  the "weak", dominated members of the metapopula‐
              tion are important for ensuring the vitality of the  metapopula‐
              tion  as a whole, and are discarded only at considerable risk to
              the future adaptability of the overall population.  Put  another
              way: specifying this flag makes it more likely that the metapop‐
              ulation will get trapped in a non-optimal local maximum.

              Note that the algorithms to compute domination are  quite  slow,
               and so keeping dominated individuals has a double benefit: not
              only is the metapopulation healthier, but metapopulation manage‐
              ment runs faster.


       -L1, --hc-single-step=1
              Single-step, instead of hill-climbing to the top of a hill. That
              is, a single uphill step is taken, and the resulting best  demes
              are folded back into the metapopulation.  Solving then continues
              as usual. By default, the hill-climbing algorithm does not  sin‐
              gle-step;  it  instead  continues  to the top of the local hill,
              before folding the resulting demes back into the metapopulation.
              If  using  this  flag,  consider using the -T1 flag to allow the
              search to be widened, so that if the initial exemplar is already
              at  the  top  of  a local hill, a search is made for a different
              (taller) hill.

       -N oper, --include-only-operator=oper
              Include the operator oper, but exclude others, in the  solution.
              This option may be used several times to specify multiple opera‐
              tors.  Currently, oper may be one of plus, times, div, sin, exp,
              log,  impulse  or a variable #n.  Note that variables and opera‐
              tors are treated separately, so that including only some  opera‐
              tors  will  still include all variables, and including only some
               variables will still include all operators.  You may need to put
              variables  under  double quotes.  This option does not work with
              ANN.

       -P num, --pop-size-ratio=num
               Controls the amount of time spent on a deme. The default value is 20.

       -p fraction, --noise=fraction
              This option provides an alternative means of  setting  the  com‐
              plexity  ratio.  If specified, it over-rides the -z option.  For
              discrete problems, fraction can  be  interpreted  as  being  the
              fraction  of  score values that are incorrect (e.g. due to noisy
              data).  As such, only values in the range 0 < fraction < 0.5 are
              meaningful  (i.e.  less  than half of the data values are incor‐
              rect).  Typical recommended values are in the range of 0.001  to
              0.05.   For continuous-valued problems, it can be interpreted as
              the standard deviation of a  Gaussian  noise  in  the  dependent
              variable.

              For the discrete problem, the complexity ratio is related to the
              fraction  p by the explicit formula:

                  complexity_ratio = - log(p/(1-p)) / log |A|

              where |A| is the (problem-dependent) alphabet size.   See  below
              for a detailed explanation.
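
               As a worked example, taking p = 0.01 and an illustrative
               alphabet size of |A| = 10 (the actual size is problem-
               dependent): p/(1-p) ≈ 0.0101, so complexity_ratio =
               -log(0.0101)/log(10) ≈ 4.6/2.3 ≈ 2.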


       -T1, --hc-widen-search=1
              Controls   hill-climbing  algorithm  behavior.   If  false  (the
              default), then deme search terminates when a  local  hilltop  is
              found. If true, then the search radius is progressively widened,
              until another termination condition is met.  Consider using  the
              -D flag to set the maximum search radius.


       --well-enough=1
              For  problems with an enumerated ('nominal') output, the learned
              combo program is always of the form cond(pred_1  value_1  pred_2
              value_2  ...  pred_n value_n else_val)  where pred_1 is a predi‐
              cate, which, if true, causes  the  output  to  be  value_1.   If
              false,  then pred_2 is tested, and so on.  If none of the predi‐
              cates evaluate to true, then the value of the cond expression is
              the else_val.  The well-enough algorithm attempts to find predi‐
              cates that maximize precision, the point being that  if  a  per‐
              fectly  precise  pred_1  can be found, then it can be left alone
              ('leave well-enough alone'), thus simplifying the  remainder  of
              the  search  problem.  Performing this evaluation is costly, and
              may lead to a slow-down, without improving overall accuracy.


   Output control options
        These options control the displayed output.  Note that, by default,
        the ten best solutions are printed. These are printed in penalized-
        score-sorted order, but the actual printed score is the raw, un-
        penalized score. This can make the printed list appear to be in
        random order, which happens when the penalties are large; the
        penalties can become too large if the complexity ratio or the
        temperature is set too low.


       -C1, --output-dominated=1
              Print all of the final metapopulation, and not just the highest-
              scoring candidates.

       -c count, --result-count=count
              The  number  of  non-dominated (best) results to return, ordered
              according to score. If negative, then all results are  returned,
              including the dominated results.

       -f filename, --log-file=filename
               Write debug log traces to filename. If not specified, traces are
              written to moses.log.

       -F, --log-file-dep-opt
              Write debug log traces to a filename constructed from the passed
              option  flags  and  values.  The filename will be truncated to a
              maximum of 255 characters.

       -l loglevel, --log-level=loglevel
              Specify the level of detail for debug logging.  Possible  values
              for  loglevel are NONE, ERROR, WARN, INFO, DEBUG, and FINE. Case
              does not matter.  Caution: excessive logging detail can lead  to
              significant program slowdown.  The NONE option disables log file
              creation.  This may make error debugging difficult.

       -o filename, --output-file=filename
              Write results to filename. If not specified, results are written
              to stdout.

       --python=1
              Output  the highest-scoring programs as python snippets, instead
              of combo.

       -S0, --output-score=0
              Prevent printing of the score.

       -t1, --output-bscore=1
              Print the behavioral score.

       -V1, --output-eval-number=1
              Print the number of evaluations performed.

       -W1, --output-with-labels=1
              Use named labels instead of position place-holders when printing
              candidates. For example, *("$temperature" "$entropy") instead of
              *($3 $4). This option is effective only when the data file  con‐
              tains labels in its header.

       -x1, --output-complexity=1
              Print  the  complexity  measure  of  the  model, and the scoring
              penalty.


   Precision, recall, BEP, F_1 and prerec problem types
       The prerec, recall, bep, f_one and precision problem types are used  to
       solve  binary  classification  problems:  problems where the goal is to
       sort inputs into one of two sets, while maximizing  either  the  preci‐
       sion, the sensitivity, or some other figure of merit of the test.

       In  moses,  precision  and  recall  (sensitivity) are defined as usual.
       Precision is defined as the number of true positives,  divided  by  the
       number  of  true positives plus the number of false positives.  Classi‐
       fiers with a high precision make very few  mistakes  identifying  posi‐
       tives:  they  have  very  few  or no false positives.  However, precise
       classifiers may completely fail to identify many,  if  not  most  posi‐
       tives; they just don't make mistakes when they do identify them.

       Recall,  also  known  as  sensitivity, is defined as the number of true
       positives divided by the sum of the number of true positives and  false
       negatives.   Classifiers  with high recall will identify most, or maybe
       even all positives; however, they may  also  identify  many  negatives,
       thus ruining precision.

       A  trivial way to maximize precision is to have a very low recall rate,
       and conversely, one can very easily have a good recall rate if one does
       not  mind  a  poor  precision.   Thus a common goal is to maximize one,
        while holding the other to a minimum standard.  One common problem is
        to find a classifier with the highest possible recall, while holding pre‐
       cision to a fixed minimum level; this  may  be  accomplished  with  the
       -Hrecall  option.   Alternately,  one may desire to maximize precision,
       while maintaining a minimum sensitivity; this may be accomplished  with
        the -Hprerec option.  Note that, although these two procedures seem
       superficially similar, they can often lead  to  dramatically  different
       models  of  the  input  data.   This  is  in part because, during early
       stages, moses will choose exemplars that maximize  one  or  the  other,
       thus  causing  dramatically different parts of the solution space to be
       searched.

       A common alternative to maximizing one or  the  other  is  to  maximize
        either the arithmetic or the harmonic mean of the two.  The arithmetic
       mean is sometimes called the "break-even point" or BEP; it is maximized
       when  the -Hbep option is specified.  The harmonic mean is known as the
        F_1 score; it is maximized when the -Hf_one option is specified.

       moses also provides a second way of  maximizing  precision,  using  the
       -Hpre  option.  This option searches for the test with the highest pre‐
       cision, while holding the 'activation' in a bounded range.  The defini‐
        tion of 'activation' is idiosyncratic to moses; it is defined as the
        number of true positives plus false positives, divided by the total
        number of rows: that is, it is the fraction of rows for which the
        trial combo program returned a positive answer,
       regardless of whether this was the  right  answer.   Activation  ranges
       from  0.0,  to  1.0.   It  is  never  desirable to maximize activation;
       rather, most commonly, one wants to peg activation at exactly the frac‐
       tion of positives in the training set.

       The  minimum  level  to  which  a fixed component should be held may be
       specified with the -q or --min-rand-input option.  Thus, for the  -Hre‐
       call problem, the -q flag is used to specify the minimum desired preci‐
       sion.  Similarly, for the -Hprerec problem, the  -q  flag  is  used  to
       specify  the minimum desired recall.  For the -Hpre problem, the  -w or
       --max-rand-input option should be used to make sure the activation does
       not get too high.

        The -q and -w options also set lower and upper bounds for the BEP
        problem.  When maximizing BEP, the system attempts to keep the
       absolute value of the difference between precision and recall less than
       0.5.  This maximum difference can be over-ridden with the -w option.

       Adherence to the bounds is done by means of a  scoring  penalty;  combo
       programs  that  fail to lie within bounds are penalized.  The harshness
       or hardness of the penalty may be specified  by  means  of  the  -Q  or
       --alpha  option.  Values much greater than one enforce a hard boundary;
       values much less than one make for a very soft boundary.  Negative val‐
       ues are invalid.
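
        As an example, a hypothetical run searching for the most sensitive
        classifier whose precision is held at or above 0.9, with a fairly
        hard boundary on that limit:

            moses -Hrecall -i data.csv -q 0.9 -Q 10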


   Contin options
       Options  that affect the usage of continuously-valued variables.  These
       options specify values that are used in a variety  of  different  ways,
       depending  on  the  chosen  problem type.  See appropriate sections for
       more details.

       -Q hardness, --alpha=hardness
               The harshness or hardness of a limit that must be adhered to.
              Default 0.0 (limits disabled).

       -q num, --min-rand-input=num
              Minimum value for continuous variables. Default 0.0.

       -w num, --max-rand-input=num
              Maximum value for continuous variables.  Default 1.0.

       -R num, --discretize-threshold=num
               Split a continuous domain into two pieces. This option may be
              used multiple times to split a continuous domain  into  multiple
              pieces: that is, n uses of this option will create n+1 domains.


   Demo options
       These  options pertain to the various built-in demo and example problem
       modes.  Such demo problems are  commonly  used  to  evaluate  different
       machine  learning  algorithms, and are thus included here to facilitate
       such comparison, as well as to simplify moses  regression  and  perfor‐
       mance testing.

       -H type, --problem-type=type
              A  number of demonstration problems are supported. In each case,
              the top results are printed to stdout, as a score, followed by a
              combo program.  type may be one of:

              cp    Combo program regression. The scoring function is based on
                    the combo program specified with the -y flag. That is, the
                    goal of the run is to deduce and learn the specified combo
                    program.

                    When specifying combo programs with  continuous  variables
                    in them, be sure to use the -q, -w and -b flags to specify
                    a range of input values to be sampled. In order to  deter‐
                    mine  the fitness of any candidate, it must be compared to
                    the specified combo program.  The comparison is done at  a
                    variety of different input values. If the range of sampled
                    input values is inappropriate, or if there are not  enough
                    sampled values, then the fitness function may select unex‐
                    pected, undesired candidates.

              dj    Disjunction problem. The scoring function awards a  result
                    that  is  a  boolean  disjunction (or) of N boolean-valued
                    variables.  The resulting combo program should be or($1 $2
                    ...).   The  size of the problem may be specified with the
                    -k option.

              mux   Multiplex problem.  The scoring function models a
                    boolean digital multiplexer, that is, an electronic
                    circuit where an "address" of n bits selects one and
                    only one line, out of 2^n possible lines.  Thus, for
                    example, a single address bit can select one of two
                    possible lines: the first, if it is false, and the
                    second, if it is true.  The -k option may be used to
                    specify the value of n.  The actual size of the
                    problem, measured in bits, is n+2^n, and so grows
                    exponentially with n.

              pa    Even parity problem.  The resulting combo program computes
                    the  parity of k bits, evaluating to true if the parity is
                    even, else evaluating to false.  The size of  the  problem
                    may be specified with the -k option.

              sr    Polynomial regression problem.  Given the polynomial
                    p(x)=x+x^2+x^3+...+x^k, this searches for the
                    shortest program, consisting of nested arithmetic
                    operators, that computes p(x), given x as a free
                    variable.  The allowed arithmetic operators are
                    addition, subtraction, multiplication and division;
                    exponentiation is not allowed in the solution.  So,
                    for example, if the -k2 option is used to specify the
                    order-2 polynomial x+x^2, then the shortest combo
                    program is *(+(1 $1) $1) (that is, the solution is
                    p(x)=x(x+1) in conventional arithmetic notation).


       -k size, --problem-size=size
              Specify the size of the problem.   The  interpretation  of  size
              depends on the particular problem type.

       -y prog, --combo-program=prog
              Specify  the  combo program to be learned, when used in combina‐
              tion with the -H  cp  option.   Thus,  for  example,  -H  cp  -y
              "and(\$1 \$2)" specifies that the two-input conjunction is to be
              learned.  Keep in mind that $ is a reserved  character  in  many
              shells, and thus must be escaped with a backslash in order to be
              passed to moses.
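
       For example, the following invocations run some of the demo
       problems described above; the exact candidates reported will vary
       with the random seed:

           moses -Hpa -k3
           moses -Hdj -k5
           moses -Hsr -k2
           moses -Hcp -y "and(\$1 \$2)"

       The first learns the 3-bit even-parity function, the second a
       disjunction of five boolean variables, the third the order-2
       polynomial, and the last attempts to recover the two-input
       conjunction given with the -y flag.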

Complexity Penalty
       The speed with which the search algorithm can find a reasonable
       solution is significantly affected by the complexity ratio,
       specified with the -z or -p options.  This section provides the
       theoretical underpinning for the meaning of these flags, and for
       how they affect the algorithm.  The complexity penalty has two
       slightly different interpretations, depending on whether one is
       learning a discretely-valued problem (e.g. boolean-valued) or a
       continuously-valued problem.  The general structure of the
       argument is broadly similar in both cases; both are presented
       below.  Similar arguments apply to classification problems
       (learning to classify data into one of N categories), and to
       precision maximization.


   Discrete case
       Let M be the model to be learned (the combo program).   Let  D  be  the
       data, assumed to be a table of n inputs i_k and one output o, with each
       row in the form:

           i_1 ... i_n o

       Here, i_k is the k'th input and o is the output.  In what follows,
       we write o = D(x), where x=(i_1, ..., i_n) is an input data row.

       We want to assess the probability P(M|D) of the model M
       conditioned on the data D.  In particular, we wish to maximize
       this, as it provides the fitness function for the model.  By
       Bayes' theorem,

           P(M|D) = P(D|M) * P(M) / P(D)

       Consider the log-likelihood LL(M) of M given D.  Since D is fixed,
       P(D) contributes only a constant, which may be dropped, so:

           LL(M) = log(P(D|M)) + log(P(M))

       Assume each output of M on row x has probability p of being wrong.  So,

           P(D|M) = Prod_{x\in D} [p*(M(x) != D(x)) + (1-p)*(M(x) == D(x))]

       where D(x) is the observed result given input x.  Then,

           log P(D|M) = Sum_{x\in D} log[p*(M(x) != D(x)) + (1-p)*(M(x) == D(x))]

       Let D = D_eq \cup D_ne  where D_eq and D_ne are the sets

           D_eq = {x \in D | M(x) == D(x) }
           D_ne = {x \in D | M(x) != D(x) }

       Then

           log P(D|M) = Sum_{x\in D_ne} log(p) + Sum_{x\in D_eq} log(1-p)
                      = |D_ne| log(p) + |D_eq| log(1-p)
                      = |D_ne| log(p) + |D| log(1-p) - |D_ne| log(1-p)
                      = |D_ne| log(p/(1-p)) + |D| log(1-p)

       Here,  |D| is simply the size of set D, etc.  Assuming that p is small,
       i.e. much less than one, then, to second order in p:

          log(1-p) = -p - p^2/2 + O(p^3)

       So:

          log P(D|M) = |D_ne| log(p) - p (|D| - |D_ne|) + O(p^2)

       Next, assume P(M) is distributed according  to  Solomonoff's  Universal
       Distribution, approximated by (for now)

           P(M) = |A|^-|M|
                = exp(-|M|*log(|A|))

       where A is the alphabet of the model, |A| is the alphabet size, and |M|
       is the complexity of the model.  Note that this distribution is identi‐
       cal  to  the  Boltzmann  distribution,  for  an  inverse temperature of
       log(|A|). Putting it all together, the log-likelihood of M is:

           LL(M) = -|M|*log(|A|) + |D_ne| log(p/(1-p)) + |D| log(1-p)

       To obtain an expression usable as a scoring function, isolate the
       |D_ne| term by dividing by -log(p/(1-p)), to get

           score(M) = - [ LL(M) - |D| log(1-p) ] / log(p/(1-p))
                    = -|D_ne| + |M|*log|A| / log(p/(1-p))
                    = -|D_ne| - |M| |C_coef|

       Note that, since p is much smaller than one, log(p/(1-p)) is
       negative, and so the second term is negative.  It can be
       understood as a complexity penalty.  That is, we define the
       complexity penalty as

          complexity_penalty = |M| |C_coef|

       The complexity ratio, as set by the -z option, is given by

          complexity_ratio = 1 / |C_coef|

       By contrast, the -p option may be used to set p directly, as it
       appears in the formulas above.  The value of |A| is computed
       internally, depending on the specific problem type (discrete vs.
       continuous, the number of included or excluded operators, and so
       on).  The complexity of each solution is likewise computed
       internally, using an ad hoc complexity measure.
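
       As a concrete illustration (the numbers here are hypothetical, not
       moses defaults): taking p = 0.01 and an alphabet of size |A| = 3
       gives

           |C_coef| = log(3) / log(0.99/0.01) = 1.0986 / 4.5951 = 0.239

       so that the complexity ratio 1/|C_coef| is about 4.2.  Under these
       assumptions, lengthening a program by about 4.2 units of
       complexity costs the same amount of score as one additional
       misclassified row.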


   Continuous case
       A  similar  argument to the above holds for the case of a continuously-
       valued observable.

       Let dP(..) denote a probability density (or measure).  As before,
       start with Bayes' theorem:

           dP(M|D) = dP(D|M) * P(M) / P(D)

       Since D is constant, one may ignore the normalization P(D), and
       write the log likelihood of M given D as:

           LL(M) = log(dP(D|M)) + log(P(M))

       Assume the output of the model M on input x has a Gaussian
       distribution, of mean M(x) and variance V, so that dP(D|M), the
       probability density of the data D given the model M, is:

           dP(D|M) = Prod_{x\in D} (2*Pi*V)^(-1/2) exp(-(M(x)-D(x))^2/(2*V))

       As before, assume a model distribution of

           P(M) = |A|^-|M|

       where |A| is the alphabet size and |M| the  complexity  of  the  model.
       After simplification, and after dropping a constant term that
       depends on neither the model complexity nor the dataset itself
       (the dataset size is constant), one can then deduce a scoring
       function:

           score(M) = -|M|*log(|A|)*2*V - Sum_{x\in D} (M(x)-D(x))^2

       As before, |M|*log(|A|)*2*V can be interpreted as a scoring
       penalty.  Alternatively, one may interpret each row x as a
       feature; the penalty term |M|*log(|A|)*2*V then acts as one
       additional feature that must be fit.
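
       As a concrete illustration (again with hypothetical numbers): for
       a noise variance of V = 0.25 and an alphabet of size |A| = 3, each
       unit of model complexity contributes 2 * 0.25 * log(3), or about
       0.55, to the penalty, and is thus scored as if it were 0.55 units
       of additional squared error.  The larger the assumed variance V,
       the more heavily complexity is penalized relative to accuracy.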


Distributed processing
       MOSES provides two different styles of distributed processing for
       cluster computing systems.  One style is to use MPI (as
       implemented in the OpenMPI and MPICH2 systems); the other is to
       use SSH.  The first style is best suited for local area networks
       (LANs) and compute clusters.  The second style allows operation
       over the open Internet, but is more problematic, as it may require
       manual cleanup of log files and failed jobs.  Because MPI is easy
       to install and manage, it is the recommended method for
       distributing moses across many machines.

       When  moses  is  run  in distributed fashion, one single node, the root
       node, maintains control over a pool of workers that execute  on  remote
       nodes.   The  root  node  maintains the set of candidate solutions, and
       assigns these to the workers for additional exploration, as the workers
       become  free.  The results are automatically collected by the root, and
       are automatically merged into the candidate population.  When  termina‐
       tion  criteria are met, processing will terminate on all nodes, and the
       root node will report the merged, best results.

   MPI
       Using MPI requires a moses binary with MPI support  compiled  in,  with
       either the OpenMPI or the MPICH2 implementations.  MPI support in MOSES
       is enabled with the --mpi=1 command-line flag.  When this flag is spec‐
       ified,  moses  may be run as usual in an MPI environment.  Details will
       vary from one system configuration to  another,  but  a  typical  usage
       might resemble the following:

       mpirun -n 15 --hostfile mpd.hosts moses --mpi=1 -j 12 <other moses
       params>


       The above specifies that the run should be distributed over
       fifteen nodes, with the node names listed in the mpd.hosts file.
       The --mpi=1 flag indicates to moses that it is being run in an MPI
       environment.  The -j12 flag tells moses to use up to twelve
       threads on each node, whenever possible; this example assumes that
       each node has twelve cores.

       To maximize CPU utilization, it seems best to specify two MPI instances
       per node.  This is because not all parts of moses are parallelized, and
       some  parts  are  subject  to  lock contention.  Thus, running multiple
       instances per node seems to be an effective way to utilize  all  avail‐
       able  compute  power  on  that node.  It is almost always the case that
       moses RAM usage is relatively small, and so RAM availability is  rarely
       a  problem.   The network utilization by moses is also very modest: the
       only network traffic is the reporting of candidate  solutions,  and  so
       the network demands are typically in the range of 1 Megabit per second,
       and are thus easily supported on an  Ethernet  connection.   The  moses
       workload  distributes  in  an 'embarrassingly parallel' fashion, and so
       there is no practical limit to scaling  on  small  and  medium  compute
       clusters.
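
       For example, to place two MPI instances on each of the fifteen
       hosts above, one might write something like the following; the
       exact mpirun options needed to obtain two processes per host will
       vary with the MPI implementation and the contents of the hostfile:

           mpirun -n 30 --hostfile mpd.hosts moses --mpi=1 -j 6 <other
           moses params>

       Each instance now uses six threads, so that the two instances
       together still occupy the twelve cores of each node.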

       When performing input data regression, be sure that the input data
       file is available on all nodes.  This is most easily achieved by
       placing it on a shared filesystem.  Each instance of moses
       will write a log file.  In order to avoid name  collision  on  the  log
       files, the process id (PID) will automatically be incorporated into the
       log-file name when the --mpi=1 option is specified.  Log file  creation
       can  be  disabled  with  the -lNONE option; however, this is not recom‐
       mended, as it makes debugging and progress monitoring difficult.


   SSH
       The SSH style of distributed processing uses the ssh command to  estab‐
       lish communications and to control the pool of workers.  Although moses
       provides job control when the system is running normally, it  does  not
       provide  any mechanism for cleaning up after hung or failed jobs;  this
       is outside the scope of the ssh implementation.  The use of a job  man‐
       ager, such as LSF, is recommended.


       Remote  machines  are specified using the -j option, using the notation
       -j N:REMOTE_HOST.  Here, N is the number  of  threads  to  use  on  the
       machine  REMOTE_HOST.   For  instance,  one  can  enter the options -j4
       -j16:my_server.org (or -j16:user@my_server.org if one wishes to run the
       remote  job  under  a  different user name), meaning that 4 threads are
       allocated on  the  local  machine  and  16  threads  are  allocated  on
       my_server.org.   Password prompts will appear unless ssh-agent is being
       used.  The moses executable must  be  on  the  remote  machine(s),  and
       located  in  a  directory  included  in  the PATH environment variable.
       Beware that many log files will be generated on the remote
       machines.
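
       A complete invocation might thus resemble the following, where the
       file name data.csv, the target feature out, and the host name
       my_server.org are placeholders:

           moses -i data.csv -u out -j4 -j16:user@my_server.org

       This regresses the table in data.csv against the target feature
       out, using four threads on the local machine and sixteen threads
       on my_server.org.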


TODO
       Finish documenting these algo flags: --fs-scorer -M
       --diversity-pressure --diversity-exponent --diversity-normalize
       --diversity-dst --diversity-p-norm --diversity-dst2dp

       The -R flag, as used to discretize the target variable.

       These input flags: -G

       Interesting patterns flags: -J -K -U -X


SEE ALSO
       The eval-table man page.

       More information is available at http://wiki.opencog.org/w/MOSES

AUTHORS
       moses was written by Moshe Looks, Nil Geisweiller, and many others.

       This manual page is being written by Linas Vepstas. It is INCOMPLETE.



3.6.10                         September 6, 2014                      MOSES(1)