# MOSES man page

Man page for MOSES (raw dump).

MOSES(1)                         OpenCog Learning                         MOSES(1)

NAME
moses - meta-optimizing semantic evolutionary search solver

SYNOPSIS
moses -h|--help
moses --version
moses   [-H problem_type] [-i input_filename] [-u target_feature]
[--enable-fs=1] [-Y ignore_feature] [-m max_evals] [--max-time seconds]
[-f logfile] [-l loglevel] [-r random_seed] [--mpi=1] [-n ignore_operator]
[-N use_only_operator] [-o output_file] [--boost] [-A max_score] [-a algo]
[-B knob_effort] [-b nsamples] [-C1] [-c result_count] [-D max_dist] [-d1]
[-E reduct_effort] [-e exemplar] [-F] [-g max_gens] [-I0] [-j jobs]
[-k problem_size] [-L1] [-P pop_size_ratio] [-p probability] [--python=1]
[--python3=1] [-Q hardness] [-q min] [-R threshold] [-S0] [-s cache_size]
[--score-weight column] [-T1] [-t1] [-V1] [-v temperature] [-W1] [-w max]
[--well-enough=1] [-x1] [-y combo_program] [-Z1] [-z complexity_ratio]

DESCRIPTION
moses is the command-line wrapper to the MOSES program learning library. It
may  be  used to solve learning tasks specified with file inputs, or to run
various demonstration problems.  Given an input table of  values,  it  will
perform a regression, generating, as output, a combo program that best fits
the data. moses is multi-threaded, and can be distributed  across  multiple
machines to improve run-time performance.

EXAMPLE
As an example, the input file of comma-separated values:

a,b,out
0,0,0
0,1,0
1,0,0
1,1,1

when run with the flags -Hit -u out -W1 will generate the combo program
and($a $b). This program indicates that the third column can be modelled as
the  boolean-and  of the first two.  The -u option specifies that it is the
third column that should be modelled, and the -W1 flag indicates that  col‐
umn labels, rather than column numbers, should be printed on output.
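
For example, if the table above is saved to a file named table.csv (the
filename here is arbitrary), the complete invocation would be:

moses -Hit -i table.csv -u out -W1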

OVERVIEW
moses  implements  program learning by using a meta-optimization algorithm.
That is, it uses two optimization algorithms, one wrapped inside the other,
to find solutions.  The outer optimization algorithm selects candidate pro‐
grams (called exemplars), and then creates similar programs  by  taking  an
exemplar  and inserting variables (called knobs) in selected locations. The
inner optimization algorithm then searches for the best possible 'knob set‐
tings',  returning  a set of programs (a deme) that provide the best fit to
the data. The outer optimization algorithm then selects new candidates, and
performs  a simplification step, reducing these to a normal form, to create
new exemplars.  The process is then  repeated  until  a  perfect  score  is
reached, a maximum number of cycles is performed, or forward progress is no
longer being made.

Program reduction is performed at two distinct stages: after inserting  new
knobs,  and  when  candidates are selected.  Reduction takes a program, and
transforms it into a new form, the so-called elegant  normal  form,  which,
roughly  speaking, expresses the program in the smallest, most compact form
possible.  Reduction is an important step of the algorithm, as it  prevents
the  evolved programs from turning into 'spaghetti code', avoiding needless
complexity that can cost during program evaluation.

The operation of moses can be modified by using a large variety of  options
to  control  different  parts of the above meta-algorithm.  The inner opti‐
mization step can be  done  using  one  of  several  different  algorithms,
including  hillclimbing,  simulated annealing, or univariate Bayesian opti‐
mization.  The amount of effort spent in the inner loop can be limited,  as
compared  to  how  frequently new demes and exemplars are chosen. The total
number of demes or exemplars maintained  in  the  population  may  be  con‐
trolled.   The  effort expended on program normalization at each of the two
steps can be controlled.

In addition to controlling the operation of the algorithm, many options are
used  to describe the format of the input data, and to control the printing
of the output.

moses also has a 'demo' mode: it has a number of built-in scoring functions
for  various  'well-known'  optimization problems, such as parity, disjunc‐
tion, and multiplex. Each of these typically presents a different set of
challenges to the optimization algorithm, and is used to illustrate 'hard'
optimization problems.  These demo problems are useful  both  for  learning
and understanding the use of moses, and also for developers and maintainers
of moses itself, when testing new features and enhancements.

COMBO PROGRAMS
This section provides a brief overview of the combo  programming  language.
Combo is not meant so much for general use by humans for practical program‐
ming, as it is to fit the needs of program learning.  As such, it is fairly
minimal, while still being expressive enough so that all common programming
constructs are simply represented in combo.  It is not difficult to convert
combo programs into other languages, or to evaluate them directly.

In  combo,  all  programs  are  represented as program trees. That is, each
internal node in the tree is an operation and leaves are  either  constants
or variables. The variables appearing in a program are taken as the
arguments to the program. The program is evaluated  by  supplying  explicit
values  for the variables, and then applying operations until the root node
is reached.

By convention, variables are denoted by a dollar-sign followed by a number,
although  they  may also be named. Variables are not explicitly typed; they
are only implicitly typed during program evaluation. Thus, for example,
or($1 $2) is a combo program that represents the boolean-or of two inputs,
$1 and $2.  Built-in constants include true and false; numbers may be written
with or without a decimal point.

Logical operators include and, or and not.

Arithmetic operators include +, -, * and / for addition, subtraction, multi‐
plication and division. Additional arithmetic operators include  sin,  log,
exp  and  rand.  The rand operator returns a value in the range from 0.0 to
1.0.  The predicate operator 0< (greater than zero) takes a single continu‐
ous-valued expression, and returns a boolean. The operator impulse does the
opposite: it takes a single boolean-valued expression, and returns  0.0  or
1.0, depending on whether the argument evaluates to false or true.

The  cond  operator takes an alternating sequence of predicates and values,
terminated by an 'else' value.  It evaluates as  a  chain  of  if-then-else
statements,  so  that it returns the value following the first predicate to
evaluate to true.
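
For example, the expression cond(0<($x) 1 0<(+($x 1)) 2 3) evaluates to 1
if $x is greater than zero; otherwise, to 2 if $x is greater than -1; and
to 3 otherwise.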

Thus, as an example, the expression and(0<(+($x 3)) 0<($y)) evaluates to
true if $x is greater than -3 and $y is greater than zero.

The  above  operators are built-in; combo may also be extended with custom-
defined operators, although C++ programming and recompilation  is  required
for this.

OPTIONS
Options  fall  into  four classes: those for specifying inputs, controlling
operation, printing output, and running moses in demo mode.

General options
-e exemplar
Specify an initial exemplar to use.  Each  exemplar  is  written  in
combo. May be specified multiple times.

-h, --help
Print command options, and exit.

-j num, --jobs=num
Run num threads in parallel; using multiple threads will speed the
search on a multi-core machine.  This flag may also be used to specify
remote execution; see the section Distributed processing below.
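
For example, to run the search with eight threads on one machine (the
input filename here is hypothetical):

moses -Hit -i table.csv -u out -j 8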

--version
Print program version, and exit.

Problem-type options
moses is able to handle a variety of different  'problem  types',  such  as
regression,  categorization  and  clustering,  as  well as a number of demo
problems, such as parity and factorization.  The -H option is used to spec‐
ify the problem type; the demo problem types are listed in a later section.
This section lists the various "table" problems, where the goal is to train
moses on an input table of values.

-H type, --problem-type=type
The type of problem may be one of:

it       Regression  on  an  input  table.  That is, the input table
consists of a  set  of  columns,  all  but  one  considered
'inputs',  and  one  is  considered an output.  The goal of
regression is to learn a combo program that most accurately
predicts  the  output.   For  boolean-valued and enumerated
outputs, the scoring function simply counts the  number  of
incorrect  answers, and tries to minimize this score.  That
is, this scorer  attempts  to  maximize  accuracy  (defined
below).  For  contin-valued outputs, the mean-square varia‐
tion is minimized.

pre      Regression on an input table, maximizing precision, instead
of  accuracy (that is, minimizing the number of false posi‐
tives, at the risk of sometimes failing  to  identify  true
positives).   Maximization is done while holding activation
constant.   This  scorer  is  ideal  for  learning  "domain
experts", which provide a result only when they are certain
that the result is correct.  So, for a  binary  classifier,
the  output is meant to be interpreted as "yes, certain" or
"no, don't know".  An ensemble of such combo expressions is
commonly called a "mixture of experts".

prerec   Regression  on  an input table, maximizing precision, while
attempting to maintain the recall (sensitivity) at or above
a given level.

recall   Regression  on an input table, maximizing recall (sensitiv‐
ity) while attempting to maintain the precision at or above
a  given  level.  This scorer is most commonly used when it
is important to guarantee a  certain  level  of  precision,
even  if  it  means  rejecting most events. In medicine and
physics/radio applications,  recall  is  exactly  the  same
thing  as  sensitivity:  this  option searches for the most
sensitive test while holding to a minimum level  of  preci‐
sion.

bep      Regression  on  an  input  table, maximizing the arithmetic
mean of the precision and recall, also known as the "break-
even point" or BEP.

f_one    Regression  on an input table, maximizing the harmonic mean
of the precision and recall, that is, the F_1 score.

select   Regression on an input table, selecting  a  range  of  rows
from  a  continuous-valued distribution.  The range of rows
selected are expressed as  percentiles.  That  is,  if  the
input table has N rows, then rows that are selected will be
rows N*lower_percentile through  N*upper_percentile,  where
the  rows  are ordered in ascending rank of the output col‐
umn.  That is,  this  scorer  will  rank  (sort)  the  rows
according  to the output column, and then select those only
in the indicated range.

The lower and upper percentiles can be specified with the -q
and    the    -w    options   (the   --min-rand-input   and
--max-rand-input options), and should lie between  0.0  and
1.0.

ip       Discovery of "interesting predicates" that select rows from
the input table. The data table is assumed to consist of  a
number of boolean-valued input columns, and a contin-valued
(floating point) target column. moses will learn predicates
that  select  the  most "interesting" subset of the rows in
the table.   The  values  in  the  output  columns  of  the
selected  rows  form a probability distribution (PDF); this
PDF is considered to be "interesting"  if  it  maximizes  a
linear combination of several different measures of the
PDF: the Kullback-Leibler divergence, the skewness, and the
standardized Mann-Whitney U statistic.

kl       Regression  on  an input table, by maximizing the Kullback-
Leibler divergence between the distribution of the outputs.
That  is,  the  output must still be well-scored, but it is
assumed that there are many possible maxima.

ann-it   Regression on an  input  table,  using  a  neural  network.

For boolean-valued data tables, the scorers make use of the following  gen‐
erally-accepted, standard definitions:

TP     True-positive:  the  sum  of all rows for which the combo model pre‐
dicts T for a data-table row marked T.

FP     False-positive: the sum of all rows for which the combo  model  pre‐
dicts T for a data-table row marked F.

FN     False-negative:  the  sum of all rows for which the combo model pre‐
dicts F for a data-table row marked T.

TN     True-negative: the sum of all rows for which the  combo  model  pre‐
dicts F for a data-table row marked F.

accuracy
Defined as the formula (TP + TN) / (TP + TN + FP + FN)

activation
Defined as the formula (TP + FP) / (TP + TN + FP + FN)

precision
Defined as the formula TP / (TP + FP)

recall Also known as sensitivity, this is defined as the formula TP / (TP +
FN)

F_1    Harmonic mean of precision and recall, defined as the formula  (2  *
precision * recall) / (precision + recall) = 2TP / (2TP + FP + FN)

bep    Break-even  point,  the  arithmetic  mean  of  precision and recall,
defined as the formula (precision + recall) / 2
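
As a worked example, consider the four-row AND table from the EXAMPLE
section above, scored against the (deliberately wrong) candidate model
or($a $b).  That model predicts T for three rows, of which only one is
marked T, giving TP=1, FP=2, FN=0 and TN=1.  The derived measures are:

accuracy   = (1 + 1) / (1 + 1 + 2 + 0) = 0.5
activation = (1 + 2) / (1 + 1 + 2 + 0) = 0.75
precision  = 1 / (1 + 2)               ≈ 0.33
recall     = 1 / (1 + 0)               = 1.0
F_1        = 2*1 / (2*1 + 2 + 0)       = 0.5
bep        = (0.33 + 1.0) / 2          ≈ 0.67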

Input specification options
These options control how input data is specified and interpreted.  In  its
primary mode of operation, moses performs regression on a table of input
data. One column is designated as the target,  the  remaining  columns  are
taken as predictors.  The output of regression is a combo program that is a
function of the predictors, reproducing the target.

Input files should consist of ASCII data, separated  by  commas  or  white‐
space.   The  appearance  of # ; or ! in the first column denotes a comment
line; this line will be ignored. The first non-comment row, if it  is  also
non-numeric, is taken to hold column labels. The target column may be spec‐
ified using the -u option with a column name. The printing of column  names
on output is controlled with the -W1 flag.

-i filename, --input-file=filename
The  filename specifies the input data file. The input table must be
in 'delimiter-separated value' (DSV) format.  Valid  separators  are
comma  (CSV,  or  comma-separated  values),  blanks and tabs (white‐
space). Columns correspond to features;  there  is  one  sample  per
(non-blank) row.  Comment characters are hash, bang and semicolon
(#!;); lines starting with a comment character are ignored.  The -i flag may be
specified  multiple  times,  to  indicate  multiple input files. All
files must have the same number of columns.

-u column, --target-feature=column
The column is used as the target feature to fit.  If  no  column  is
specified,  then  the  first  column  is  used.   The  column may be
numeric, or it may be a column label.  If it is numeric, it is taken
to  be  the number of the column, with column 1 being the left-most.
If column begins with an alphabetic character, it is taken to  be  a
column  label.  In this case, the first non-comment row of the input
file must contain column labels.

-Y column, --ignore-feature=column
The column should be ignored, and not used as  input.   Columns  are
specified  as  above.   This  option  may be used multiple times, to
ignore multiple columns.

--score-weight=column
The column is used to weight the score for each row. If this  option
is  not  used, then each row in the table contributes equally to the
evaluation of the accuracy of the learned model.  However,  if  some
rows are more important than others to get right, this option can be
used to indicate those rows. The  accuracy  of  the  model  will  be
weighted  by  this  number,  when evaluating the score.  A weight of
zero effectively causes the row to be  ignored.  A  negative  weight
encourages the system to learn models that get the row incorrect.
For boolean problems, this is the same as flipping the output value.
This option can only be used once, and, if used, it should specify a
column containing an integer or floating-point value.

-b num, --nsamples=num
The number of samples to be taken from the input file. Valid  values
run  between 1 and the number of rows in the data file; other values
are ignored. If this option is absent, then all data rows are  used.
If  this option is present, then the input table is sampled randomly
to reach this size.

-G1, --weighted-accuracy=1
This option is only used for the discretize_contin_bscore (when
--discretize-threshold is used). If enabled, the score corresponds
to weighted accuracy, which is useful in the case of unbalanced data.

--balance=1
If the table has discrete output type (like bool or  enum),  balance
the resulting ctable so all classes have the same weight.

Algorithm control options
These  options  provide  overall control over the algorithm execution.  The
most important of these, for controlling behavior,  are  the  -A,  -a,  -m,
--max-time, -r, -v and -z flags.
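
For example, a typical invocation setting these flags explicitly (with a
hypothetical input file) might look like:

moses -Hit -i table.csv -u out -a hc -m 100000000 --max-time=3600 -r 1 -v 5 -z 3.5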

-a algorithm, --algo=algorithm
Select  the  algorithm to apply to a single deme.  This is the algo‐
rithm used in the 'inner loop': given a  single  exemplar  decorated
with  tunable  knobs,  this algorithm searches for the best possible
knob settings.  Once these are found (or a timeout, or other  termi‐
nation condition is reached), control is returned to the outer opti‐
mization loop.  Available algorithms include:

hc   Hill-climbing. There are two primary modes of  operation;  each
has  strengths  and weaknesses for different problem types.  In
the default mode, one begins with an initial collection of knob
settings, called an instance. The settings of each knob is then
varied, in turn, until one setting is found that most  improves
the  score. This setting then becomes the new instance, and the
process is repeated, until no further improvement is seen.  The
resulting  instance  is  a local maximum; it is returned to the
outer loop.

The alternate mode of operation is triggered by using  the  -L1
flag  (usually with the -T1 flag). In this case, as before, all
knob settings are explored, one knob at a time.  After  finding
the  one  knob  that most improves the score, the algo is done,
and the resulting instance is returned to the outer loop. If no
knob settings improved the score, then all possible settings of
two knobs are explored, and then three, etc. until  improvement
is  found  (or  the allotted iterations are exceeded).  In this
alternate mode, the local hill  is  not  climbed  to  the  top;
instead,  any  improvement  is  immediately  handed back to the
outer loop, for another round of exemplar selection  and  knob-
building.   For  certain types of problems, including maximally
misleading problems, this can arrive at better solutions,  more
quickly, than the traditional hill-climbing algorithm described
above.

sa   Simulated annealing.  (Deprecated). The -D  flag  controls  the
size  of  the  neighborhood  that is searched during the early,
"high-temperature" phase.  It has a significant effect  on  the
run-time  performance  of  the  algorithm.  Using -D2 or -D3 is
likely to provide the best performance.

The current  implementation  of  this  algorithm  has  numerous
faults, making it unlikely to work well for most problems.

un   Univariate Bayesian dependency.

-A score, --max-score=score
Specifies  the ideal score for a desired solution; used to terminate
search.  If the maximum number of evaluations has  not  yet  elapsed
(set with the -m option), and a candidate solution is found that has
at least this score, then search is terminated.

-m num, --max-evals=num
Perform no more  than  num  evaluations  of  the  scoring  function.
Default value is 10000.

--max-time=secs
Run the optimizer for no longer than secs seconds.  Note that timing
is polled only in a small number of points in the  algorithm;  thus,
actual  execution  time might exceed this limit by a few seconds, or
even many minutes, depending on the problem  type.   In  particular,
knob-building  time  is not accounted for right away, and thus prob‐
lems with a long knob-building time  will  exceed  this  limit.   If
using  this  option, be sure to set the -m option to some very large
value.  Default value is 42 years.

-n oper, --ignore-operator=oper
Exclude the operator oper from the program  solution.   This  option
may  be used several times.  Currently, oper may be one of div, sin,
exp, log, impulse or a variable #n.  You may need to put variables
in double quotes.  This option takes priority over the -N
option: if an operator is both included and ignored, then it is
ignored.  This option does not work with ANN.

--linear-regression=1
When attempting to fit continuous-valued features, restrict searches
to linear expressions only; that is, do not use polynomials  in  the
fit.   Specifying this option also automatically disables the use of
div, sin, exp and log.  Note that polynomial regression  results  in
search spaces that grow combinatorially large in the number of input
features; that is, for N features, a quadratic search will entail
O(N^2) possibilities, a cubic search will explore O(N^3)
possibilities, and so on.  Thus, for any problem with more than a
few dozen features, linear regression is recommended.

-r seed, --random-seed=seed
Use seed as the seed value for the pseudo-random number generator.

-v temperature, --complexity-temperature=temperature
Set  the  "temperature"  of  the Boltzmann-like distribution used to
select the next exemplar out of the  metapopulation.  A  temperature
that  is too high or too low will make it likely that poor exemplars
will be chosen for exploration, thus resulting in  excessively  long
search  times.   The recommended temperature depends strongly on the
type of problem being solved.  If it is known that the  problem  has
false  maxima, and that the distance from the top of the false maxi‐
mum to the saddle separating the false and true maximum is  H,  then
the recommended temperature is 30*H.  Here, H is the 'height' or
difference in score from false peak to saddle, the saddle being the
highest mountain pass between the false and true maxima. Varying
the temperature by a factor of 2 or 3 from this value  won't  affect
results  much.   Too  small a temperature will typically lead to the
system getting trapped at a local maximum.

The demo parity problem works well with a temperature of 5, whereas
the demo Ant trail problem requires a temperature of 2000.
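
As a worked example of the 30*H rule: if a false maximum has a score of
-20, and the saddle separating it from the true maximum has a score of
-35, then H = 15, and the recommended setting would be -v 450.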

-z ratio, --complexity-ratio=ratio
Fix  the ratio of score to complexity, to be used as a penalty, when
ranking the metapopulation for fitness. This ratio is  meant  to  be
used  to  limit the size of the search space, and, when used with an
appropriate temperature, to avoid getting trapped in local maxima.

Roughly speaking, the size of the search  space  increases  exponen‐
tially  with  the complexity of the combo trees being explored: more
complex trees means more of them need to be examined.  However, more
complex  trees typically result in higher scores.  If an increase of
N bits in the complexity typically leads to an increase of s  points
of  the score, then the complexity ratio should be set to about N/s.
In this way, the exploration of more complex trees is penalized by an
amount  roughly comparable to the chance that such complicated trees
actually provide a better solution.
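
For example, if trees that are 10 bits more complex typically score about
2 points higher, then N/s = 10/2, and -z 5 would be a reasonable setting.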

The complexity ratio is used to calculate  a  scoring  penalty;  the
penalty  lowers  the score in proportion to the solution complexity;
specifically, the penalty is set to the complexity  divided  by  the
complexity ratio.

Setting  the  ratio  too low causes the algorithm to ignore the more
complex solutions, ranking them in a way so that they are  not  much
explored.  Thus,  the  algorithm  may get trapped examining only the
simplest solutions, which are probably inappropriate.

Setting this ratio too high will prevent a good solution from  being
found.  In such cases, the algorithm will spend too much time evalu‐
ating overly-complex solutions, blithely leaving simpler, better
solutions unexplored.

The  relationship between the score change and the complexity change
is very strongly data-dependent, and must  (currently)  be  manually
determined  (although  it  might be possible to measure it automati‐
cally).  Input data tables with lots of  almost-duplicate  data  may
have  very  low  ratios;  complex problems with sparse data may have
very high ratios.  Initial recommended values would be in the  range
from 1 to 5; with 3.5 as the default.  The parity demo problem works
well with the 3.5 default, the Ant trail  demo  problem  works  well
with 0.16.

-Z1, --hc-crossover=1
Controls  hill-climbing algorithm behavior.  If false (the default),
then the entire local neighborhood of the current center instance is
explored.  The  highest-scoring  instance  is then chosen as the new
center instance, and the process is repeated.   For  many  datasets,
however, the highest-scoring instances tend to cluster together, and
so an exhaustive search may not be required.  When  this  option  is
specified,  a  handful of the highest-scoring instances are crossed-
over (in the genetic sense of cross-over) to create  new  instances.
Only  these are evaluated for fitness; the exhaustive search step is
skipped.  For many problem types, especially those with large neigh‐
borhoods (i.e. those with high program complexity), this can lead to
an order-of-magnitude speedup, or more.  For  other  problem  types,
especially  those  with  deceptive  scoring functions, this can hurt
performance.

--hc-crossover-min-neighbors=num
If crossover is enabled, and an instance is smaller than this  size,
then  an  exhaustive  search  of the neighborhood will be performed.
Otherwise, the search will be limited to a cross-over of the highest
scoring  instances.   Exhaustive searches are more accurate, but can
take much longer, when instances are large.  Recommended values  for
this option are 100 to 500. Default is 400.

--hc-crossover-pop-size=num
If crossover is enabled, this controls the number of cross-over
instances that are created.  A few hundred is normally sufficient to
contain the actual maximum. Default value is 120.

--boost=1
Enables  the use of boosting. Currently, only the AdaBoost algorithm
is implemented.  If boosting is enabled, then the system focuses  on
learning  combo  programs  that correctly classify those values that
were mis-classified in the previous round of learning.   The  output
is a set of weighted combo trees.

At this time, boosting can only be used to learn binary classifiers:
that is, to learn problems in which the scorer can provide a yes/no
answer.

Boosted dynamic feature selection
Problems  with  a  large  number  of input features (typically, hundreds or
more) can lead to excessively long run-times, and overwhelming  amounts  of
memory  usage.  Such problems can be tamed in two different ways: by static
feature pre-selection (i.e. by giving moses fewer features to work with) or
by  reducing  the  number  of  features considered during the knob-building
stage.

Recall that, as moses runs, it iterates over two loops: an outer loop where
an  exemplar is selected and decorated with knobs, and an inner loop, where
the knobs are 'turned', searching for the best  knob-settings.   Each  knob
corresponds  to  one feature added to a particular location in the exemplar
tree.  A single, given feature will then be used to build a number of  dif‐
ferent  knobs, each in a different place in the tree.  As trees get larger,
there are more places in each tree that can be decorated with knobs;  thus,
the  number  of knobs, and the search space, increases over time.  If there
are M places in the tree that can be decorated, and there are  N  features,
then  M*N  knobs  will  be created.  The search space is exponential in the
number of knobs; so, e.g. for boolean problems,  three  knob  settings  are
explored:  present,  absent  and  negated,  leading  to  a  search space of
3^(M*N).  Worse, the resulting large trees also take longer to simplify and
evaluate.   Clearly,  limiting  the  number  of knobs created can be a good
strategy.

This can be done with dynamic, boosted feature selection.  When enabled,  a
feature-selection  step  is performed, to find those features that are most
likely to improve on the exemplar.  These are the features that will be
strongly  (anti-)correlated  with  incorrect  answers  from  the currently-
selected exemplar.  By using only these features during knob building,  the
total number of knobs can be sharply decreased.  The resulting size of the
decorated tree is smaller, and the total search space is much smaller.

Dynamic, boosted feature selection differs from static  feature  pre-selec‐
tion  in  many  important  ways.  In dynamic selection, the total number of
features available to moses is not reduced: just because a feature was  not
used  in  a  given  round of knob-decoration does not mean that it won't be
used next time.  Dynamic feature selection also resembles boosting, in that
it  focuses  on  fixing the wrong answers of the current exemplar.  By con‐
trast, static pre-selection  permanently  discards  features,  making  them
unavailable  to  moses; it is also unable to predict which features are the
most likely to be useful during iteration.

Boosted, dynamic feature selection is enabled with the --enable-fs=1
option.   The  number  of features to use during knob building is specified
using the --fs-target-size option.  A number of  additional  flags  control
the  behaviour  of  the  feature  selection  algorithm; these are best left
alone; the defaults should be adequate for almost all  problems.   The  man
page for the feature-selection command describes these in greater detail.
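
For example, a minimal invocation enabling dynamic feature selection on a
(hypothetical) wide input table might be:

moses -Hit -i wide_table.csv -u out --enable-fs=1 --fs-target-size=20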

--enable-fs=1
Enable  integrated feature selection.  Feature selection is disabled
by default.

--fs-target-size=num
Select num features for use.  This argument is mandatory if  feature
selection is enabled.

--fs-algo
Choose  the  feature-selection algorithm.  Possible choices are sim‐
ple, inc, smd, hc and random.  The default value is simple, which is
the  fastest and best algo.  The simple algorithm computes the pair-
wise mutual information (MI) between the target  and  each  feature,
and  then  selects the list of num features with the highest MI.  It
is strongly recommended that this algo be used.  The  smd  algorithm
implements  "stochastic  mutual  dependency",  by  computing  the MI
between a candidate featureset and the  target.   Starting  with  an
initially  empty featureset, features are randomly added one-by-one,
with the MI then computed.  Only the featureset with the highest  MI
is  kept;  the process is then repeated until the featureset has the
desired number of features in it, or the MI has stopped  increasing.
Note  that the smd algorithm effectively prevents redundant features
from being added to the featureset.  Note that smd  runs  orders  of
magnitude  more  slowly  than  simple, and probably does not provide
better results.  The inc is similar to the smd algo, except that  it
adds  many  features,  all  at  once,  to  the  featureset, and then
attempts to find the redundant features in the featureset,  removing
those. This is iteratively repeated until the desired number of fea‐
tures is obtained.  Note that inc  runs  orders  of  magnitude  more
slowly  than  simple,  and probably does not provide better results.
The hc algorithm runs moses hill-climbing to discover those features
most  likely  to appear.  The random algorithm selects features ran‐
domly.  It is useful only for limiting the number of knobs created,
but not otherwise slanting the choice of features.

--fs-threshold=num
Set the minimum threshold for selecting a feature.

--fs-inc-redundant-intensity=fraction
When  using the inc algorithm, set the threshold to reject redundant
features.

--fs-inc-target-size-epsilon=tolerance
When using the inc algorithm, set the smallest step size used.

--fs-inc-interaction-terms=num_terms
When using the inc algorithm, set the number of terms used when com‐
puting the joint entropy.

--fs-hc-max-score
TODO write description

--fs-hc-confidence-penalty-intensity
TODO write description

--fs-hc-max-evals
TODO write description

--fs-hc-fraction-of-remaining
TODO write description

Large problem parameters
Problems with a large number of features (100 and above) often evolve exem‐
plars with a complexity of 100 or more, which in turn  may  have  instances
with hundreds of thousands of knobs, and thus, hundreds of thousands of
nearest neighbors.  Exploring one nearest neighbor requires one  evaluation
of the scoring function, and so an exhaustive search can be prohibitive.  A
partial search can often work quite well,  especially  when  cross-over  is
enabled.  The following flags control such partial searches.

Note,  however, that usually one can obtain better results by using dynamic
feature selection, instead of using the options below to limit the  search.
The  reason  for this is that the options below cause the search to be lim‐
ited in a random fashion: the knobs to turn are selected randomly,  with  a
uniform distribution, without any guidance as to whether they might make a
difference.  By contrast, dynamic feature selection also limits the  search
space, by creating knobs only for those features most likely to make a dif‐
ference.  Unless the scoring function is particularly deceitful, limiting
the  search to the likely directions should perform better than limiting it
to a random subset.

--hc-max-nn-evals=num
Controls hill-climbing algorithm behavior.  When exploring the near‐
est neighborhood of an instance, num specifies the maximum number of
nearest neighbors to explore.  An exhaustive search of  the  nearest
neighborhood  is  performed  when the number of nearest neighbors is
less than this value.

--hc-fraction-of-nn=frac
Controls hill-climbing  algorithm  behavior.    When  exploring  the
nearest neighborhood of an instance,  frac specifies the fraction of
nearest neighborhood to explore.  As currently implemented, only  an
estimate  of  the  nearest-neighborhood  size  is used, not the true
size.  However, this estimate is accurate to within a factor  of  2.
Thus, to obtain an exhaustive search of the entire neighborhood, set
this to 2.0 or larger.

Algorithm tuning options
These options allow the operation of the algorithm  to  be  fine-tuned  for
specific  applications.   These are "advanced" options; changing these from
the default is likely to worsen algorithm behavior in all but certain  spe‐
cial cases.

-B effort, --reduct-knob-building-effort=effort
Effort  allocated  for  reduction  during  the  knob-building stage.
Valid values are in the range  0-3,  with  0  standing  for  minimum
effort,  and  3  for  maximum effort. Larger efforts result in demes
with fewer knobs, thus lowering the overall dimension of  the  prob‐
lem.  This  can improve performance by effectively reducing the size
of the problem.  The default effort is 2.

-D dist, --max-dist=dist
The maximum radius  of  the  neighborhood  around  the  exemplar  to
explore.  The default value is 4.

-d1, --reduce-all=1
Reduce  candidates  before scoring evaluation. Otherwise, only domi‐
nating candidates are  reduced,  just  before  being  added  to  the
metapopulation.  This flag may be useful if scoring function evalua‐
tion expense depends strongly on the structure of the candidate. It
is  particularly  important to specify this flag when memoization is
enabled (with -s1).

-E effort, --reduct-candidate-effort=effort
Effort allocated for reduction of candidates. Valid  values  are  in
the range 0-3, with 0 standing for minimum effort, and 3 for maximum
effort. For certain very symmetric problems, such  as  the  disjunct
problem,  greater  reduction  can lead to significantly faster solu‐
tion-finding.  The default effort is 2.

-g num, --max-gens=num
Create and optimize no more than num demes.   Negative  numbers  are
interpreted  as  "unlimited".  By  default,  the  number of demes is
unlimited.

--discard-dominated=1
If specified, the "dominated" members of the metapopulation will be
discarded.   A member of the metapopulation is "dominated" when some
existing member of the metapopulation scores better on *every*  sam‐
ple  in  the scoring dataset. Naively, one might think that an indi‐
vidual that does worse, in every possible way, is useless,  and  can
be  safely thrown away.  It turns out that this is a bad assumption;
dominated individuals, when selected for deme expansion, often  have
far  fitter off-spring than the off-spring of the top-scoring (domi‐
nating) members of the metapopulation. Thus, the  "weak",  dominated
members  of the metapopulation are important for ensuring the vital‐
ity of the metapopulation as a whole, and are discarded only at con‐
siderable risk to the future adaptability of the overall population.
Put another way: specifying this flag makes it more likely that  the
metapopulation will get trapped in a non-optimal local maximum.

Note  that  the algorithms to compute domination are quite slow, and
so keeping dominated individuals has a double benefit: not  only  is
the  metapopulation  healthier,  but  metapopulation management runs
faster.

-L1, --hc-single-step=1
Single-step, instead of hill-climbing to the top of a hill. That is,
a  single  uphill  step  is  taken, and the resulting best demes are
folded back into the  metapopulation.   Solving  then  continues  as
usual. By default, the hill-climbing algorithm does not single-step;
it instead continues to the top of the local  hill,  before  folding
the  resulting  demes  back  into the metapopulation.  If using this
flag, consider using the -T1 flag to allow the search to be widened,
so  that  if  the  initial exemplar is already at the top of a local
hill, a search is made for a different (taller) hill.

-N oper, --include-only-operator=oper
Include the operator oper, but  exclude  others,  in  the  solution.
This option may be used several times to specify multiple operators.
Currently, oper may be one of  plus,  times,  div,  sin,  exp,  log,
impulse  or  a  variable  #n.  Note that variables and operators are
treated separately, so that including only some operators will still
include  all  variables,  and  including  only  some variables still
include all operators.  You may need to put variables in double
quotes.  This option does not work with ANN.

-P num, --pop-size-ratio=num
Controls the amount of time spent on a deme. Default value is 20.

-p fraction, --noise=fraction
This  option provides an alternative means of setting the complexity
ratio.  If specified, it over-rides the  -z  option.   For  discrete
problems, fraction can be interpreted as being the fraction of score
values that are incorrect (e.g. due to noisy data).  As  such,  only
values  in  the  range  0 < fraction < 0.5 are meaningful (i.e. less
than half of the data values are  incorrect).   Typical  recommended
values  are  in  the  range of 0.001 to 0.05.  For continuous-valued
problems, it can be interpreted  as  the  standard  deviation  of  a
Gaussian noise in the dependent variable.

For  the  discrete  problem,  the complexity ratio is related to the
fraction  p by the explicit formula:

complexity_ratio = - log(p/(1-p)) / log |A|

where |A| is the (problem-dependent) alphabet size.  See below for a
detailed explanation.
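
As a worked example, take p = 0.01 and assume, purely for illustration,
an alphabet size of |A| = 10.  The formula then gives complexity_ratio =
-log(0.01/0.99) / log(10) ≈ 4.6/2.3 ≈ 2.0, so -p 0.01 corresponds
roughly to -z 2.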

-T1, --hc-widen-search=1
Controls  hill-climbing algorithm behavior.  If false (the default),
then deme search terminates when a local hilltop is found. If  true,
then  the search radius is progressively widened, until another ter‐
mination condition is met.  Consider using the -D flag  to  set  the
maximum  search  radius.   Note  that  the  size of the search space
increases exponentially with the search  radius,  so  this  kind  of
search becomes very rapidly intractable.

--well-enough=1
For  problems  with  an  enumerated  ('nominal') output, the learned
combo program is always  of  the  form  cond(pred_1  value_1  pred_2
value_2  ...  pred_n value_n else_val)  where pred_1 is a predicate,
which, if true, causes the output to be  value_1.   If  false,  then
pred_2  is tested, and so on.  If none of the predicates evaluate to
true, then the value of the cond expression is  the  else_val.   The
well-enough algorithm attempts to find predicates that maximize pre‐
cision, the point being that if a perfectly precise  pred_1  can  be
found,  then  it can be left alone ('leave well-enough alone'), thus
simplifying the remainder of the search  problem.   Performing  this
evaluation is costly, and may lead to a slow-down, without improving
overall accuracy.

Controlling RAM usage
Deploying moses on large problems can cause all available RAM  on  a  given
machine  to  be used up.  Actual RAM usage depends strongly on the dataset,
and on various tuning parameters. Typical tasks require systems with dozens
or hundreds of gigabytes of RAM.

There  are  three primary consumers of RAM within moses: the metapopulation
(the set of combo trees being evolved), the  deme  (the  set  of  instances
describing the knob settings, when a single combo tree is being explored),
and the score cache.  The following options can be used to limit the size
of each of these to a more manageable size.

-s num, --cache-size=num
Enable  memoization  of candidate scores.  This allows the number of
scoring function evaluations to be reduced, by maintaining  a  cache
of  recently  scored  candidates. If a new candidate is found in the
cache, that score is used, instead of a scoring function evaluation.
The  effectiveness of memoization is greatly increased by also using
the -d1 flag.  The maximum number of cached scores is  set  at  num,
which  defaults to 3000. Careful: scores can be quite large, and use
a lot of memory.
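
For example, to enable a larger cache together with the recommended -d1
flag (the input filename here is hypothetical):

moses -Hit -i table.csv -u out -d1 -s 100000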

--cap-coef=num
A coefficient used to control the size of  the  metapopulation.  The
metapopulation is trimmed, by removing a random subset of the lowest
scorers, so that the maximum size is given by the formula

pop-size = cap_coef * (N + 250) * (1 + 2 * exp(-N/500))

where N is the number of metapop  iterations  (tree  expansions)  so
far.  The default value for this option is 50.

Note that the preferred way of keeping the metapopulation small is to
set a low temperature (the --complexity-temperature option).  For  a
given temperature, trees which have such a poor score that they have
a probability of less than one in a billion of  being  selected  are
automatically  discarded  from  the  metapopulation.  In most cases,
this is enough to keep the metapopulation under control. However, if
there are a vast number of trees with almost exactly the same score,
the temperature alone will not be enough to control the  metapopula‐
tion size.

If the metapopulation is growing out of control, then verify that
the temperature is set to an appropriate value, before making use
of this flag.

--hc-resize-to-fit-ram=1
If  set  to true, this will cause moses to try to limit memory usage
so that it will fit in available RAM (and thus avoid an  OOM  kill).
The  RAM usage is controlled by automatically resizing the deme dur‐
ing the hillclimbing search. (The deme contains  instances  of  knob
settings  that were explored).  Note that the use of this option may
introduce  indeterminacy,  when  comparing   runs   from   different
machines, having different amounts of installed RAM.

Output control options
These options control the displayed output.  Note that, by default, the ten
best solutions are printed. These  are  printed  in  penalized-score-sorted
order,  but  the  actual printed score is the raw, un-penalized score. This
can lead to the printed list seeming to be in random order, which can occur
if  the penalties are high.  The penalties can be too high, if the complex‐
ity ratio is set too low, or the temperature is set too low.

-C1, --output-only-best=1
Print only the highest-scoring candidates.

-c count, --result-count=count
The number of results to return, ordered according to score. If neg‐
ative, then all results are returned.

-f filename, --log-file=filename
Write debug log traces to filename. If not specified, traces are writ‐
ten to moses.log.

-F, --log-file-dep-opt
Write debug log traces to a filename  constructed  from  the  passed
option flags and values. The filename will be truncated to a maximum
of 255 characters.

-l loglevel, --log-level=loglevel
Specify the level of detail for debug logging. Possible  values  for
loglevel are NONE, ERROR, WARN, INFO, DEBUG, and FINE. Case does not
matter.  Caution: excessive logging detail can lead  to  significant
program slowdown.  The NONE option disables log file creation.  This
may make error debugging difficult.

-o filename, --output-file=filename
Write results to filename. If not specified, results are written  to
stdout.

--python=1, --python3=1
Output  the  highest-scoring programs as python snippets, instead of
combo.

-S0, --output-score=0
Prevent printing of the score.

-t1, --output-bscore=1
Print the behavioral score.

-V1, --output-eval-number=1
Print the number of evaluations performed.

-W1, --output-with-labels=1
Use named labels instead of  position  place-holders  when  printing
candidates.  For example, *("$temperature" "$entropy") instead of
*($3 $4). This option is effective only when the data file contains
column labels in its header row.

-x1, --output-complexity=1
Print the complexity measure of the model, and the scoring penalty.
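
For example, to print the five best candidates with column labels,
complexity and penalty, writing results and a debug log to files (the
filenames here are hypothetical):

moses -Hit -i table.csv -u out -W1 -c 5 -x1 -o results.txt -f run.log -l DEBUG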

Precision, recall, BEP, F_1 and prerec problem types
The  prerec,  recall,  bep,  f_one  and precision problem types are used to
solve binary classification problems:  problems where the goal is  to  sort
inputs  into  one  of  two sets, while maximizing either the precision, the
sensitivity, or some other figure of merit of the test.

In moses, precision and recall (sensitivity) are defined as usual.   Preci‐
sion  is  defined as the number of true positives, divided by the number of
true positives plus the number of false positives.  Classifiers with a high
precision  make very few mistakes identifying positives: they have very few
or no false positives.  However, precise classifiers may completely fail to
identify  many,  if  not most positives; they just don't make mistakes when
they do identify them.

Recall, also known as sensitivity, is defined as the number of  true  posi‐
tives  divided  by  the sum of the number of true positives and false nega‐
tives.  Classifiers with high recall will identify most, or maybe even  all
positives;  however,  they  may  also identify many negatives, thus ruining
precision.

A trivial way to maximize precision is to have a very low recall rate,  and
conversely,  one  can  very  easily have a good recall rate if one does not
mind a poor precision.  Thus a common goal is to maximize one, while  hold‐
ing the other to a minimum standard.  One common problem is to find a classi‐
fier with the highest possible recall, while holding precision to  a  fixed
minimum  level;  this may be accomplished with the -Hrecall option.  Alter‐
nately, one may desire to maximize precision, while maintaining  a  minimum
sensitivity; this may be accomplished with the -Hprerec option.  Note that,
although these two procedures seem superficially similar, they can often
lead  to  dramatically different models of the input data.  This is in part
because, during early stages, moses will choose exemplars that maximize one
or  the  other,  thus  causing dramatically different parts of the solution
space to be searched.

A common alternative to maximizing one or the other is to maximize either
the arithmetic or the harmonic mean of the two.  The arithmetic mean is
sometimes called the "break-even point" or BEP; it is maximized when the
-Hbep option is specified.  The harmonic mean is known as the F_1 score; it
is maximized when the -Hf_one option is specified.

moses also provides a second way of maximizing precision, using  the  -Hpre
option.   This  option  searches  for  the test with the highest precision,
while holding the 'activation' in  a  bounded  range.   The  definition  of
'activation' is idiosyncratic to moses; it is defined as the number of true
positives plus false positives, divided by the total number of rows: that
is, it is the fraction of rows for which the trial combo program returned
a positive answer, regardless of whether this was the right answer.
Activation ranges from 0.0 to 1.0.  It
is never desirable to maximize activation; rather, most commonly, one wants
to peg activation at exactly the fraction of positives in the training set.

The minimum level to which a fixed component should be held may  be  speci‐
fied  with the -q or --min-rand-input option.  Thus, for the -Hrecall prob‐
lem, the -q flag is used to specify the minimum desired  precision.   Simi‐
larly, for the -Hprerec problem, the -q flag is used to specify the minimum
desired recall.  For the -Hpre problem, the  -w or --max-rand-input  option
should be used to make sure the activation does not get too high.

The  -q  and -w options also set lower and upper bounds for the BEP problem
as well.   When maximizing BEP, the system attempts to  keep  the  absolute
value  of  the difference between precision and recall less than 0.5.  This
maximum difference can be over-ridden with the -w option.

Adherence to the bounds is done by means of a scoring penalty;  combo  pro‐
grams that fail to lie within bounds are penalized.  The harshness or hard‐
ness of the penalty may be specified by means of the -Q or --alpha  option.
Values much greater than one enforce a hard boundary; values much less than
one make for a very soft boundary.  Negative values are invalid.
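
For example, to maximize precision while holding recall at or above 0.5,
with a fairly hard penalty boundary (the input filename is hypothetical):

moses -Hprerec -i table.csv -u out -q 0.5 -Q 5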

Contin options
Options that affect the  usage  of  continuously-valued  variables.   These
options  specify  values  that  are  used  in  a variety of different ways,
depending on the chosen problem type.  See appropriate  sections  for  more
details.

-Q hardness, --alpha=hardness
The harshness or hardness of a limit that must be adhered to.
Default 0.0 (limits disabled).

-q num, --min-rand-input=num
Minimum value for continuous variables. Default 0.0.

-w num, --max-rand-input=num
Maximum value for continuous variables.  Default 1.0.

-R num, --discretize-threshold=num
Split a continuous domain into two pieces. This option may be used
multiple  times  to  split a continuous domain into multiple pieces:
that is, n uses of this option will create n+1 domains.

Demo options
These options pertain to the various  built-in  demo  and  example  problem
modes.   Such demo problems are commonly used to evaluate different machine
learning algorithms, and are thus included here to facilitate such compari‐
son, as well as to simplify moses regression and performance testing.

-H type, --problem-type=type
A  number of demonstration problems are supported. In each case, the
top results are printed to stdout, as a score, followed by  a  combo
program.  type may be one of:

cp    Combo program regression. The scoring function is based on the
combo program specified with the -y flag. That is, the goal of
the run is to deduce and learn the specified combo program.

When  specifying  combo  programs with continuous variables in
them, be sure to use the -q, -w and  -b  flags  to  specify  a
range of input values to be sampled. In order to determine the
fitness of any candidate, it must be compared to the specified
combo program.  The comparison is done at a variety of differ‐
ent input values. If the range  of  sampled  input  values  is
inappropriate, or if there are not enough sampled values, then
the fitness function may select unexpected,  undesired  candi‐
dates.
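
For example, to learn a (hypothetical) continuous-valued target program,
sampling 100 input values in the range -1 to 1:

moses -Hcp -y "+(\$1 3)" --min-rand-input=-1 --max-rand-input=1 -b 100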

dj    Disjunction problem. The scoring function awards a result that
is a boolean disjunction (or) of N  boolean-valued  variables.
The resulting combo program should be or($1 $2 ...).  The size
of the problem may be specified with the -k option.
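
For example, the command

moses -Hdj -k3

searches for the three-input disjunction, and should converge on the
solution or($1 $2 $3).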

mux   Multiplex problem. The scoring function models a boolean digi‐
tal  multiplexer,  that  is,  an  electronic  circuit where an
"address" of n bits selects one and only one line, out of  2^n
possible  lines.  Thus,  for example, a single address bit can
select one of two possible lines: the first, if it is false, and
the second, if it is true. The -k option may be used to specify
the value of n.  The actual size of the problem,  measured  in
bits, is n+2^n and so increases exponentially fast.

pa    Even parity problem.  The resulting combo program computes the
parity of k bits, evaluating to true if the  parity  is  even,
else  evaluating  to  false.   The  size of the problem may be
specified with the -k option.

sr    Polynomial   regression   problem.   Given   the    polynomial
p(x)=x+x^2+x^3+...+x^k, this searches for the shortest program
consisting of nested arithmetic  operators  to  compute  p(x),
given  x as a free variable. The arithmetic operators would be
addition, subtraction, multiplication and division; exponenti‐
ation  is not allowed in the solution.  So, for example, using
the -k2 option to specify the order-2 polynomial  x+x^2,  then
the shortest combo program is *(+(1 $1) $1) (that is, the
solution is p(x)=x(x+1) in the usual arithmetical notation).
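
For example, the command

moses -Hsr -k2

searches for the order-2 polynomial, and should converge on the
solution *(+(1 $1) $1).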

-k size, --problem-size=size
Specify the size of the problem.  The interpretation of size depends
on the particular problem type.

-y prog, --combo-program=prog
Specify  the  combo  program to be learned, when used in combination
with the -H cp option.  Thus, for example, -H cp -y  "and(\$1 \$2)"
specifies  that the two-input conjunction is to be learned.  Keep in
mind that \$ is a reserved character in many shells, and thus must be
escaped with a backslash in order to be passed to moses.

COMPLEXITY PENALTY
The speed with which the search algorithm can find a reasonable solution is
significantly affected by the complexity ratio specified with the -z or  -p
options. This section provides the theoretical underpinning for the meaning
of these flags, and how they affect the algorithm.  The complexity
penalty  has  two  slightly different interpretations, depending on whether
one is considering learning a discretely-valued problem (i.e.  boolean-val‐
ued)  or a continuously-valued problem.  The general structure of the argu‐
ment is broadly similar for both cases; they are presented below.   Similar
arguments apply for classification problems (learning to classify data into
one of N categories), and for precision maximization.

Discrete case
Let M be the model to be learned (the combo program).  Let D be  the  data,
assumed  to  be  a table of n inputs i_k and one output o, with each row in
the form:

i_1 ... i_n o

Here, i_k is the k'th input and o the output.  In the below, we write  o  =
D(x) where x=(i_1, ..., i_n) is an input data row.

We  want to assess the probability P(M|D) of the model M conditioned on the
data D.  In particular, we wish to maximize this, as it provides  the  fit‐
ness function for the model.  According to Bayes theorem,

P(M|D) = P(D|M) * P(M) / P(D)

Consider  the log likelihood LL(M) of M knowing D.  Since D is constant, we
can ignore P(D), so:

LL(M) = log(P(D|M)) + log(P(M))

Assume each output of M on row x has probability p of being wrong.  So,

P(D|M) = Prod_{x\in D} [p*(M(x) != D(x)) + (1-p)*(M(x) == D(x))]

where D(x) is the observed result given input x.  Then,

log P(D|M) = Sum_{x\in D} log[p*(M(x) != D(x)) + (1-p)*(M(x) == D(x))]

Let D = D_eq \cup D_ne  where D_eq and D_ne are the sets

D_eq = {x \in D | M(x) == D(x) }
D_ne = {x \in D | M(x) != D(x) }

Then

log P(D|M) = Sum_{x\in D_ne} log(p) + Sum_{x\in D_eq} log(1-p)
= |D_ne| log(p) + |D_eq| log(1-p)
= |D_ne| log(p) + |D| log(1-p) - |D_ne| log(1-p)
= |D_ne| log(p/(1-p)) + |D| log(1-p)

Here, |D| is simply the size of set D, etc.  Assuming that p is small, i.e.
much less than one, then, to second order in p:

log(1-p) = -p - p^2/2 + O(p^3)

So:

log P(D|M) = |D_ne| log(p) - p (|D| - |D_ne|) + O(p^2)

Next,  assume  P(M) is distributed according to Solomonoff's Universal Dis‐
tribution, approximated by (for now)

P(M) = |A|^-|M|
= exp(-|M|*log(|A|))

where A is the alphabet of the model, |A| is the alphabet size, and |M|  is
the  complexity  of the model.  Note that this distribution is identical to
the Boltzmann distribution, for an inverse temperature of log(|A|). Putting
it all together, the log-likelihood of M is:

LL(M) = -|M|*log(|A|) + |D_ne| log(p/(1-p)) + |D| log(1-p)

To  get  an  expression  usable  for a scoring function, just bring out the
|D_ne| by dividing by -log(p/(1-p)), to get

score(M) = - [ LL(M) - |D| log(1-p) ] / log(p/(1-p))
= -|D_ne| + |M|*log|A| / log(p/(1-p))
= -|D_ne| - |M| |C_coef|

Note that, since p < 1/2, log(p/(1-p)) is negative, and so the second term
is negative.  It can be understood as a complexity penalty.  That is, we
define the complexity penalty as

complexity_penalty = |M| |C_coef|

The complexity ratio, as set by the -z option, is given by

complexity_ratio = 1 / |C_coef|

By contrast, the -p option may be used to set p directly, as given in the
formulas above.  The value of |A| is computed internally, depending on the
specific problem type (discrete vs. continuous, the number of included or
excluded operators, etc.).  The complexity |M| of each solution is likewise
computed internally.

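As an illustration of how these quantities interact, here is a minimal
Python sketch (not part of moses itself) that computes the discrete score
for hypothetical values of |D_ne|, |M|, |A| and p:

import math

def discrete_score(n_wrong, model_complexity, alphabet_size, p):
    # Score from the derivation above: -|D_ne| - |M|*|C_coef|,
    # where |C_coef| = log|A| / log((1-p)/p), positive for p < 1/2.
    c_coef = math.log(alphabet_size) / math.log((1 - p) / p)
    return -n_wrong - model_complexity * c_coef

# Example: 3 misclassified rows, program complexity 10, |A| = 4, p = 0.01.
# The -z complexity ratio corresponds to 1/|C_coef|, here about 3.3.
print(discrete_score(3, 10, 4, 0.01))   # approx. -6.02
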
Continuous case
A similar argument to the above holds for the case of a continuously-valued
observable.

Let dP(..) denote a probability density (or measure).  As before, Bayes
theorem gives:
dP(M|D) = dP(D|M) * P(M) / P(D)

Since D is constant, one may ignore P(D), and write the log likelihood of M
given D as:

LL(M) = log(dP(D|M)) + log(P(M))

Assume the output of the model M on input x has a Gaussian distribution of
mean M(x) and variance V, so that dP(D|M), the probability density of the
data D given the model M, is:

dP(D|M) = Prod_{x\in D} (2*Pi*V)^(-1/2) exp(-(M(x)-D(x))^2/(2*V))

As before, assume a model distribution of

P(M) = |A|^-|M|

where  |A| is the alphabet size and |M| the complexity of the model.  After
simplification, and dropping a constant term that does not depend on either
the  model  complexity  or  the  dataset itself (the dataset size is a con‐
stant), one then can deduce a scoring function:

score(M) = -|M|*log(|A|)*2*V - Sum_{x\in D} (M(x)-D(x))^2

As before, the term |M|*log(|A|)*2*V can be interpreted as a complexity
penalty.  Alternately, one may interpret each row x as a feature; the
penalty term |M|*log(|A|)*2*V then plays the role of one additional feature
that must be fit.
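
For concreteness, here is a minimal Python sketch of this continuous score,
with hypothetical values for the variance V and the alphabet size |A| (both
are handled internally by moses):

import math

def continuous_score(residuals, model_complexity, alphabet_size, variance):
    # Score from the derivation above:
    #   -|M|*log(|A|)*2*V - Sum_{x in D} (M(x)-D(x))^2
    sse = sum(r * r for r in residuals)           # sum of squared errors
    penalty = model_complexity * math.log(alphabet_size) * 2 * variance
    return -penalty - sse

# Example: residuals M(x)-D(x) over four rows, |M| = 5, |A| = 4, V = 0.1
print(continuous_score([0.1, -0.2, 0.0, 0.3], 5, 4, 0.1))   # approx. -1.53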

DISTRIBUTED PROCESSING
moses  provides  two different styles of distributed processing for cluster
computing systems.  One style is to use MPI (as implemented  in  the  Open‐
MPI/MPICH2  systems),  the  second  is  to use SSH. The first style is best
suited for local area networks (LANs)  and  compute  clusters.  The  second
style  allows operation over the open Internet, but is more problematic, as
it may require manual cleanup of log files and failed jobs.  Because MPI is
easy  to  install and manage, it is the recommended method for distributing
moses operation across many machines.

When moses is run in distributed fashion, one single node, the  root  node,
maintains control over a pool of workers that execute on remote nodes.  The
root node maintains the set of candidate solutions, and  assigns  these  to
the  workers  for  additional exploration, as the workers become free.  The
results are automatically collected by  the  root,  and  are  automatically
merged  into  the candidate population.  When termination criteria are met,
processing will terminate on all nodes, and the root node will  report  the
merged, best results.

MPI
Using MPI requires a moses binary with MPI support compiled in, with either
the OpenMPI or the MPICH2 implementations.  MPI support in moses is enabled
with the --mpi=1 command-line flag.  When this flag is specified, moses may
be run as usual in an MPI environment.  Details will vary from  one  system
configuration to another, but a typical usage might resemble the following:

mpirun -n 15 --hostfile mpd.hosts moses --mpi=1 -j 12 <other moses params>

The above specifies that the run should be distributed over fifteen MPI
processes, with the participating machines listed in the mpd.hosts file.
The --mpi=1 flag indicates to moses that it is being run in an MPI
environment.  The -j 12 flag tells moses to use up to twelve threads on
each node whenever possible; this example assumes that each node has twelve
cores.

To maximize CPU utilization, it seems best to specify two MPI instances per
node.  This is because not all parts of moses are  parallelized,  and  some
parts are subject to lock contention.  Thus, running multiple instances per
node seems to be an effective way to utilize all available compute power on
that node.  It is almost always the case that moses RAM usage is relatively
small, and so RAM availability is rarely a problem.  The  network  utiliza‐
tion  by moses is also very modest: the only network traffic is the report‐
ing of candidate solutions, and so the network demands are typically in the
range of 1 Megabit per second, and are thus easily supported on an Ethernet
connection.  The moses workload distributes in an 'embarrassingly parallel'
fashion,  and so there is no practical limit to scaling on small and medium
compute clusters.
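
For example, to place two moses instances on each of fifteen twelve-core
nodes, with each instance using six threads, an invocation might resemble
the following (the flag for placing a fixed number of processes per node
varies between MPI implementations; --npernode is the OpenMPI spelling):

mpirun -n 30 --npernode 2 --hostfile mpd.hosts moses --mpi=1 -j 6 <other moses params>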

When performing input data regression, be sure that the input data file  is
available on all nodes.   This is most easily achieved by placing the input
data file on a shared filesystem.  Each instance of moses will write a  log
file.   In  order  to avoid name collision on the log files, the process id
(PID) will automatically be incorporated into the log-file  name  when  the
--mpi=1  option  is  specified.  Log file creation can be disabled with the
-lNONE option; however, this is not recommended, as it makes debugging  and
progress monitoring difficult.

SSH
The  SSH  style of distributed processing uses the ssh command to establish
communications and to control the pool of workers.  Although moses provides
job  control  when  the system is running normally, it does not provide any
mechanism for cleaning up after hung or failed jobs;  this is  outside  the
scope of the ssh implementation.  The use of a job manager, such as LSF, is
recommended.

Remote machines are specified using the -j option, using  the  notation  -j
N:REMOTE_HOST.   Here,  N  is  the  number of threads to use on the machine
REMOTE_HOST.    For   instance,   one   can   enter   the    options    -j4
-j16:my_server.org  (or  -j16:user@my_server.org  if  one wishes to run the
remote job under a different user name), meaning that 4 threads  are  allo‐
cated  on  the local machine and 16 threads are allocated on my_server.org.
Password prompts will appear unless ssh-agent is  being  used.   The  moses
executable  must  be  on  the remote machine(s), and located in a directory
included in the PATH environment variable.  Beware that a lot of log  files
are going to be generated on the remote machines.
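
A full invocation might then resemble the following, where my_server.org,
the user name, and the input file are placeholders for your own values:

moses -j4 -j16:user@my_server.org -i input.csv -u out -W1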

DIVERSITY
moses can enforce diversity in the metapopulation, by taking into account a
diversity pressure during ranking, effectively  pushing  down  semantically
redundant candidates to the bottom of the metapopulation.  This has two
effects:

1. Selected exemplars for new demes are more likely to be diverse.

2. Top candidates output to the user at the end of learning are more likely
to be diverse.

The  diversity  pressure  operates  over the semantics of the candidates by
measuring the distance between the behavioral scores of the current  candi‐
date to be ranked and the top candidates already ranked.  The exact
algorithm is described in Section 1.3 of
https://github.com/opencog/moses/blob/master/moses/moses/documentation/diversity_report_v0.3_disclosable.pdf

Diversity options
Several user options determine how much pressure to apply and how the
distance between behavioral scores is calculated.  If you do not wish to
delve into the details of these options but still want to enforce
diversity, simply set a positive diversity pressure (typically around 0.01)
with the --diversity-pressure option, and enable autoscaling with
--diversity-autoscale=1, as in the example below.
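
For example, the following (with input.csv and out as placeholders for your
own data file and target feature) ranks candidates with a mild, autoscaled
diversity pressure:

moses -i input.csv -u out --diversity-pressure=0.01 --diversity-autoscale=1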

--diversity-pressure=num
Set a diversity pressure on the metapopulation.  Programs behaving
similarly to others are penalized more heavily; this value sets the
strength of that penalty.  It ranges from 0 to +inf.  If --diversity-
autoscale is enabled (recommended), then it typically ranges from 0
to 1.  Default 0.0.

--diversity-autoscale=bool
Automatically  rescale  the --diversity-pressure parameter so that a
pressure may typically range from 0 to 1, regardless of the  fitness
function.  Although  this option is disabled by default for backward
compatibility, it is highly recommended.  Default 0.

--diversity-exponent=num
Set the exponent of the generalized mean (or sum, if --diversity-
normalize is set to 0) aggregating the penalties between a candidate
and the set of all candidates better than itself (taking diversity
into account).  As the value tends towards 0, the mean tends to the
geometric mean; towards +inf, it tends to the max function.  If the
value is negative or zero, the max function is used.  Default -1.0.

--diversity-normalize=bool
If set to 1, then the aggregating function is a generalized mean.
Otherwise it is a generalized sum (generalized mean * number of
elements); see the sketch below.  If --diversity-exponent is set to a
negative value, this option has no effect, as the aggregating
function is then the max anyway.  Default 1.
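
To make the aggregation concrete, here is an illustrative Python sketch of
the generalized (power) mean and sum described above; it is a reading aid,
not the moses implementation:

def generalized_mean(values, exponent):
    # Power mean of positive penalties; exponent <= 0 is treated as the
    # max function, mirroring the --diversity-exponent behaviour above.
    if exponent <= 0:
        return max(values)
    n = len(values)
    return (sum(v ** exponent for v in values) / n) ** (1.0 / exponent)

def generalized_sum(values, exponent):
    # Generalized sum = generalized mean * number of elements
    # (the --diversity-normalize=0 case).
    return generalized_mean(values, exponent) * len(values)

penalties = [0.2, 0.5, 0.9]
print(generalized_mean(penalties, 2.0))   # approx. 0.61
print(generalized_sum(penalties, 2.0))    # approx. 1.82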

--diversity-dst=distance
Set the distance measure between behavioral scores, which is then
used to determine the uniformity penalty.  Three distances are
available: p_norm, tanimoto and angular.  Default p_norm.

--diversity-p-norm=num
Set the parameter of the p_norm distance.  A value of 1.0 corresponds
to the Manhattan distance.  A value of 2.0 corresponds to the
Euclidean distance.  A value of 0.0 or less corresponds to the
component-wise max.  Any other value corresponds to the general case.
Default 2.0.

--diversity-dst2dp=type
Set the type of function used to convert a distance into a penalty.
Four options are available: auto, inverse, complement and power.
When auto is selected, the function is chosen according to the
distance: if the distance is p_norm, then inverse is selected;
otherwise complement is selected.  Default auto.
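
The following Python sketch illustrates the shape of the p_norm distance
and a schematic distance-to-penalty conversion; the exact conversion
formulas used internally by moses are not reproduced here, so the 1/(1+d)
form below is purely illustrative:

def p_norm_distance(bs1, bs2, p=2.0):
    # Distance between two behavioral scores (--diversity-dst=p_norm).
    # p <= 0 gives the component-wise max, as described above.
    diffs = [abs(a - b) for a, b in zip(bs1, bs2)]
    if p <= 0:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def inverse_penalty(distance):
    # Schematic 'inverse' conversion: nearby (similar) candidates get a
    # large penalty, distant (diverse) ones a small penalty.
    return 1.0 / (1.0 + distance)

print(p_norm_distance([1, 0, 1, 1], [1, 1, 0, 1]))   # approx. 1.41
print(inverse_penalty(1.41))                         # approx. 0.41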

TODO
Finish documenting these algo flags: --fs-scorer -M

-R discretize target var

These input flags: -G

Interesting patterns flags: -J -K -U -X