# MOSES man page

Man page for MOSES (raw dump).

 MOSES(1)                       OpenCog Learning                       MOSES(1)

NAME
moses - meta-optimizing semantic evolutionary search solver

SYNOPSIS
moses -h|--help
moses --version
moses   [-H   problem_type]  [-i  input_filename]  [-u  target_feature]
[--enable-fs=1] [-Y ignore_feature] [-m max_evals] [--max-time seconds]
[-f logfile] [-l loglevel] [-r random_seed] [--mpi=1] [-n ignore_opera‐
tor] [-N use_only_operator] [-o output_file] [--boost]  [-A  max_score]
[-a  algo]  [-B  knob_effort] [-b nsamples] [-C1] [-c result_count] [-D
max_dist] [-d1] [-E reduct_effort] [-e  exemplar]  [-F]  [-g  max_gens]
[-I0]  [-j jobs] [-k problem_size] [-L1] [-P pop_size_ratio] [-p proba‐
bility] [--python=1] [-Q hardness] [-q min] [-R threshold] [-S0]  [-s1]
[--score-weight  column]  [-T1]  [-t1] [-V1] [-v temperature] [-W1] [-w
max] [--well-enough=1] [-x1]  [-y  combo_program]  [-Z1]  [-z  complex‐
ity_ratio]

DESCRIPTION
moses  is  the  command-line  wrapper  to  the  MOSES  program learning
library. It may be used to solve learning  tasks  specified  with  file
inputs, or to run various demonstration problems.  Given an input table
of values, it will perform a regression, generating, as output, a combo
program  that  best  fits the data. moses is multi-threaded, and can be
distributed across multiple machines to improve run-time performance.

EXAMPLE
As an example, the input file of comma-separated values:

a,b,out
0,0,0
0,1,0
1,0,0
1,1,1

when run with the flags -Hit -u out -W1, will generate the combo program
and($a $b).  This program indicates that the third column can be
modelled as the boolean-and of the first two.  The -u option specifies
that it is the third column that should be modelled, and the -W1 flag
indicates that column labels, rather than column numbers, should be
printed on output.
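
Assuming the table above has been saved to a file named input.csv (any
filename will do), the complete invocation would resemble:

moses -Hit -i input.csv -u out -W1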

OVERVIEW
moses  implements  program  learning by using a meta-optimization algo‐
rithm. That is, it uses two optimization algorithms, one wrapped inside
the other, to find solutions.  The outer optimization algorithm selects
candidate programs (called exemplars), and then  creates  similar  pro‐
grams  by  taking an exemplar and inserting variables (called knobs) in
selected locations. The inner optimization algorithm then searches  for
the best possible 'knob settings', returning a set of programs (a deme)
that provide the best fit to the data. The outer optimization algorithm
then selects new candidates, and performs a simplification step, reduc‐
ing these to a normal form, to create new exemplars.   The  process  is
then  repeated  until  a  perfect score is reached, a maximum number of
cycles is performed, or forward progress is no longer being made.

Program reduction is performed at two distinct stages: after  inserting
new  knobs,  and  when candidates are selected.  Reduction takes a pro‐
gram, and transforms it into a new form, the so-called  elegant  normal
form,  which,  roughly speaking, expresses the program in the smallest,
most compact form possible.  Reduction is  an  important  step  of  the
algorithm,  as  it  prevents  the  evolved  programs  from turning into
'spaghetti code', avoiding needless complexity  that  can  cost  during
program evaluation.

The  operation  of  moses  can  be modified by using a large variety of
options to control different parts of the  above  meta-algorithm.   The
inner  optimization  step  can  be  done using one of several different
algorithms, including hillclimbing, simulated annealing, or  univariate
Bayesian  optimization.   The  amount of effort spent in the inner loop
can be limited, as compared to how frequently new demes  and  exemplars
are  chosen.  The  total number of demes or exemplars maintained in the
population may be controlled.  The effort expended on  program  normal‐
ization at each of the two steps can be controlled.

In addition to controlling the operation of the algorithm, many options
are used to describe the format of the input data, and to  control  the
printing of the output.

moses also has a 'demo' mode: it has a number of built-in scoring func‐
tions for various 'well-known' optimization problems, such  as  parity,
disjunction,  and multiplex. Each of these typically presents a differ‐
ent set of challenges to the optimization algorithm, and  are  used  to
illustrate  'hard' optimization problems.  These demo problems are use‐
ful both for learning and understanding the use of moses, and also  for
developers  and  maintainers of moses itself, when testing new features
and enhancements.

COMBO PROGRAMS
This section provides a brief overview of the  combo  programming  lan‐
guage.   Combo is not meant so much for general use by humans for prac‐
tical programming, as it is to fit the needs of program  learning.   As
such, it is fairly minimal, while still being expressive enough so that
all common programming constructs are simply represented in combo.   It
is  not difficult to convert combo programs into other languages, or to
evaluate them directly.

In combo, all programs are represented as program trees. That is,  each
internal  node  in  the tree is an operation and leaves are either con‐
stants or variables. The set of all of the variables in a  program  are
taken as the arguments to the program. The program is evaluated by sup‐
plying explicit values for the variables, and then applying  operations
until the root node is reached.

By  convention,  variables  are  denoted by a dollar-sign followed by a
number, although they may also be named. Variables are  not  explicitly
typed;  they are only implicitly typed during program evaluation. Thus,
for example, or($1 $2) is a combo program that represents the boolean-
or of two inputs, $1 and $2.  Built-in constants include true and
false; numbers may be written with or without a decimal point.

Logical operators include and, or and not.

Arithmetic operators include +, -, * and / for addition, subtraction,
multiplication  and  division.  Additional arithmetic operators include
sin, log, exp and rand. The rand operator returns a value in the  range
from 0.0 to 1.0.  The predicate operator 0< (greater than zero) takes a
single continuous-valued expression, and returns a boolean. The  opera‐
tor impulse does the opposite: it takes a single boolean-valued expres‐
sion, and returns 0.0 or 1.0, depending on whether the argument  evalu‐
ates to false or true.

The  cond operator takes an alternating sequence of predicates and val‐
ues, terminated by an 'else' value.  It evaluates as  a  chain  of  if-
then-else  statements, so that it returns the value following the first
predicate to evaluate to true.

Thus, as an example, the expression and(0<(+($x 3)) 0<($y)) evaluates
to true if $x is greater than -3 and $y is greater than zero.
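
Similarly, the expression cond(0<($x) 1 2) evaluates to 1 if $x is
positive, and to 2 otherwise; here, 2 is the trailing 'else' value.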

The  above operators are built-in; combo may also be extended with cus‐
tom-defined operators, although C++ programming  and  recompilation  is
required for this.

OPTIONS
Options  fall  into four classes: those for specifying inputs, control‐
ling operation, printing output, and running moses in demo mode.

General options
-e exemplar
Specify an initial exemplar to use. Each exemplar is written  in
combo. May be specified multiple times.

-h, --help
Print command options, and exit.

-j num, --jobs=num
Allocate  num  threads  for  deme  optimization.  Using multiple
threads will speed the search on  a  multi-core  machine.   This
flag  may also be used to specify remote execution; see the sec‐
tion Distributed processing below.

--version
Print program version, and exit.

Problem-type options
MOSES is able to handle a variety of different 'problem types', such as
regression,  categorization and clustering, as well as a number of demo
problems, such as parity and factorization.  The -H option is  used  to
specify  the problem type; the demo problem types are listed in a later
section. This section lists the various  "table"  problems,  where  the
goal is to train moses on an input table of values.

-H type, --problem-type=type
The type of problem may be one of:

it       Regression on an input table.  That is, the input table
consists of a set of columns, all  but  one  considered
'inputs', and one is considered an output.  The goal of
regression is to learn a combo program that most  accu‐
rately  predicts  the  output.   For boolean-valued and
enumerated outputs, the scoring function simply  counts
the  number of incorrect answers, and tries to minimize
this score.  That is, this scorer attempts to  maximize
accuracy  (defined  below).  For contin-valued outputs,
the mean-square variation is minimized.

pre      Regression on an  input  table,  maximizing  precision,
instead  of accuracy (that is, minimizing the number of
false positives, at the risk of  sometimes  failing  to
identify  true  positives).  Maximization is done while
holding activation constant.  This scorer is ideal  for
learning  "domain experts", which provide a result only
when they are certain that the result is correct.   So,
for  a  binary  classifier,  the  output is meant to be
interpreted as "yes, certain" or "no, don't know".   An
ensemble of such combo expressions is commonly called
a "mixture of experts".

prerec   Regression on an  input  table,  maximizing  precision,
while  attempting  to maintain the recall (sensitivity)
at or above a given level.

recall   Regression on an input table, maximizing recall (sensi‐
tivity)  while  attempting to maintain the precision at
or above a given level.  This scorer is  most  commonly
used  when it is important to guarantee a certain level
of precision, even if it means rejecting  most  events.
In  medicine  and physics/radio applications, recall is
exactly the same  thing  as  sensitivity:  this  option
searches for the most sensitive test while holding to a
minimum level of precision.

bep      Regression on an input table, maximizing the arithmetic
mean  of  the  precision  and recall, also known as the
"break-even point" or BEP.

f_one    Regression on an input table, maximizing  the  harmonic
mean  of  the  precision  and  recall, that is, the F_1
score.

ip       Discovery of "interesting predicates" that select  rows
from the input table. The data table is assumed to con‐
sist of a number of boolean-valued input columns, and a
contin-valued  (floating  point)  target  column. moses
will learn predicates that select the  most  "interest‐
ing"  subset  of  the rows in the table.  The values in
the output columns of the selected rows form  a  proba‐
bility distribution (PDF); this PDF is considered to be
"interesting" if it maximizes a linear  combination  of
several different measures of the PDF: the Kull‐
back-Leibler divergence, the skewness,  and  the  stan‐
dardized Mann-Whitney U statistic.

kl       Regression  on  an input table, by maximizing the Kull‐
back-Leibler divergence between the distribution of the
outputs.   That  is,  the  output  must  still be well-
scored, but it is assumed that there are many  possible
maxima.

ann-it   Regression on an input table, using a neural network.

For  boolean-valued  data tables, the scorers make use of the following
generally-accepted, standard definitions:

TP     True-positive: the sum of all rows for  which  the  combo  model
predicts T for a data-table row marked T.

FP     False-positive:  the  sum  of all rows for which the combo model
predicts T for a data-table row marked F.

FN     False-negative: the sum of all rows for which  the  combo  model
predicts F for a data-table row marked T.

TN     True-negative:  the  sum  of  all rows for which the combo model
predicts F for a data-table row marked F.

accuracy
Defined as the formula (TP + TN) / (TP + TN + FP + FN)

activation
Defined as the formula (TP + FP) / (TP + TN + FP + FN)

precision
Defined as the formula TP / (TP + FP)

recall Also known as sensitivity, this is defined as the formula  TP  /
(TP + FN)

F_1    Harmonic mean of precision and recall, defined as the formula (2
* precision * recall) / (precision + recall) = 2TP / (2TP + FP +
FN)

bep    Break-even  point,  the arithmetic mean of precision and recall,
defined as the formula (precision + recall) / 2

Input specification options
These options control how input data is specified and interpreted.   In
its primary mode of operation, moses performs regression on a table
of input data. One column is designated as the  target,  the  remaining
columns  are  taken as predictors.  The output of regression is a combo
program that is a function of the predictors, reproducing the target.

Input files should consist of ASCII data, separated by commas or white‐
space.   The  appearance of # ; or ! in the first column denotes a com‐
ment line; this line will be ignored. The first non-comment row, if  it
is  also non-numeric, is taken to hold column labels. The target column
may be specified using the -u option with a column name.  The  printing
of column names on output is controlled with the -W1 flag.

-i filename, --input-file=filename
The filename specifies the input data file. The input table must
be in 'delimiter-separated value' (DSV) format.   Valid  separa‐
tors are comma (CSV, or comma-separated values), blanks and tabs
(whitespace). Columns correspond to features; there is one  sam‐
ple  per  (non-blank) row. Comment characters are hash, bang and
semicolon (#!;); lines starting with a comment character are ignored.  The
-i  flag  may  be specified multiple times, to indicate multiple
input files. All files must have the same number of columns.

-u column, --target-feature=column
The column is used as the target feature to fit.  If  no  column
is  specified, then the first column is used.  The column may be
numeric, or it may be a column label.  If it is numeric,  it  is
taken  to  be  the number of the column, with column 1 being the
left-most.  If column begins with an alphabetic character, it is
taken to be a column label.  In this case, the first non-comment
row of the input file must contain column labels.
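Thus, for the table shown in the EXAMPLE section above, -u out and
-u 3 are equivalent: both select the third column, labelled 'out',
as the target.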

-Y column, --ignore-feature=column
The column should be ignored, and not used  as  input.   Columns
are specified as above.  This option may be used multiple times,
to ignore multiple columns.

--score-weight=column
The column is used to weight the score for  each  row.  If  this
option  is  not  used,  then  each  row in the table contributes
equally to the evaluation of the accuracy of the learned  model.
However,  if  some  rows  are  more important than others to get
right, this option can be used to indicate those rows. The accu‐
racy of the model will be weighted by this number, when evaluat‐
ing the score.  A weight of zero effectively causes the  row  to
be ignored. A negative weight encourages the system to learn mod‐
els that get the row incorrect.  For boolean problems,  this  is
the  same as flipping the output value.  This option can only be
used once, and, if used, it should specify a  column  containing
an integer or floating-point value.
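
For example, assuming the input table carries an extra numeric column
named weight (the name is arbitrary), the rows can be weighted with:

moses -Hit -i input.csv -u out --score-weight=weight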

-b num, --nsamples=num
The  number  of  samples  to be taken from the input file. Valid
values run between 1 and the number of rows in  the  data  file;
other  values  are  ignored.  If this option is absent, then all
data rows are used.  If this option is present, then  the  input
table is sampled randomly to reach this size.

-G1, --weighted-accuracy=1
This option is only used for the discretize_contin_bscore (when
--discretize-threshold is used). If enabled, then the score
corresponds to weighted accuracy.  This is useful for unbalanced data.

--balance=1
If  the table has discrete output type (like bool or enum), bal‐
ance the resulting ctable so all classes have the same weight.

Algorithm control options
These options provide overall control  over  the  algorithm  execution.
The  most important of these, for controlling behavior, are the -A, -a,
-m, --max-time, -r, -v and -z flags.

-a algorithm, --algo=algorithm
Select the algorithm to apply to a single  deme.   This  is  the
algorithm used in the 'inner loop': given a single exemplar dec‐
orated with tunable knobs, this algorithm searches for the  best
possible  knob settings.  Once these are found (or a timeout, or
other termination condition is reached), control is returned  to
the outer optimization loop.  Available algorithms include:

hc   Hill-climbing.  There  are  two primary modes of operation;
each has strengths and  weaknesses  for  different  problem
types.   In  the  default  mode, one begins with an initial
collection of knob settings, called an instance.  The set‐
ting of each knob is then varied, in turn, until one set‐
ting is found that most improves the score. This setting
then becomes the new instance, and the process is repeated,
until  no  further  improvement  is  seen.  The   resulting
instance  is  a  local maximum; it is returned to the outer
loop.

The alternate mode of operation is triggered by  using  the
-L1  flag  (usually  with  the  -T1 flag). In this case, as
before, all knob settings are explored, one knob at a time.
After  finding  the  one knob that most improves the score,
the algo is done, and the resulting instance is returned to
the  outer  loop.  If  no knob settings improved the score,
then all possible settings of two knobs are  explored,  and
then  three, etc. until improvement is found (or the allot‐
ted iterations are exceeded).  In this alternate mode,  the
local hill is not climbed to the top; instead, any improve‐
ment is immediately handed back  to  the  outer  loop,  for
another round of exemplar selection and knob-building.  For
certain types of problems, including  maximally  misleading
problems,   this  can  arrive  at  better  solutions,  more
quickly,  than  the  traditional  hill-climbing   algorithm
described above.

sa   Simulated  annealing.   (Deprecated).  The -D flag controls
the size of the neighborhood that is  searched  during  the
early,  "high-temperature"  phase.   It  has  a significant
effect on the run-time performance of the algorithm.  Using
-D2 or -D3 is likely to provide the best performance.

The  current  implementation of this algorithm has numerous
faults, making it unlikely to work well for most problems.

un   Univariate Bayesian dependency.

-A score, --max-score=score
Specifies the ideal score for a desired solution; used to termi‐
nate  search.   If the maximum number of evaluations has not yet
elapsed (set with the -m option), and a  candidate  solution  is
found that has at least this score, then search is terminated.

-m num, --max-evals=num
Perform  no  more  than num evaluations of the scoring function.
Default value is 10000.

--max-time=secs
Run the optimizer for no longer than secs  seconds.   Note  that
timing is polled at only a small number of points in the algo‐
rithm; thus, actual execution time might exceed this limit by  a
few  seconds,  or  even  many  minutes, depending on the problem
type.  In particular, knob-building time is  not  accounted  for
right  away,  and  thus  problems with a long knob-building time
will exceed this limit.  If using this option, be  sure  to  set
the  -m  option  to  some very large value.  Default value is 42
years.
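
For example, to stop after roughly one hour, while setting -m large
enough that it does not interfere, a typical invocation might resemble:

moses -Hit -i input.csv -u out --max-time=3600 -m 100000000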

-n oper, --ignore-operator=oper
Exclude the operator  oper  from  the  program  solution.   This
option may be used several times.  Currently, oper may be one of
div, sin, exp, log, impulse or a variable #n.  You may  need  to
put variables under double quotes.  This option has priority
over the -N option.  That is, if an operator is both included
and ignored, then it is ignored.  This option does not work with
ANN.

--linear-regression=1
When attempting  to  fit  continuous-valued  features,  restrict
searches to linear expressions only; that is, do not use polyno‐
mials in the fit.  Specifying  this  option  also  automatically
disables the use of div, sin, exp and log.  Note that polynomial
regression results in search spaces  that  grow  combinatorially
large in the number of input features; that is, for N features,
a quadratic search will entail  O(N^2)  possibilities,  a  cubic
search  will explore O(N^3) possibilities, and so on.  Thus, for
any problem with more than a few dozen or a hundred features, linear
regression is recommended.

-r seed, --random-seed=seed
Use  seed as the seed value for the pseudo-random number genera‐
tor.

-s1, --enable-cache=1
Enable memoization of candidate scores.  This allows the  number
of  scoring function evaluations to be reduced, by maintaining a
cache of recently scored candidates. If a new candidate is found
in  the cache, that score is used, instead of a scoring function
evaluation.   The  effectiveness  of  memoization   is   greatly
increased by also using the -d1 flag.

-v temperature, --complexity-temperature=temperature
Set the "temperature" of the Boltzmann-like distribution used to
select the next exemplar out of the metapopulation.  A  tempera‐
ture  that  is too high or too low will make it likely that poor
exemplars will be chosen  for  exploration,  thus  resulting  in
excessively  long  search  times.   The  recommended temperature
depends strongly on the type of problem being solved.  If it  is
known  that  the problem has false maxima, and that the distance
from the top of the false maximum to the saddle  separating  the
false and true maxima is H, then the recommended temperature is
30*H.  Here, H is the 'height', that is, the difference in score
between the top of the false maximum and the saddle. Varying the temperature by a
factor of 2 or 3 from this value won't affect results much.  Too
small a temperature will typically lead to  the  system  getting
trapped at a local maximum.

The demo parity problem works well with a temperature of 5,
whereas the demo Ant trail problem requires a temperature of
2000.
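
For example, the parity demo (see Demo options below) might be run
with this temperature as:

moses -Hpa -k3 -v5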

-z ratio, --complexity-ratio=ratio
Fix  the  ratio of score to complexity, to be used as a penalty,
when ranking the metapopulation for fitness. This ratio is meant
to be used to limit the size of the search space, and, when used
with an appropriate temperature, to avoid getting trapped in
local maxima.

Roughly  speaking,  the size of the search space increases expo‐
nentially with the complexity of the combo trees being explored:
more complex trees means more of them need to be examined.  How‐
ever, more complex trees typically result in higher scores.   If
an  increase  of  N bits in the complexity typically leads to an
increase of s points of the score,  then  the  complexity  ratio
should  be  set  to  about N/s.  In this way, the exploration of
more complex trees is penalized by an amount roughly comparable
to  the  chance  that  such complicated trees actually provide a
better solution.
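
As a worked example: if candidates that are about 6 bits more complex
typically score about 2 points higher, then the ratio should be set
to roughly 6/2 = 3, i.e. -z3.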

The complexity ratio is used to calculate a scoring penalty; the
penalty  lowers the score in proportion to the solution complex‐
ity; specifically, the penalty is set to the complexity  divided
by the complexity ratio.

Setting  the  ratio  too  low causes the algorithm to ignore the
more complex solutions, ranking them in a way so that  they  are
not much explored. Thus, the algorithm may get trapped examining
only the simplest solutions, which are probably inappropriate.

Setting this ratio too high will prevent a  good  solution  from
being  found.   In such cases, the algorithm will spend too much
time evaluating overly-complex solutions, blithely leaving sim‐
pler, better solutions unexplored.

The  relationship  between  the  score change and the complexity
change is very strongly data-dependent, and must (currently)  be
manually determined (although it might be possible to measure it
automatically).  Input data tables with lots of almost-duplicate
data may have very low ratios; complex problems with sparse data
may have very high ratios.  Initial recommended values would  be
in  the  range from 1 to 5; with 3.5 as the default.  The parity
demo problem works well with the 3.5 default, the Ant trail demo
problem works well with 0.16.

-Z1, --hc-crossover=1
Controls   hill-climbing  algorithm  behavior.   If  false  (the
default), then the entire local neighborhood of the current cen‐
ter  instance  is explored. The highest-scoring instance is then
chosen as the new center instance, and the process is  repeated.
For  many  datasets, however, the highest-scoring instances tend
to cluster together, and so an  exhaustive  search  may  not  be
required.  When this option is specified, a handful of the high‐
est-scoring instances are crossed-over (in the genetic sense  of
cross-over)  to  create new instances.  Only these are evaluated
for fitness; the exhaustive search step is  skipped.   For  many
problem  types,  especially those with large neighborhoods (i.e.
those with high program complexity), this can lead to an  order-
of-magnitude  speedup,  or more.  For other problem types, espe‐
cially those with deceptive scoring  functions,  this  can  hurt
performance.

--boost=1
Enables  the use of boosting. Currently, only the AdaBoost algo‐
rithm is implemented.  If boosting is enabled, then  the  system
focuses on learning combo programs that correctly classify those
values that were mis-classified in the previous round of  learn‐
ing.  The output is a set of weighted combo trees.

At this time, boosting can only be used to learn binary classi‐
fiers: that is, to learn problems in which the scorer can
provide a simple true-or-false answer for each row.

Boosted dynamic feature selection
Problems  with a large number of input features (typically, hundreds or
more) can lead to excessively long run-times, and overwhelming  amounts
of  memory usage.  Such problems can be tamed in two different ways: by
static feature pre-selection (i.e. by giving moses  fewer  features  to
work  with) or by reducing the number of features considered during the
knob-building stage.

Recall that, as moses runs, it iterates over two loops: an  outer  loop
where  an  exemplar  is selected and decorated with knobs, and an inner
loop, where the knobs are 'turned', searching for  the  best  knob-set‐
tings.   Each  knob  corresponds  to  one feature added to a particular
location in the exemplar tree.  A single, given feature  will  then  be
used to build a number of different knobs, each in a different place in
the tree.  As trees get larger, there are more places in each tree that
can  be decorated with knobs; thus, the number of knobs, and the search
space, increases over time.  If there are M places in the tree that can
be decorated, and there are N features, then M*N knobs will be created.
The search space is exponential in the number of knobs;  so,  e.g.  for
boolean problems, three knob settings are explored: present, absent and
negated, leading to a search space of 3^(M*N).   Worse,  the  resulting
large trees also take longer to simplify and evaluate.  Clearly, limit‐
ing the number of knobs created can be a good strategy.
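
As a worked example: a tree offering M=20 decoratable locations, with
N=100 input features, will be decorated with 20*100 = 2000 knobs,
giving a nominal search space of 3^2000 instances for a boolean
problem.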

This can  be  done  with  dynamic,  boosted  feature  selection.   When
enabled,  a feature-selection step is performed, to find those features
that are most likely to improve on the exemplar.  These are the fea‐
tures  that  will  be strongly (anti-)correlated with incorrect answers
from the currently-selected exemplar.  By  using  only  these  features
during  knob  building,  the  total  number  of  knobs  can  be sharply
decreased.  The resulting size of the decorated tree is smaller, and the
total search space is much smaller.

Dynamic,  boosted  feature  selection  differs from static feature pre-
selection in many important ways.  In dynamic selection, the total num‐
ber  of features available to moses is not reduced: just because a fea‐
ture was not used in a given round of  knob-decoration  does  not  mean
that it won't be used next time.  Dynamic feature selection also resem‐
bles boosting, in that it focuses on fixing the wrong  answers  of  the
current  exemplar.   By contrast, static pre-selection permanently dis‐
cards features, making them unavailable to moses; it is also unable  to
predict  which  features are the most likely to be useful during itera‐
tion.

Boosted, dynamic, feature selection is enabled with  the  --enable-fs=1
option.   The  number of features to use during knob building is speci‐
fied using the --fs-target-size option.  A number of  additional  flags
control  the  behaviour  of  the feature selection algorithm; these are
best left alone; the defaults should be adequate for almost  all  prob‐
lems.   The  man page for the feature-selection command describes these
in greater detail.
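
For example, a run that restricts knob building to the twenty most
promising features might resemble:

moses -Hit -i input.csv -u out --enable-fs=1 --fs-target-size=20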

--enable-fs=1
Enable integrated feature selection.  Feature selection is  dis‐
abled by default.

--fs-target-size=num
Select num features for use.  This argument is mandatory if fea‐
ture selection is enabled.

--fs-algo
Choose the feature-selection algorithm.   Possible  choices  are
simple,  inc,  smd, hc and random.  The default value is simple,
which is the fastest and best algo.  The simple  algorithm  com‐
putes  the  pair-wise mutual information (MI) between the target
and each feature, and then selects the list of num features with
the  highest  MI.   It is strongly recommended that this algo be
used.  The smd algorithm implements  "stochastic  mutual  depen‐
dency",  by  computing the MI between a candidate featureset and
the target.  Starting with an initially empty  featureset,  fea‐
tures  are randomly added one-by-one, with the MI then computed.
Only the featureset with the highest MI is kept; the process  is
then  repeated  until  the  featureset has the desired number of
features in it, or the MI has stopped increasing.  Note that the
smd algorithm effectively prevents redundant features from being
added to the featureset.  Note that smd runs orders of magnitude
more  slowly  than  simple, and probably does not provide better
results.  The inc is similar to the smd  algo,  except  that  it
adds  many  features,  all  at once, to the featureset, and then
attempts to find  the  redundant  features  in  the  featureset,
removing  those.  This is iteratively repeated until the desired
number of features is obtained.  Note that inc  runs  orders  of
magnitude more slowly than simple, and probably does not provide
better results.  The hc algorithm runs  moses  hill-climbing  to
discover those features most likely to appear.  The random algo‐
rithm selects features randomly.  It is useful only for limiting
the number of knobs created, but not otherwise slanting the
choice of features.

--fs-threshold=num
Set the minimum threshold for selecting a feature.

--fs-inc-redundant-intensity=fraction
When using the inc algorithm, set the threshold to reject redun‐
dant features.

--fs-inc-target-size-epsilon=tolerance
When using the inc algorithm, set the smallest step size used.

--fs-inc-interaction-terms=num_terms
When  using the inc algorithm, set the number of terms used when
computing the joint entropy.

--fs-hc-max-score
TODO write description

--fs-hc-confidence-penalty-intensity
TODO write description

--fs-hc-max-evals
TODO write description

--fs-hc-fraction-of-remaining
TODO write description

Large problem parameters
Problems with a large number of features (100 and above)  often  evolve
exemplars  with  a  complexity  of  100 or more, which in turn may have
instances with hundreds of thousands of nearest  neighbors.   Exploring
one  nearest  neighbor requires one evaluation of the scoring function,
and so an exhaustive search can be prohibitive.  A partial  search  can
often work quite well, especially when cross-over is enabled.  The fol‐
lowing flags control such partial searches.

--hc-max-nn-evals=num
Controls hill-climbing algorithm behavior.  When  exploring  the
nearest  neighborhood  of an instance, num specifies the maximum
number of nearest neighbors to explore.  An exhaustive search of
the nearest neighborhood is performed when the number of nearest
neighbors is less than this value.

--hc-fraction-of-nn=frac
Controls hill-climbing algorithm behavior.   When exploring  the
nearest  neighborhood  of an instance,  frac specifies the frac‐
tion of nearest neighborhood to explore.   As  currently  imple‐
mented,  only  an  estimate  of the nearest-neighborhood size is
used, not the true size.  However, this estimate is accurate  to
within  a  factor of 2.  Thus, to obtain an exhaustive search of
the entire neighborhood, set this to 2.0 or larger.

Algorithm tuning options
These options allow the operation of the algorithm to be fine-tuned for
specific  applications.   These  are "advanced" options; changing these
from the default is likely to worsen algorithm behavior in all but cer‐
tain special cases.

-B effort, --reduct-knob-building-effort=effort
Effort  allocated  for reduction during the knob-building stage.
Valid values are in the range 0-3, with 0 standing  for  minimum
effort, and 3 for maximum effort. Larger efforts result in demes
with fewer knobs, thus lowering the  overall  dimension  of  the
problem.  This  can  improve performance by effectively reducing
the size of the problem.  The default effort is 2.

-D dist, --max-dist=dist
The maximum radius of the neighborhood around  the  exemplar  to
explore.  The default value is 4.

-d1, --reduce-all=1
Reduce  candidates  before  scoring  evaluation. Otherwise, only
dominating candidates are reduced, just before  being  added  to
the metapopulation.  This flag may be useful if scoring function
evaluation expense depends strongly on the structure of the
candidate.  It  is  particularly  important to specify this flag
when memoization is enabled (with -s1).

-E effort, --reduct-candidate-effort=effort
Effort allocated for reduction of candidates. Valid  values  are
in  the range 0-3, with 0 standing for minimum effort, and 3 for
maximum effort. For certain very symmetric problems, such as the
disjunct  problem,  greater  reduction can lead to significantly
faster solution-finding.  The default effort is 2.

-g num, --max-gens=num
Create and optimize no more than num  demes.   Negative  numbers
are  interpreted as "unlimited". By default, the number of demes
is unlimited.

If specified, the "dominated" members of the metapopulation will
be  discarded.   A  member  of the metapopulation is "dominated"
when some existing member of the metapopulation scores better on
*every*  sample in the scoring dataset. Naively, one might think
that an individual that does worse, in every  possible  way,  is
useless,  and can be safely thrown away.  It turns out that this
is a bad assumption; dominated individuals,  when  selected  for
deme  expansion,  often have far fitter off-spring than the off-
spring of the top-scoring (dominating) members of the  metapopu‐
lation.  Thus,  the "weak", dominated members of the metapopula‐
tion are important for ensuring the vitality of the  metapopula‐
tion  as a whole, and are discarded only at considerable risk to
the future adaptability of the overall population.  Put  another
way: specifying this flag makes it more likely that the metapop‐
ulation will get trapped in a non-optimal local maximum.

Note that the algorithms to compute domination are quite slow,
and so keeping dominated individuals has a double benefit: not
only is the metapopulation healthier, but metapopulation manage‐
ment runs faster.

-L1, --hc-single-step=1
Single-step, instead of hill-climbing to the top of a hill. That
is, a single uphill step is taken, and the resulting best  demes
are folded back into the metapopulation.  Solving then continues
as usual. By default, the hill-climbing algorithm does not  sin‐
gle-step;  it  instead  continues  to the top of the local hill,
before folding the resulting demes back into the metapopulation.
If  using  this  flag,  consider using the -T1 flag to allow the
search to be widened, so that if the initial exemplar is already
at  the  top  of  a local hill, a search is made for a different
(taller) hill.

-N oper, --include-only-operator=oper
Include the operator oper, but exclude others, in the  solution.
This option may be used several times to specify multiple opera‐
tors.  Currently, oper may be one of plus, times, div, sin, exp,
log,  impulse  or a variable #n.  Note that variables and opera‐
tors are treated separately, so that including only some  opera‐
tors will still include all variables, and including only some
variables will still include all operators.  You may need to put
variables  under  double quotes.  This option does not work with
ANN.

-P num, --pop-size-ratio=num
Controls the amount of time spent on a deme. Default value is 20.

-p fraction, --noise=fraction
This option provides an alternative means of  setting  the  com‐
plexity  ratio.  If specified, it over-rides the -z option.  For
discrete problems, fraction can  be  interpreted  as  being  the
fraction  of  score values that are incorrect (e.g. due to noisy
data).  As such, only values in the range 0 < fraction < 0.5 are
meaningful  (i.e.  less  than half of the data values are incor‐
rect).  Typical recommended values are in the range of 0.001  to
0.05.   For continuous-valued problems, it can be interpreted as
the standard deviation of a  Gaussian  noise  in  the  dependent
variable.

For the discrete problem, the complexity ratio is related to the
fraction  p by the explicit formula:

complexity_ratio = - log(p/(1-p)) / log |A|

where |A| is the (problem-dependent) alphabet size.   See  below
for a detailed explanation.
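
As a worked example, taking p = 0.05 and, purely for illustration, an
alphabet size of |A| = 3, the formula gives complexity_ratio =
-log(0.05/0.95) / log(3), which is approximately 2.94/1.10, or about
2.7, consistent with the recommended range given above.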

-T1, --hc-widen-search=1
Controls   hill-climbing  algorithm  behavior.   If  false  (the
default), then deme search terminates when a  local  hilltop  is
found. If true, then the search radius is progressively widened,
until another termination condition is met.  Consider using  the
-D flag to set the maximum search radius.

--well-enough=1
For  problems with an enumerated ('nominal') output, the learned
combo program is always of the form cond(pred_1  value_1  pred_2
value_2  ...  pred_n value_n else_val)  where pred_1 is a predi‐
cate, which, if true, causes  the  output  to  be  value_1.   If
false,  then pred_2 is tested, and so on.  If none of the predi‐
cates evaluate to true, then the value of the cond expression is
the else_val.  The well-enough algorithm attempts to find predi‐
cates that maximize precision, the point being that  if  a  per‐
fectly  precise  pred_1  can be found, then it can be left alone
('leave well-enough alone'), thus simplifying the  remainder  of
the  search  problem.  Performing this evaluation is costly, and
may lead to a slow-down, without improving overall accuracy.

Output control options
These options control the displayed output.  Note that, by default, the
ten  best  solutions are printed. These are printed in penalized-score-
sorted order, but the actual printed score  is  the  raw,  un-penalized
score. This can lead to the printed list seeming to be in random order,
which can occur if the penalties are high.  The penalties can be too
high if the complexity ratio is set too low, or the temperature is set
too low.

-C1, --output-dominated=1
Print all of the final metapopulation, and not just the highest-
scoring candidates.

-c count, --result-count=count
The  number  of  non-dominated (best) results to return, ordered
according to score. If negative, then all results are  returned,
including the dominated results.

-f filename, --log-file=filename
Write debug log traces to filename. If not specified, traces are
written to moses.log.

-F, --log-file-dep-opt
Write debug log traces to a filename constructed from the passed
option  flags  and  values.  The filename will be truncated to a
maximum of 255 characters.

-l loglevel, --log-level=loglevel
Specify the level of detail for debug logging.  Possible  values
for  loglevel are NONE, ERROR, WARN, INFO, DEBUG, and FINE. Case
does not matter.  Caution: excessive logging detail can lead  to
significant program slowdown.  The NONE option disables log file
creation.  This may make error debugging difficult.

-o filename, --output-file=filename
Write results to filename. If not specified, results are written
to stdout.

--python=1
Output  the highest-scoring programs as python snippets, instead
of combo.

-S0, --output-score=0
Prevent printing of the score.

-t1, --output-bscore=1
Print the behavioral score.

-V1, --output-eval-number=1
Print the number of evaluations performed.

-W1, --output-with-labels=1
Use named labels instead of position place-holders when printing
candidates. For example, *("$temperature" "$entropy") instead of
*($3 $4). This option is effective only when the data file
contains column labels.

-x1, --output-complexity=1
Print  the  complexity  measure  of  the  model, and the scoring
penalty.

Precision, recall, BEP, F_1 and prerec problem types
The prerec, recall, bep, f_one and precision problem types are used  to
solve  binary  classification  problems:  problems where the goal is to
sort inputs into one of two sets, while maximizing  either  the  preci‐
sion, the sensitivity, or some other figure of merit of the test.

In  moses,  precision  and  recall  (sensitivity) are defined as usual.
Precision is defined as the number of true positives,  divided  by  the
number  of  true positives plus the number of false positives.  Classi‐
fiers with a high precision make very few  mistakes  identifying  posi‐
tives:  they  have  very  few  or no false positives.  However, precise
classifiers may completely fail to identify many,  if  not  most  posi‐
tives; they just don't make mistakes when they do identify them.

Recall,  also  known  as  sensitivity, is defined as the number of true
positives divided by the sum of the number of true positives and  false
negatives.   Classifiers  with high recall will identify most, or maybe
even all positives; however, they may also mis-identify many negatives
as positives, thus ruining precision.

A  trivial way to maximize precision is to have a very low recall rate,
and conversely, one can very easily have a good recall rate if one does
not  mind  a  poor  precision.   Thus a common goal is to maximize one,
while holding the other to a minimum standard.  One common problem is
to find a classifier with the highest possible recall, while holding pre‐
cision to a fixed minimum level; this  may  be  accomplished  with  the
-Hrecall  option.   Alternately,  one may desire to maximize precision,
while maintaining a minimum sensitivity; this may be accomplished  with
the -Hprerec option.  Note that, although these two procedures seem
superficially similar, they can often lead  to  dramatically  different
models  of  the  input  data.   This  is  in part because, during early
stages, moses will choose exemplars that maximize  one  or  the  other,
thus  causing  dramatically different parts of the solution space to be
searched.

A common alternative to maximizing one or  the  other  is  to  maximize
either the arithmetic or the harmonic mean of the two.  The arithmetic
mean is sometimes called the "break-even point" or BEP; it is maximized
when  the -Hbep option is specified.  The harmonic mean is known as the
F_1 score; it is maximized when the -Hf_one option is specified.

moses also provides a second way of  maximizing  precision,  using  the
-Hpre  option.  This option searches for the test with the highest pre‐
cision, while holding the 'activation' in a bounded range.  The defini‐
tion  of  'activation'  is idiosyncratic to moses; it is defined as the
sum of true positives plus false positives, divided by the total
number of rows: that is, it is the fraction
of  rows  for which the trial combo program returned a positive answer,
regardless of whether this was the  right  answer.   Activation  ranges
from  0.0,  to  1.0.   It  is  never  desirable to maximize activation;
rather, most commonly, one wants to peg activation at exactly the frac‐
tion of positives in the training set.

The  minimum  level  to  which  a fixed component should be held may be
specified with the -q or --min-rand-input option.  Thus, for the  -Hre‐
call problem, the -q flag is used to specify the minimum desired preci‐
sion.  Similarly, for the -Hprerec problem, the  -q  flag  is  used  to
specify  the minimum desired recall.  For the -Hpre problem, the  -w or
--max-rand-input option should be used to make sure the activation does
not get too high.

The -q and -w options also set lower and upper bounds for the BEP
problem.  When maximizing BEP, the system attempts to keep the
absolute value of the difference between precision and recall less than
0.5.  This maximum difference can be over-ridden with the -w option.

Adherence to the bounds is done by means of a  scoring  penalty;  combo
programs  that  fail to lie within bounds are penalized.  The harshness
or hardness of the penalty may be specified  by  means  of  the  -Q  or
--alpha  option.  Values much greater than one enforce a hard boundary;
values much less than one make for a very soft boundary.  Negative val‐
ues are invalid.
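
For example, to maximize precision while holding recall at or above
0.5, with a fairly hard boundary, an invocation might resemble:

moses -Hprerec -i input.csv -u out -q0.5 -Q10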

Contin options
Options  that affect the usage of continuously-valued variables.  These
options specify values that are used in a variety  of  different  ways,
depending  on  the  chosen  problem type.  See appropriate sections for
more details.

-Q hardness, --alpha=hardness
The harshness or hardness of a limit that must be adhered to.
Default 0.0 (limits disabled).

-q num, --min-rand-input=num
Minimum value for continuous variables. Default 0.0.

-w num, --max-rand-input=num
Maximum value for continuous variables.  Default 1.0.

-R num, --discretize-threshold=num
Split a continuous domain into two pieces. This option may be
used multiple times to split a continuous domain  into  multiple
pieces: that is, n uses of this option will create n+1 domains.

Demo options
These  options pertain to the various built-in demo and example problem
modes.  Such demo problems are  commonly  used  to  evaluate  different
machine  learning  algorithms, and are thus included here to facilitate
such comparison, as well as to simplify moses  regression  and  perfor‐
mance testing.

-H type, --problem-type=type
A  number of demonstration problems are supported. In each case,
the top results are printed to stdout, as a score, followed by a
combo program.  type may be one of:

cp    Combo program regression. The scoring function is based on
the combo program specified with the -y flag. That is, the
goal of the run is to deduce and learn the specified combo
program.

When specifying combo programs with  continuous  variables
in them, be sure to use the -q, -w and -b flags to specify
a range of input values to be sampled. In order to  deter‐
mine  the fitness of any candidate, it must be compared to
the specified combo program.  The comparison is done at  a
variety of different input values. If the range of sampled
input values is inappropriate, or if there are not  enough
sampled values, then the fitness function may select unex‐
pected, undesired candidates.

dj    Disjunction problem. The scoring function awards a  result
that  is  a  boolean  disjunction (or) of N boolean-valued
variables.  The resulting combo program should be or($1 $2
...).  The size of the problem may be specified with the
-k option.

mux   Multiplex problem. The scoring function models  a  boolean
digital  multiplexer, that is, an electronic circuit where
an "address" of n bits selects one and only one line,  out
of 2^n possible lines. Thus, for example, a single address
bit can select one of two possible lines: the first, if
it is false, and the second, if it is true. The -k option may
be used to specify the value of n.  The actual size of the
problem, measured in bits, is n+2^n and so increases expo‐
nentially fast.

pa    Even parity problem.  The resulting combo program computes
the  parity of k bits, evaluating to true if the parity is
even, else evaluating to false.  The size of  the  problem
may be specified with the -k option.

sr    Polynomial   regression   problem.  Given  the  polynomial
p(x)=x+x^2+x^3+...+x^k, this searches for the shortest pro‐
gram  consisting of nested arithmetic operators to compute
p(x), given x as a free variable. The arithmetic operators
would  be  addition, subtraction, multiplication and divi‐
sion; exponentiation is not allowed in the solution.   So,
for  example,  using the -k2 option to specify the order-2
polynomial x+x^2, then the shortest combo program is *(+(1
$1) $1) (that is, the solution is p(x)=x(x+1) in the usual
arithmetical notation).

-k size, --problem-size=size
Specify the size of the problem.   The  interpretation  of  size
depends on the particular problem type.
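
For example, to run the disjunction demo on five boolean variables:

moses -Hdj -k5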

-y prog, --combo-program=prog
Specify  the  combo program to be learned, when used in combina‐
tion with the -H  cp  option.   Thus,  for  example,  -H  cp  -y
"and(\$1 \$2)" specifies that the two-input conjunction is to be
learned.  Keep in mind that \$ is a reserved  character  in  many
shells, and thus must be escaped with a backslash in order to be
passed to moses.

Complexity Penalty
The speed with which the search algorithm can find a  reasonable  solu‐
tion  is  significantly affected by the complexity ratio specified with
the -z or -p options. This section provides the  theoretical  underpin‐
ning for the meaning of these flags, and how they affect the algo‐
rithm.  The complexity penalty has two slightly different interpreta‐
tions,  depending  on whether one is considering learning a discretely-
valued problem (i.e. boolean-valued) or a continuously-valued  problem.
The  general  structure  of  the  argument  is broadly similar for both
cases; they are presented below.  Similar arguments apply for classifi‐
cation  problems  (learning to classify data into one of N categories),
and for precision maximization.

Discrete case
Let M be the model to be learned (the combo program).   Let  D  be  the
data, assumed to be a table of n inputs i_k and one output o, with each
row in the form:

i_1 ... i_n o

Here, i_k is the k'th input and o the output.  In the below, we write o
= D(x) where x=(i_1, ..., i_n) is an input data row.

We  want to assess the probability P(M|D) of the model M conditioned on
the data D.  In particular, we wish to maximize this,  as  it  provides
the fitness function for the model.  According to Bayes' theorem,

P(M|D) = P(D|M) * P(M) / P(D)

Consider the log likelihood LL(M) of M knowing D.  Since D is constant,
we can ignore P(D), so:

LL(M) = log(P(D|M)) + log(P(M))

Assume each output of M on row x has probability p of being wrong.  So,

P(D|M) = Prod_{x\in D} [p*(M(x) != D(x)) + (1-p)*(M(x) == D(x))]

where D(x) is the observed result given input x.  Then,

log P(D|M) = Sum_{x\in D} log[p*(M(x) !=  D(x))  +  (1-p)*(M(x)  ==
D(x))]

Let D = D_eq \cup D_ne  where D_eq and D_ne are the sets

D_eq = {x \in D | M(x) == D(x) }
D_ne = {x \in D | M(x) != D(x) }

Then

log P(D|M) = Sum_{x\in D_ne} log(p) + Sum_{x\in D_eq} log(1-p)
= |D_ne| log(p) + |D_eq| log(1-p)
= |D_ne| log(p) + |D| log(1-p) - |D_ne| log(1-p)
= |D_ne| log(p/(1-p)) + |D| log(1-p)

Here,  |D| is simply the size of set D, etc.  Assuming that p is small,
i.e. much less than one, then, to second order in p:

log(1-p) = -p - p^2/2 + O(p^3)

So:

log P(D|M) = |D_ne| log(p) - p (|D| - |D_ne|) + O(p^2)

Next, assume P(M) is distributed according  to  Solomonoff's  Universal
Distribution, approximated by (for now)

P(M) = |A|^-|M|
= exp(-|M|*log(|A|))

where A is the alphabet of the model, |A| is the alphabet size, and |M|
is the complexity of the model.  Note that this distribution is identi‐
cal  to  the  Boltzmann  distribution,  for  an  inverse temperature of
log(|A|). Putting it all together, the log-likelihood of M is:

LL(M) = -|M|*log(|A|) + |D_ne| log(p/(1-p)) + |D| log(1-p)

To get an expression usable for a scoring function, just bring out  the
|D_ne| by dividing by -log(p/(1-p)), to get

score(M) = - [ LL(M) - |D| log(1-p) ] / log(p/(1-p))
= -|D_ne| + |M|*log|A| / log(p/(1-p))
= -|D_ne| - |M| |C_coef|

Note that, since p<0.5, log(p/(1-p)) is negative, so that C_coef =
log|A| / log(p/(1-p)) is negative, and the second term above is
negative.  It can be understood as a complexity penalty.  That is,
we define the complexity penalty as

complexity_penalty = |M| |C_coef|

The complexity ratio, as set by the -z option, is given by

complexity_ratio = 1 / |C_coef|

By  contrast,  the -p option may be used to set p directly, as given in
the formulas above.  The value of |A| is computed internally, depending
on  the  specific  problem  type  (discrete  vs.  continuous, number of
included-excluded operators, etc.)  The complexity of each solution  is
also computed, using an ad-hoc complexity measure.

Continuous case
A  similar  argument to the above holds for the case of a continuously-
valued observable.

Let dP(..) be the notation for a probability density (or measure).  As before,

dP(M|D) = dP(D|M) * P(M) / P(D)

Since D is constant, one may ignore P(D), and write the log
likelihood of M knowing D as:

LL(M) = log(dP(D|M)) + log(P(M))

Assume the output of the model M on input x has a Gaussian distribu‐
tion, of mean M(x) and variance V, so that dP(D|M), the probability
density of the data D given the model M, is:

dP(D|M) = Prod_{x\in D} (2*Pi*V)^(-1/2) exp(-(M(x)-D(x))^2/(2*V))

As before, assume a model distribution of

P(M) = |A|^-|M|

where |A| is the alphabet size and |M| the  complexity  of  the  model.
After simplification, and dropping a constant term that does not depend
on either the model complexity or the dataset itself (the dataset  size
is a constant), one then can deduce a scoring function:

score(M) = -|M|*log(|A|)*2*V - Sum_{x\in D} (M(x)-D(x))^2

As  before,  |M|*log(|A|)*2*V  can be interpreted as a scoring penalty.
Alternately, one may interpret each  row  x  as  a  feature;  then  the
penalty  term |M|*log(|A|)*2*V can be interpreted as an additional fea‐
ture that must be fit.

Distributed processing
MOSES provides two different styles of distributed processing for clus‐
ter  computing systems.  One style is to use MPI (as implemented in the
OpenMPI/MPICH2 systems), the second is to use SSH. The first  style  is
best  suited  for  local area networks (LANs) and compute clusters. The
second style allows operation over the open Internet, but is more prob‐
lematic, as it may require manual cleanup of log files and failed jobs.
Because MPI is easy to install and manage, it is the recommended method
for distributing moses operation across many machines.

When  moses  is  run  in distributed fashion, one single node, the root
node, maintains control over a pool of workers that execute  on  remote
nodes.   The  root  node  maintains the set of candidate solutions, and
assigns these to the workers for additional exploration, as the workers
become  free.  The results are automatically collected by the root, and
are automatically merged into the candidate population.  When  termina‐
tion  criteria are met, processing will terminate on all nodes, and the
root node will report the merged, best results.

MPI
Using MPI requires a moses binary with MPI support  compiled  in,  with
either the OpenMPI or the MPICH2 implementations.  MPI support in MOSES
is enabled with the --mpi=1 command-line flag.  When this flag is spec‐
ified,  moses  may be run as usual in an MPI environment.  Details will
vary from one system configuration to  another,  but  a  typical  usage
might resemble the following:

mpirun -n 15 --hostfile mpd.hosts moses --mpi=1 -j 12 <other moses params>

The above specifies that the run should  be  distributed  over  fifteen
nodes,  with  the  node  names  specified  in  the mpd.hosts file.  The
--mpi=1 flag indicates to moses that it is being run in an MPI environ‐
ment.   The  -j12  flag tells moses to use up to twelve threads on each
node, whenever possible; this example assumes each node  has  12  cores
per CPU.

To maximize CPU utilization, it seems best to specify two MPI instances
per node.  This is because not all parts of moses are parallelized, and
some  parts  are  subject  to  lock contention.  Thus, running multiple
instances per node seems to be an effective way to utilize  all  avail‐
able  compute  power  on  that node.  It is almost always the case that
moses RAM usage is relatively small, and so RAM availability is  rarely
a  problem.   The network utilization by moses is also very modest: the
only network traffic is the reporting of candidate  solutions,  and  so
the network demands are typically in the range of 1 Megabit per second,
and are thus easily supported on an  Ethernet  connection.   The  moses
workload  distributes  in  an 'embarrassingly parallel' fashion, and so
there is no practical limit to scaling  on  small  and  medium  compute
clusters.

When performing input data regression, be sure that the input data file
is available on all nodes.   This is most easily  achieved  by  placing
the  input  data  file  on a shared filesystem.  Each instance of moses
will write a log file.  In order to avoid name  collision  on  the  log
files, the process id (PID) will automatically be incorporated into the
log-file name when the --mpi=1 option is specified.  Log file  creation
can  be  disabled  with  the -lNONE option; however, this is not recom‐
mended, as it makes debugging and progress monitoring difficult.

SSH
The SSH style of distributed processing uses the ssh command to  estab‐
lish communications and to control the pool of workers.  Although moses
provides job control when the system is running normally, it  does  not
provide  any mechanism for cleaning up after hung or failed jobs;  this
is outside the scope of the ssh implementation.  The use of a job  man‐
ager, such as LSF, is recommended.

Remote  machines  are specified using the -j option, using the notation
-j N:REMOTE_HOST.  Here, N is the number  of  threads  to  use  on  the
machine  REMOTE_HOST.   For  instance,  one  can  enter the options -j4
-j16:my_server.org (or -j16:user@my_server.org if one wishes to run the
remote  job  under  a  different user name), meaning that 4 threads are
allocated on  the  local  machine  and  16  threads  are  allocated  on
my_server.org.   Password prompts will appear unless ssh-agent is being
used.  The moses executable must  be  on  the  remote  machine(s),  and
located  in  a  directory  included  in  the PATH environment variable.
Beware that a lot of log files are going to be generated on the  remote
machines.
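
Putting this together, a run combining four local threads with sixteen
remote threads might resemble:

moses -j4 -j16:user@my_server.org -Hit -i input.csv -u out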

TODO
Finish documenting these algo flags: --fs-scorer

-M  --diversity-pressure  --diversity-exponent  --diversity-normalize
--diversity-dst --diversity-p-norm --diversity-dst2dp

-R discretize target var

These input flags: -G

Interesting patterns flags: -J -K -U -X