# MOSES man page

From OpenCog

Man page for MOSES (raw dump).

MOSES(1) OpenCog Learning MOSES(1)

## NAME

moses - meta-optimizing semantic evolutionary search solver

## SYNOPSIS

moses -h|--help

moses --version

moses [-H problem_type] [-i input_filename] [-u target_feature] [--enable-fs=1] [-Y ignore_feature] [-m max_evals] [--max-time seconds] [-f logfile] [-l loglevel] [-r random_seed] [--mpi=1] [-n ignore_operator] [-N use_only_operator] [-o output_file] [--boost] [-A max_score] [-a algo] [-B knob_effort] [-b nsamples] [-C1] [-c result_count] [-D max_dist] [-d1] [-E reduct_effort] [-e exemplar] [-F] [-g max_gens] [-I0] [-j jobs] [-k problem_size] [-L1] [-P pop_size_ratio] [-p probability] [--python=1] [-Q hardness] [-q min] [-R threshold] [-S0] [-s1] [--score-weight column] [-T1] [-t1] [-V1] [-v temperature] [-W1] [-w max] [--well-enough=1] [-x1] [-y combo_program] [-Z1] [-z complexity_ratio]

## DESCRIPTION

moses is the command-line wrapper to the MOSES program-learning library. It may be used to solve learning tasks specified with file inputs, or to run various demonstration problems. Given an input table of values, it will perform a regression, generating, as output, a combo program that best fits the data. moses is multi-threaded, and can be distributed across multiple machines to improve run-time performance.

## EXAMPLE

As an example, the input file of comma-separated values:

    a,b,out
    0,0,0
    0,1,0
    1,0,0
    1,1,1

when run with the flags -Hit -u out -W1 will generate the combo program and($a $b). This program indicates that the third column can be modelled as the boolean-and of the first two. The -u option specifies that the column labelled out (the third column) is the one to be modelled, and the -W1 flag indicates that column labels, rather than column numbers, should be printed on output.

## OVERVIEW

moses implements program learning by using a meta-optimization algorithm. That is, it uses two optimization algorithms, one wrapped inside the other, to find solutions.
The outer optimization algorithm selects candidate programs (called exemplars), and then creates similar programs by taking an exemplar and inserting variables (called knobs) at selected locations. The inner optimization algorithm then searches for the best possible 'knob settings', returning a set of programs (a deme) that provide the best fit to the data. The outer optimization algorithm then selects new candidates, and performs a simplification step, reducing these to a normal form, to create new exemplars. The process is then repeated until a perfect score is reached, a maximum number of cycles has been performed, or forward progress is no longer being made.

Program reduction is performed at two distinct stages: after inserting new knobs, and when candidates are selected. Reduction takes a program and transforms it into a new form, the so-called elegant normal form, which, roughly speaking, expresses the program in the smallest, most compact form possible. Reduction is an important step of the algorithm, as it prevents the evolved programs from turning into 'spaghetti code', avoiding needless complexity that can be costly during program evaluation.

The operation of moses can be modified by using a large variety of options to control different parts of the above meta-algorithm. The inner optimization step can be done using one of several different algorithms, including hill-climbing, simulated annealing, or univariate Bayesian optimization. The amount of effort spent in the inner loop can be limited, as compared to how frequently new demes and exemplars are chosen. The total number of demes or exemplars maintained in the population may be controlled, as may the effort expended on program normalization at each of the two reduction steps. In addition to controlling the operation of the algorithm, many options are used to describe the format of the input data, and to control the printing of the output.
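The combo program from the EXAMPLE section can be checked by hand against its table; the following short sketch (the table and evaluator are written for this page, and are not part of moses) confirms that and($a $b) scores zero errors:

```python
# Check that the combo program and($a $b) fits the EXAMPLE table.
# Illustrative only; moses does this internally via its scoring function.
table = [
    # (a, b, out)
    (0, 0, 0),
    (0, 1, 0),
    (1, 0, 0),
    (1, 1, 1),
]

def candidate(a, b):
    """Python rendering of the combo program and($a $b)."""
    return int(bool(a) and bool(b))

errors = sum(candidate(a, b) != out for a, b, out in table)
print(errors)  # 0: a perfect fit
```

For boolean and enumerated targets, this error count is exactly the quantity the default scorer minimizes.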
moses also has a 'demo' mode: it has a number of built-in scoring functions for various well-known optimization problems, such as parity, disjunction, and multiplex. Each of these typically presents a different set of challenges to the optimization algorithm, and they are used to illustrate 'hard' optimization problems. These demo problems are useful both for learning and understanding the use of moses, and also for developers and maintainers of moses itself, when testing new features and enhancements.

## COMBO PROGRAMS

This section provides a brief overview of the combo programming language. Combo is not meant so much for general, practical programming by humans as it is to fit the needs of program learning. As such, it is fairly minimal, while still being expressive enough that all common programming constructs are simply represented in combo. It is not difficult to convert combo programs into other languages, or to evaluate them directly.

In combo, all programs are represented as program trees. That is, each internal node in the tree is an operation, and leaves are either constants or variables. The set of all the variables in a program is taken as the arguments to the program. The program is evaluated by supplying explicit values for the variables, and then applying operations until the root node is reached. By convention, variables are denoted by a dollar-sign followed by a number, although they may also be named. Variables are not explicitly typed; they are only implicitly typed during program evaluation. Thus, for example, or($1 $2) is a combo program that represents the boolean-or of two inputs, $1 and $2.

Built-in constants include true and false; numbers may be written with or without a decimal point. Logical operators include and, or and not. Arithmetic operators include +, -, * and / for addition, subtraction, multiplication and division. Additional numeric operators include sin, log, exp and rand.
The rand operator returns a value in the range from 0.0 to 1.0. The predicate operator 0< (greater than zero) takes a single continuous-valued expression, and returns a boolean. The impulse operator does the opposite: it takes a single boolean-valued expression, and returns 0.0 or 1.0, depending on whether the argument evaluates to false or true.

The cond operator takes an alternating sequence of predicates and values, terminated by an 'else' value. It evaluates as a chain of if-then-else statements, so that it returns the value following the first predicate to evaluate to true.

Thus, as an example, the expression and( 0<( +($x 3)) 0<($y)) evaluates to true if $x is greater than -3 and $y is greater than zero.

The above operators are built-in; combo may also be extended with custom-defined operators, although C++ programming and recompilation is required for this.

## OPTIONS

Options fall into four classes: those for specifying inputs, controlling operation, printing output, and running moses in demo mode.

### General options

- `-e exemplar` : Specify an initial exemplar to use. Each exemplar is written in combo. May be specified multiple times.
- `-h, --help` : Print command options, and exit.
- `-j num, --jobs=num` : Allocate num threads for deme optimization. Using multiple threads will speed the search on a multi-core machine. This flag may also be used to specify remote execution; see the section Distributed processing below.
- `--version` : Print program version, and exit.

### Problem-type options

MOSES is able to handle a variety of different 'problem types', such as regression, categorization and clustering, as well as a number of demo problems, such as parity and factorization. The -H option is used to specify the problem type; the demo problem types are listed in a later section. This section lists the various "table" problems, where the goal is to train moses on an input table of values.
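The semantics of these operators are easy to mimic outside of combo. The toy evaluator below (written for this page; it is not part of moses, and uses hypothetical helper names) renders 0<, impulse and cond in Python, and checks the example expression and( 0<( +($x 3)) 0<($y)):

```python
# Toy renderings of a few combo operators; illustrative only.
def gt_zero(v):          # combo: 0<
    return v > 0.0

def impulse(b):          # combo: impulse
    return 1.0 if b else 0.0

def cond(*args):         # combo: cond, i.e. predicate/value pairs plus an else-value
    *pairs, else_val = args
    for pred, val in zip(pairs[0::2], pairs[1::2]):
        if pred:
            return val
    return else_val

# and( 0<( +($x 3)) 0<($y)) with $x = -2, $y = 1:
x, y = -2.0, 1.0
print(gt_zero(x + 3) and gt_zero(y))  # True: x > -3 and y > 0
```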
- `-H type, --problem-type=type` : The type of problem may be one of:
  - `it` : Regression on an input table. That is, the input table consists of a set of columns, all but one considered 'inputs', and one considered the output. The goal of regression is to learn a combo program that most accurately predicts the output. For boolean-valued and enumerated outputs, the scoring function simply counts the number of incorrect answers, and tries to minimize this score; that is, this scorer attempts to maximize accuracy (defined below). For contin-valued outputs, the mean-square variation is minimized.
  - `pre` : Regression on an input table, maximizing precision instead of accuracy (that is, minimizing the number of false positives, at the risk of sometimes failing to identify true positives). Maximization is done while holding activation constant. This scorer is ideal for learning "domain experts", which provide a result only when they are certain that the result is correct. So, for a binary classifier, the output is meant to be interpreted as "yes, certain" or "no, don't know". An ensemble of such combo expressions is commonly called a "mixture of experts".
  - `prerec` : Regression on an input table, maximizing precision, while attempting to maintain the recall (sensitivity) at or above a given level.
  - `recall` : Regression on an input table, maximizing recall (sensitivity) while attempting to maintain the precision at or above a given level. This scorer is most commonly used when it is important to guarantee a certain level of precision, even if it means rejecting most events. In medicine and physics/radio applications, recall is exactly the same thing as sensitivity: this option searches for the most sensitive test while holding to a minimum level of precision.
  - `bep` : Regression on an input table, maximizing the arithmetic mean of the precision and recall, also known as the "break-even point" or BEP.
  - `f_one` : Regression on an input table, maximizing the harmonic mean of the precision and recall, that is, the F_1 score.
  - `ip` : Discovery of "interesting predicates" that select rows from the input table. The data table is assumed to consist of a number of boolean-valued input columns, and a contin-valued (floating point) target column. moses will learn predicates that select the most "interesting" subset of the rows in the table. The values in the output columns of the selected rows form a probability distribution (PDF); this PDF is considered to be "interesting" if it maximizes a linear combination of several different measures of the PDF: the Kullback-Leibler divergence, the skewness, and the standardized Mann-Whitney U statistic.
  - `kl` : Regression on an input table, by maximizing the Kullback-Leibler divergence between the distributions of the outputs. That is, the output must still be well-scored, but it is assumed that there are many possible maxima. (XXX: marked as unclear in the original.)
  - `ann-it` : Regression on an input table, using a neural network. (XXX: marked as unclear in the original.)

For boolean-valued data tables, the scorers make use of the following generally-accepted, standard definitions:

- TP (true positive) : the sum of all rows for which the combo model predicts T for a data-table row marked T.
- FP (false positive) : the sum of all rows for which the combo model predicts T for a data-table row marked F.
- FN (false negative) : the sum of all rows for which the combo model predicts F for a data-table row marked T.
- TN (true negative) : the sum of all rows for which the combo model predicts F for a data-table row marked F.
- accuracy : (TP + TN) / (TP + TN + FP + FN)
- activation : (TP + FP) / (TP + TN + FP + FN)
- precision : TP / (TP + FP)
- recall : also known as sensitivity; TP / (TP + FN)
- F_1 : the harmonic mean of precision and recall; (2 * precision * recall) / (precision + recall) = 2TP / (2TP + FP + FN)
- bep : the break-even point, the arithmetic mean of precision and recall; (precision + recall) / 2

### Input specification options

These options control how input data is specified and interpreted. In its primary mode of operation, moses performs regression on a table of input data. One column is designated as the target; the remaining columns are taken as predictors. The output of regression is a combo program that is a function of the predictors, reproducing the target.

Input files should consist of ASCII data, separated by commas or whitespace. The appearance of #, ; or ! in the first column denotes a comment line; such lines are ignored. The first non-comment row, if it is also non-numeric, is taken to hold column labels. The target column may be specified using the -u option with a column name. The printing of column names on output is controlled with the -W1 flag.

- `-i filename, --input-file=filename` : The filename specifies the input data file. The input table must be in 'delimiter-separated value' (DSV) format. Valid separators are comma (CSV, or comma-separated values), blanks and tabs (whitespace). Columns correspond to features; there is one sample per (non-blank) row. Comment characters are hash, bang and semicolon (#, !, ;); lines starting with a comment character are ignored. The -i flag may be specified multiple times, to indicate multiple input files. All files must have the same number of columns.
- `-u column, --target-feature=column` : The column is used as the target feature to fit. If no column is specified, then the first column is used.
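The boolean-table metrics defined above reduce to a few lines of Python (a self-contained sketch for this page, not moses code):

```python
# Confusion-matrix metrics as defined in this man page; illustrative only.
def metrics(tp, fp, fn, tn):
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. sensitivity
    return {
        "accuracy":   (tp + tn) / total,
        "activation": (tp + fp) / total,
        "precision":  precision,
        "recall":     recall,
        "F_1":        2 * precision * recall / (precision + recall),
        "bep":        (precision + recall) / 2,
    }

m = metrics(tp=40, fp=10, fn=10, tn=40)
print(m["accuracy"], m["F_1"])  # 0.8 0.8
```

Note that activation, which the pre scorer holds constant, is just the fraction of rows the model answers T on, regardless of whether those answers are correct.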
The column may be numeric, or it may be a column label. If it is numeric, it is taken to be the number of the column, with column 1 being the left-most. If column begins with an alphabetic character, it is taken to be a column label. In this case, the first non-comment row of the input file must contain column labels.

- `-Y column, --ignore-feature=column` : The column should be ignored, and not used as input. Columns are specified as above. This option may be used multiple times, to ignore multiple columns.
- `--score-weight=column` : The column is used to weight the score for each row. If this option is not used, then each row in the table contributes equally to the evaluation of the accuracy of the learned model. However, if some rows are more important than others to get right, this option can be used to indicate those rows. The accuracy of the model will be weighted by this number when evaluating the score. A weight of zero effectively causes the row to be ignored. A negative weight encourages the system to learn models that get the row incorrect; for boolean problems, this is the same as flipping the output value. This option can only be used once, and, if used, it should specify a column containing an integer or floating-point value.
- `-b num, --nsamples=num` : The number of samples to be taken from the input file. Valid values run between 1 and the number of rows in the data file; other values are ignored. If this option is absent, then all data rows are used. If this option is present, then the input table is sampled randomly to reach this size.
- `-G1, --weighted-accuracy=1` : This option is only used for the discretize_contin_bscore (when --discretize-threshold is used). If enabled, then the score corresponds to weighted accuracy. Useful in the case of unbalanced data.
- `--balance=1` : If the table has a discrete output type (such as bool or enum), balance the resulting ctable so that all classes have the same weight.
### Algorithm control options

These options provide overall control over the algorithm execution. The most important of these, for controlling behavior, are the -A, -a, -m, --max-time, -r, -v and -z flags.

- `-a algorithm, --algo=algorithm` : Select the algorithm to apply to a single deme. This is the algorithm used in the 'inner loop': given a single exemplar decorated with tunable knobs, this algorithm searches for the best possible knob settings. Once these are found (or a timeout, or other termination condition, is reached), control is returned to the outer optimization loop. Available algorithms include:
  - `hc` : Hill-climbing. There are two primary modes of operation; each has strengths and weaknesses for different problem types. In the default mode, one begins with an initial collection of knob settings, called an instance. The setting of each knob is then varied, in turn, until one setting is found that most improves the score. This setting then becomes the new instance, and the process is repeated until no further improvement is seen. The resulting instance is a local maximum; it is returned to the outer loop. The alternate mode of operation is triggered by using the -L1 flag (usually with the -T1 flag). In this case, as before, all knob settings are explored, one knob at a time. After finding the one knob that most improves the score, the algorithm is done, and the resulting instance is returned to the outer loop. If no knob settings improved the score, then all possible settings of two knobs are explored, then three, etc., until improvement is found (or the allotted iterations are exceeded). In this alternate mode, the local hill is not climbed to the top; instead, any improvement is immediately handed back to the outer loop, for another round of exemplar selection and knob-building.
For certain types of problems, including maximally misleading problems, this can arrive at better solutions, more quickly, than the traditional hill-climbing algorithm described above.

  - `sa` : Simulated annealing. (Deprecated.) The -D flag controls the size of the neighborhood that is searched during the early, "high-temperature" phase, and has a significant effect on the run-time performance of the algorithm; using -D2 or -D3 is likely to provide the best performance. The current implementation of this algorithm has numerous faults, making it unlikely to work well for most problems.
  - `un` : Univariate Bayesian dependency.
- `-A score, --max-score=score` : Specifies the ideal score for a desired solution; used to terminate the search. If the maximum number of evaluations has not yet elapsed (set with the -m option), and a candidate solution is found that has at least this score, then the search is terminated.
- `-m num, --max-evals=num` : Perform no more than num evaluations of the scoring function. Default value is 10000.
- `--max-time=secs` : Run the optimizer for no longer than secs seconds. Note that timing is polled at only a small number of points in the algorithm; thus, actual execution time might exceed this limit by a few seconds, or even many minutes, depending on the problem type. In particular, knob-building time is not accounted for right away, and thus problems with a long knob-building time will exceed this limit. If using this option, be sure to set the -m option to some very large value. Default value is 42 years.
- `-n oper, --ignore-operator=oper` : Exclude the operator oper from the program solution. This option may be used several times. Currently, oper may be one of div, sin, exp, log, impulse or a variable #n. You may need to put variables in double quotes. This option has priority over the -N option; that is, if an operator is both included and ignored, then it is ignored. This option does not work with ANN.
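The default hc mode described above can be sketched abstractly. The toy hill-climber below works over boolean knob settings with a made-up scoring function; it illustrates the shape of the inner loop only, and is not the moses implementation:

```python
# Toy hill-climber over boolean knob settings; illustrative only.
def hillclimb(score, n_knobs):
    center = [0] * n_knobs               # initial instance: all knobs off
    best = score(center)
    while True:
        best_s, best_nbr = best, None
        for i in range(n_knobs):         # vary each knob in turn
            trial = list(center)
            trial[i] ^= 1                # flip one knob
            s = score(trial)
            if s > best_s:
                best_s, best_nbr = s, trial
        if best_nbr is None:
            return center, best          # local maximum reached
        center, best = best_nbr, best_s  # best neighbor becomes the new instance

# Made-up scorer: prefer instances matching the pattern 1,0,1,0.
target = [1, 0, 1, 0]
score = lambda inst: -sum(a != b for a, b in zip(inst, target))
print(hillclimb(score, 4))  # ([1, 0, 1, 0], 0)
```

The -L1 single-step mode would return after the first improving flip instead of looping to the top.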
- `--linear-regression=1` : When attempting to fit continuous-valued features, restrict the search to linear expressions only; that is, do not use polynomials in the fit. Specifying this option also automatically disables the use of div, sin, exp and log. Note that polynomial regression results in search spaces that grow combinatorially large in the number of input features: for N features, a quadratic search will entail O(N^2) possibilities, a cubic search will explore O(N^3) possibilities, and so on. Thus, for any problem with more than dozens or a hundred features, linear regression is recommended.
- `-r seed, --random-seed=seed` : Use seed as the seed value for the pseudo-random number generator.
- `-s1, --enable-cache=1` : Enable memoization of candidate scores. This allows the number of scoring-function evaluations to be reduced, by maintaining a cache of recently scored candidates. If a new candidate is found in the cache, that score is used, instead of a scoring-function evaluation. The effectiveness of memoization is greatly increased by also using the -d1 flag.
- `-v temperature, --complexity-temperature=temperature` : Set the "temperature" of the Boltzmann-like distribution used to select the next exemplar out of the metapopulation. A temperature that is too high or too low will make it likely that poor exemplars will be chosen for exploration, resulting in excessively long search times. The recommended temperature depends strongly on the type of problem being solved. If it is known that the problem has false maxima, and that the distance from the top of a false maximum to the saddle separating the false and true maxima is H, then the recommended temperature is 30*H. Here, H is the 'height' or difference in score from false peak to saddle, the saddle being the highest mountain pass between the false and true maxima. Varying the temperature by a factor of 2 or 3 from this value won't affect results much.
Too small a temperature will typically lead to the system getting trapped at a local maximum. The demo parity problem works well with a temperature of 5, whereas the demo Ant trail problem requires a temperature of 2000.

- `-z ratio, --complexity-ratio=ratio` : Fix the ratio of score to complexity, to be used as a penalty when ranking the metapopulation for fitness. This ratio is meant to limit the size of the search space and, when used with an appropriate temperature, to avoid getting trapped in local maxima.

Roughly speaking, the size of the search space increases exponentially with the complexity of the combo trees being explored: more complex trees means more of them need to be examined. However, more complex trees typically result in higher scores. If an increase of N bits in the complexity typically leads to an increase of s points in the score, then the complexity ratio should be set to about N/s. In this way, the exploration of more complex trees is penalized by an amount roughly comparable to the chance that such complicated trees actually provide a better solution.

The complexity ratio is used to calculate a scoring penalty; the penalty lowers the score in proportion to the solution complexity. Specifically, the penalty is set to the complexity divided by the complexity ratio.

Setting the ratio too low causes the algorithm to ignore the more complex solutions, ranking them in a way such that they are not much explored. Thus, the algorithm may get trapped examining only the simplest solutions, which are probably inappropriate. Setting this ratio too high will prevent a good solution from being found: the algorithm will spend too much time evaluating overly-complex solutions, blithely leaving simpler, better solutions unexplored.

The relationship between the score change and the complexity change is very strongly data-dependent, and must (currently) be determined manually (although it might be possible to measure it automatically).
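The penalty arithmetic described for -z is simple enough to work through directly. The numbers below are illustrative, not drawn from any real run:

```python
# Complexity penalty as described above: penalty = complexity / complexity_ratio.
# If +N bits of complexity typically buys +s points of score, set ratio to about N/s.
def penalized_score(raw_score, complexity, complexity_ratio):
    return raw_score - complexity / complexity_ratio

# Illustrative: a simple tree vs. a more complex, slightly better-scoring one,
# ranked with the default ratio of 3.5.
simple_tree  = penalized_score(raw_score=0.80, complexity=10, complexity_ratio=3.5)
complex_tree = penalized_score(raw_score=0.83, complexity=40, complexity_ratio=3.5)
print(simple_tree > complex_tree)  # True: the penalty suppresses the complex candidate
```

With a much larger ratio the ranking would flip, which is exactly the too-high-ratio failure mode described above.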
Input data tables with lots of almost-duplicate data may have very low ratios; complex problems with sparse data may have very high ratios. Initial recommended values are in the range from 1 to 5, with 3.5 as the default. The parity demo problem works well with the 3.5 default; the Ant trail demo problem works well with 0.16.

- `-Z1, --hc-crossover=1` : Controls hill-climbing algorithm behavior. If false (the default), then the entire local neighborhood of the current center instance is explored. The highest-scoring instance is then chosen as the new center instance, and the process is repeated. For many datasets, however, the highest-scoring instances tend to cluster together, and so an exhaustive search may not be required. When this option is specified, a handful of the highest-scoring instances are crossed-over (in the genetic sense of cross-over) to create new instances. Only these are evaluated for fitness; the exhaustive search step is skipped. For many problem types, especially those with large neighborhoods (i.e. those with high program complexity), this can lead to an order-of-magnitude speedup, or more. For other problem types, especially those with deceptive scoring functions, this can hurt performance.
- `--boost=1` : Enables the use of boosting. Currently, only the AdaBoost algorithm is implemented. If boosting is enabled, then the system focuses on learning combo programs that correctly classify those values that were mis-classified in the previous round of learning. The output is a set of weighted combo trees. At this time, boosting can only be used to learn binary classifiers: that is, to learn problems in which the scorer can provide a yes/no answer.

### Boosted dynamic feature selection

Problems with a large number of input features (typically, hundreds or more) can lead to excessively long run-times, and overwhelming amounts of memory usage. Such problems can be tamed in two different ways: by static feature pre-selection (i.e.
by giving moses fewer features to work with) or by reducing the number of features considered during the knob-building stage.

Recall that, as moses runs, it iterates over two loops: an outer loop, where an exemplar is selected and decorated with knobs, and an inner loop, where the knobs are 'turned', searching for the best knob settings. Each knob corresponds to one feature added to a particular location in the exemplar tree. A single, given feature will then be used to build a number of different knobs, each in a different place in the tree. As trees get larger, there are more places in each tree that can be decorated with knobs; thus, the number of knobs, and the search space, increases over time.

If there are M places in the tree that can be decorated, and there are N features, then M*N knobs will be created. The search space is exponential in the number of knobs; for example, for boolean problems, three knob settings are explored: present, absent and negated, leading to a search space of 3^(M*N). Worse, the resulting large trees also take longer to simplify and evaluate. Clearly, limiting the number of knobs created can be a good strategy.

This can be done with dynamic, boosted feature selection. When enabled, a feature-selection step is performed, to find those features that are most likely to improve on the exemplar. These are the features that will be strongly (anti-)correlated with incorrect answers from the currently-selected exemplar. By using only these features during knob building, the total number of knobs can be sharply decreased. The resulting size of the decorated tree is smaller, and the total search space is much smaller.

Dynamic, boosted feature selection differs from static feature pre-selection in many important ways. In dynamic selection, the total number of features available to moses is not reduced: just because a feature was not used in a given round of knob-decoration does not mean that it won't be used next time.
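The knob-count arithmetic above is easy to make concrete. The numbers here are illustrative, chosen only to show the scale of the problem:

```python
# Knob count and search-space size for boolean problems, as described above:
# M decorated locations times N features gives M*N knobs, each with 3 settings.
M, N = 20, 100                  # illustrative tree size and feature count
knobs = M * N
search_space = 3 ** knobs
print(knobs)                    # 2000 knobs
print(len(str(search_space)))   # 955 digits: exhaustive search is hopeless

# Dynamic feature selection with, say, --fs-target-size=10 shrinks this:
print(len(str(3 ** (M * 10))))  # 3^200 has just 96 digits
```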
Dynamic feature selection also resembles boosting, in that it focuses on fixing the wrong answers of the current exemplar. By contrast, static pre-selection permanently discards features, making them unavailable to moses; it is also unable to predict which features are the most likely to be useful during iteration.

Boosted, dynamic feature selection is enabled with the --enable-fs=1 option. The number of features to use during knob building is specified using the --fs-target-size option. A number of additional flags control the behavior of the feature-selection algorithm; these are best left alone, as the defaults should be adequate for almost all problems. The man page for the feature-selection command describes these in greater detail.

- `--enable-fs=1` : Enable integrated feature selection. Feature selection is disabled by default.
- `--fs-target-size=num` : Select num features for use. This argument is mandatory if feature selection is enabled.
- `--fs-algo` : Choose the feature-selection algorithm. Possible choices are simple, inc, smd, hc and random. The default value is simple, which is the fastest and best algorithm. The simple algorithm computes the pair-wise mutual information (MI) between the target and each feature, and then selects the list of num features with the highest MI. It is strongly recommended that this algorithm be used. The smd algorithm implements "stochastic mutual dependency", by computing the MI between a candidate featureset and the target. Starting with an initially empty featureset, features are randomly added one-by-one, with the MI then computed. Only the featureset with the highest MI is kept; the process is then repeated until the featureset has the desired number of features in it, or the MI has stopped increasing. Note that the smd algorithm effectively prevents redundant features from being added to the featureset. Note also that smd runs orders of magnitude more slowly than simple, and probably does not provide better results.
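The ranking step of the simple algorithm amounts to a pairwise mutual-information computation. The sketch below does this over toy boolean columns; it is written for this page and is far simpler than the actual moses implementation:

```python
# Rank boolean features by mutual information with the target, keep the top k.
from collections import Counter
from math import log2

def mutual_info(xs, ys):
    """MI in bits between two equal-length discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

target   = [0, 0, 0, 1, 1, 1, 1, 0]
features = {
    "f1": [0, 0, 0, 1, 1, 1, 1, 0],   # identical to target: maximal MI
    "f2": [0, 1, 0, 1, 0, 1, 0, 1],   # independent of target: zero MI
}
ranked = sorted(features, key=lambda f: mutual_info(features[f], target),
                reverse=True)
print(ranked[:1])  # ['f1']
```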
The inc algorithm is similar to smd, except that it adds many features, all at once, to the featureset, and then attempts to find the redundant features in the featureset and remove them. This is iteratively repeated until the desired number of features is obtained. Note that inc runs orders of magnitude more slowly than simple, and probably does not provide better results. The hc algorithm runs moses hill-climbing to discover those features most likely to appear. The random algorithm selects features randomly. It is useful only for limiting the number of knobs created, but not otherwise slanting the choice of features.

- `--fs-threshold=num` : Set the minimum threshold for selecting a feature.
- `--fs-inc-redundant-intensity=fraction` : When using the inc algorithm, set the threshold to reject redundant features.
- `--fs-inc-target-size-epsilon=tolerance` : When using the inc algorithm, set the smallest step size used.
- `--fs-inc-interaction-terms=num_terms` : When using the inc algorithm, set the number of terms used when computing the joint entropy.
- `--fs-hc-max-score` : TODO write description
- `--fs-hc-confidence-penalty-intensity` : TODO write description
- `--fs-hc-max-evals` : TODO write description
- `--fs-hc-fraction-of-remaining` : TODO write description

### Large problem parameters

Problems with a large number of features (100 and above) often evolve exemplars with a complexity of 100 or more, which in turn may have instances with hundreds of thousands of nearest neighbors. Exploring one nearest neighbor requires one evaluation of the scoring function, and so an exhaustive search can be prohibitive. A partial search can often work quite well, especially when cross-over is enabled. The following flags control such partial searches.

- `--hc-max-nn-evals=num` : Controls hill-climbing algorithm behavior. When exploring the nearest neighborhood of an instance, num specifies the maximum number of nearest neighbors to explore.
An exhaustive search of the nearest neighborhood is performed when the number of nearest neighbors is less than this value.

- `--hc-fraction-of-nn=frac` : Controls hill-climbing algorithm behavior. When exploring the nearest neighborhood of an instance, frac specifies the fraction of the nearest neighborhood to explore. As currently implemented, only an estimate of the nearest-neighborhood size is used, not the true size. However, this estimate is accurate to within a factor of 2. Thus, to obtain an exhaustive search of the entire neighborhood, set this to 2.0 or larger.

### Algorithm tuning options

These options allow the operation of the algorithm to be fine-tuned for specific applications. These are "advanced" options; changing these from the default is likely to worsen algorithm behavior in all but certain special cases.

- `-B effort, --reduct-knob-building-effort=effort` : Effort allocated for reduction during the knob-building stage. Valid values are in the range 0-3, with 0 standing for minimum effort, and 3 for maximum effort. Larger efforts result in demes with fewer knobs, thus lowering the overall dimension of the problem. This can improve performance by effectively reducing the size of the problem. The default effort is 2.
- `-D dist, --max-dist=dist` : The maximum radius of the neighborhood around the exemplar to explore. The default value is 4.
- `-d1, --reduce-all=1` : Reduce candidates before scoring evaluation. Otherwise, only dominating candidates are reduced, just before being added to the metapopulation. This flag may be useful if the expense of scoring-function evaluation depends strongly on the structure of the candidate. It is particularly important to specify this flag when memoization is enabled (with -s1).
- `-E effort, --reduct-candidate-effort=effort` : Effort allocated for reduction of candidates. Valid values are in the range 0-3, with 0 standing for minimum effort, and 3 for maximum effort.
For certain very symmetric problems, such as the disjunct problem, greater reduction can lead to significantly faster solution-finding. The default effort is 2. -g num, --max-gens=num Create and optimize no more than num demes. Negative numbers are interpreted as "unlimited". By default, the number of demes is unlimited. --discard-dominated=1 If specified, the "dominated" members of the metapopulation will be discarded. A member of the metapopulation is "dominated" when some existing member of the metapopulation scores better on *every* sample in the scoring dataset. Naively, one might think that an individual that does worse, in every possible way, is useless, and can be safely thrown away. It turns out that this is a bad assumption; dominated individuals, when selected for deme expansion, often have far fitter offspring than the offspring of the top-scoring (dominating) members of the metapopulation. Thus, the "weak", dominated members of the metapopulation are important for ensuring the vitality of the metapopulation as a whole, and are discarded only at considerable risk to the future adaptability of the overall population. Put another way: specifying this flag makes it more likely that the metapopulation will get trapped in a non-optimal local maximum. Note that the algorithms to compute domination are quite slow, and so keeping dominated individuals has a double benefit: not only is the metapopulation healthier, but metapopulation management runs faster. -L1, --hc-single-step=1 Single-step, instead of hill-climbing to the top of a hill. That is, a single uphill step is taken, and the resulting best demes are folded back into the metapopulation. Solving then continues as usual. By default, the hill-climbing algorithm does not single-step; it instead continues to the top of the local hill, before folding the resulting demes back into the metapopulation. 
If using this flag, consider using the -T1 flag to allow the search to be widened, so that if the initial exemplar is already at the top of a local hill, a search is made for a different (taller) hill. -N oper, --include-only-operator=oper Include the operator oper, but exclude others, in the solution. This option may be used several times to specify multiple operators. Currently, oper may be one of plus, times, div, sin, exp, log, impulse or a variable #n. Note that variables and operators are treated separately, so that including only some operators will still include all variables, and including only some variables will still include all operators. You may need to put variables under double quotes. This option does not work with ANN. -P num, --pop-size-ratio=num Controls the amount of time spent on a deme. The default value is 20. -p fraction, --noise=fraction This option provides an alternative means of setting the complexity ratio. If specified, it over-rides the -z option. For discrete problems, fraction can be interpreted as being the fraction of score values that are incorrect (e.g. due to noisy data). As such, only values in the range 0 < fraction < 0.5 are meaningful (i.e. less than half of the data values are incorrect). Typical recommended values are in the range of 0.001 to 0.05. For continuous-valued problems, it can be interpreted as the standard deviation of the Gaussian noise in the dependent variable. For the discrete problem, the complexity ratio is related to the fraction p by the explicit formula: complexity_ratio = - log(p/(1-p)) / log |A| where |A| is the (problem-dependent) alphabet size. See below for a detailed explanation. -T1, --hc-widen-search=1 Controls hill-climbing algorithm behavior. If false (the default), then deme search terminates when a local hilltop is found. If true, then the search radius is progressively widened, until another termination condition is met. Consider using the -D flag to set the maximum search radius. 
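The relation between the -p noise fraction and the complexity ratio, as given by the formula above, can be evaluated directly. A minimal sketch, assuming an illustrative alphabet size of 3 (in practice moses computes |A| internally, depending on the problem type):

```python
import math

def complexity_ratio(p, alphabet_size):
    """complexity_ratio = -log(p/(1-p)) / log|A|; meaningful for 0 < p < 0.5."""
    if not 0.0 < p < 0.5:
        raise ValueError("noise fraction must lie in (0, 0.5)")
    return -math.log(p / (1.0 - p)) / math.log(alphabet_size)

# Smaller noise fractions yield a larger complexity ratio, i.e. a weaker
# complexity penalty relative to the fitness term.
ratio = complexity_ratio(0.01, 3)
```

Sweeping p from 0.05 down to 0.001 raises the complexity ratio, which is why assuming lower noise permits more complex candidate programs.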
--well-enough=1 For problems with an enumerated ('nominal') output, the learned combo program is always of the form cond(pred_1 value_1 pred_2 value_2 ... pred_n value_n else_val) where pred_1 is a predicate, which, if true, causes the output to be value_1. If false, then pred_2 is tested, and so on. If none of the predicates evaluate to true, then the value of the cond expression is the else_val. The well-enough algorithm attempts to find predicates that maximize precision, the point being that if a perfectly precise pred_1 can be found, then it can be left alone ('leave well-enough alone'), thus simplifying the remainder of the search problem. Performing this evaluation is costly, and may lead to a slow-down, without improving overall accuracy. Output control options These options control the displayed output. Note that, by default, the ten best solutions are printed. These are printed in penalized-score-sorted order, but the actual printed score is the raw, un-penalized score. This can lead to the printed list seeming to be in random order, which can occur if the penalties are high. The penalties can be too high if the complexity ratio is set too low, or the temperature is set too low. -C1, --output-dominated=1 Print all of the final metapopulation, and not just the highest-scoring candidates. -c count, --result-count=count The number of non-dominated (best) results to return, ordered according to score. If negative, then all results are returned, including the dominated results. -f filename, --log-file=filename Write debug log traces to filename. If not specified, traces are written to moses.log. -F, --log-file-dep-opt Write debug log traces to a filename constructed from the passed option flags and values. The filename will be truncated to a maximum of 255 characters. -l loglevel, --log-level=loglevel Specify the level of detail for debug logging. Possible values for loglevel are NONE, ERROR, WARN, INFO, DEBUG, and FINE. Case does not matter. 
Caution: excessive logging detail can lead to significant program slowdown. The NONE option disables log file creation. This may make error debugging difficult. -o filename, --output-file=filename Write results to filename. If not specified, results are written to stdout. --python=1 Output the highest-scoring programs as python snippets, instead of combo. -S0, --output-score=0 Prevent printing of the score. -t1, --output-bscore=1 Print the behavioral score. -V1, --output-eval-number=1 Print the number of evaluations performed. -W1, --output-with-labels=1 Use named labels instead of position place-holders when printing candidates. For example, *("$temperature" "$entropy") instead of *($3 $4). This option is effective only when the data file con‐ tains labels in its header. -x1, --output-complexity=1 Print the complexity measure of the model, and the scoring penalty. Precision, recall, BEP, F_1 and prerec problem types The prerec, recall, bep, f_one and precision problem types are used to solve binary classification problems: problems where the goal is to sort inputs into one of two sets, while maximizing either the preci‐ sion, the sensitivity, or some other figure of merit of the test. In moses, precision and recall (sensitivity) are defined as usual. Precision is defined as the number of true positives, divided by the number of true positives plus the number of false positives. Classi‐ fiers with a high precision make very few mistakes identifying posi‐ tives: they have very few or no false positives. However, precise classifiers may completely fail to identify many, if not most posi‐ tives; they just don't make mistakes when they do identify them. Recall, also known as sensitivity, is defined as the number of true positives divided by the sum of the number of true positives and false negatives. Classifiers with high recall will identify most, or maybe even all positives; however, they may also identify many negatives, thus ruining precision. 
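The definitions of precision and recall above reduce to the usual confusion-matrix formulas; a minimal sketch (the counts are illustrative, not drawn from any moses run):

```python
def precision(tp, fp):
    # true positives / (true positives + false positives)
    return tp / (tp + fp)

def recall(tp, fn):
    # true positives / (true positives + false negatives)
    return tp / (tp + fn)

# A classifier with 90 true positives, 10 false positives and 30 false
# negatives is quite precise (0.9) but has weaker recall (0.75).
p = precision(90, 10)
r = recall(90, 30)
```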
A trivial way to maximize precision is to have a very low recall rate, and conversely, one can very easily have a good recall rate if one does not mind a poor precision. Thus a common goal is to maximize one, while holding the other to a minimum standard. One common problem is to find a classifier with the highest possible recall, while holding precision to a fixed minimum level; this may be accomplished with the -Hrecall option. Alternately, one may desire to maximize precision, while maintaining a minimum sensitivity; this may be accomplished with the -Hprerec option. Note that, although these two procedures seem superficially similar, they can often lead to dramatically different models of the input data. This is in part because, during early stages, moses will choose exemplars that maximize one or the other, thus causing dramatically different parts of the solution space to be searched. A common alternative to maximizing one or the other is to maximize either the arithmetic or the harmonic mean of the two. The arithmetic mean is sometimes called the "break-even point" or BEP; it is maximized when the -Hbep option is specified. The harmonic mean is known as the F_1 score; it is maximized when the -Hf_one option is specified. moses also provides a second way of maximizing precision, using the -Hpre option. This option searches for the test with the highest precision, while holding the 'activation' in a bounded range. The definition of 'activation' is idiosyncratic to moses; it is defined as the sum of true positives plus false positives: that is, it is the fraction of rows for which the trial combo program returned a positive answer, regardless of whether this was the right answer. Activation ranges from 0.0 to 1.0. It is never desirable to maximize activation; rather, most commonly, one wants to peg activation at exactly the fraction of positives in the training set. 
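The two figures of merit maximized by -Hbep and -Hf_one are simply the arithmetic and harmonic means of precision and recall, and can be sketched as:

```python
def bep(precision, recall):
    # break-even point: arithmetic mean, maximized by -Hbep
    return (precision + recall) / 2.0

def f_one(precision, recall):
    # F_1 score: harmonic mean, maximized by -Hf_one
    return 2.0 * precision * recall / (precision + recall)

# The harmonic mean is always dragged toward the weaker of the two scores:
b = bep(0.9, 0.75)     # 0.825
f = f_one(0.9, 0.75)   # about 0.818
```

Because the harmonic mean penalizes imbalance, maximizing F_1 discourages the degenerate classifiers described above (high precision with negligible recall, or the reverse).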
The minimum level to which a fixed component should be held may be specified with the -q or --min-rand-input option. Thus, for the -Hrecall problem, the -q flag is used to specify the minimum desired precision. Similarly, for the -Hprerec problem, the -q flag is used to specify the minimum desired recall. For the -Hpre problem, the -w or --max-rand-input option should be used to make sure the activation does not get too high. The -q and -w options also set lower and upper bounds for the BEP problem. When maximizing BEP, the system attempts to keep the absolute value of the difference between precision and recall less than 0.5. This maximum difference can be over-ridden with the -w option. Adherence to the bounds is enforced by means of a scoring penalty; combo programs that fail to lie within bounds are penalized. The harshness or hardness of the penalty may be specified by means of the -Q or --alpha option. Values much greater than one enforce a hard boundary; values much less than one make for a very soft boundary. Negative values are invalid. Contin options Options that affect the usage of continuously-valued variables. These options specify values that are used in a variety of different ways, depending on the chosen problem type. See the appropriate sections for more details. -Q hardness, --alpha=hardness The harshness or hardness of a limit that must be adhered to. Default 0.0 (limits disabled). -q num, --min-rand-input=num Minimum value for continuous variables. Default 0.0. -w num, --max-rand-input=num Maximum value for continuous variables. Default 1.0. -R num, --discretize-threshold=num Split a continuous domain into two pieces. This option may be used multiple times to split a continuous domain into multiple pieces: that is, n uses of this option will create n+1 domains. Demo options These options pertain to the various built-in demo and example problem modes. 
Such demo problems are commonly used to evaluate different machine learning algorithms, and are thus included here to facilitate such comparison, as well as to simplify moses regression and performance testing. -H type, --problem-type=type A number of demonstration problems are supported. In each case, the top results are printed to stdout, as a score, followed by a combo program. type may be one of: cp Combo program regression. The scoring function is based on the combo program specified with the -y flag. That is, the goal of the run is to deduce and learn the specified combo program. When specifying combo programs with continuous variables in them, be sure to use the -q, -w and -b flags to specify a range of input values to be sampled. In order to determine the fitness of any candidate, it must be compared to the specified combo program. The comparison is done at a variety of different input values. If the range of sampled input values is inappropriate, or if there are not enough sampled values, then the fitness function may select unexpected, undesired candidates. dj Disjunction problem. The scoring function rewards a result that is a boolean disjunction (or) of N boolean-valued variables. The resulting combo program should be or($1 $2 ...). The size of the problem may be specified with the -k option. mux Multiplex problem. The scoring function models a boolean digital multiplexer, that is, an electronic circuit where an "address" of n bits selects one and only one line, out of 2^n possible lines. Thus, for example, a single address bit can select one of two possible lines: the first, if it is false, and the second, if it is true. The -k option may be used to specify the value of n. The actual size of the problem, measured in bits, is n+2^n, and so increases exponentially fast. pa Even parity problem. The resulting combo program computes the parity of k bits, evaluating to true if the parity is even, else evaluating to false. 
The size of the problem may be specified with the -k option. sr Polynomial regression problem. Given the polynomial p(x)=x+x^2+x^3+...+x^k, this searches for the shortest program consisting of nested arithmetic operators to compute p(x), given x as a free variable. The arithmetic operators would be addition, subtraction, multiplication and division; exponentiation is not allowed in the solution. So, for example, using the -k2 option to specify the order-2 polynomial x+x^2, the shortest combo program is *(+(1 $1) $1) (that is, the solution is p(x)=x(x+1) in the usual arithmetical notation). -k size, --problem-size=size Specify the size of the problem. The interpretation of size depends on the particular problem type. -y prog, --combo-program=prog Specify the combo program to be learned, when used in combination with the -H cp option. Thus, for example, -H cp -y "and(\$1 \$2)" specifies that the two-input conjunction is to be learned. Keep in mind that $ is a reserved character in many shells, and thus must be escaped with a backslash in order to be passed to moses. Complexity Penalty The speed with which the search algorithm can find a reasonable solution is significantly affected by the complexity ratio specified with the -z or -p options. This section provides the theoretical underpinning for the meaning of these flags, and how they affect the algorithm. The complexity penalty has two slightly different interpretations, depending on whether one is considering learning a discretely-valued problem (i.e. boolean-valued) or a continuously-valued problem. The general structure of the argument is broadly similar for both cases; they are presented below. Similar arguments apply for classification problems (learning to classify data into one of N categories), and for precision maximization. Discrete case Let M be the model to be learned (the combo program). 
Let D be the data, assumed to be a table of n inputs i_k and one output o, with each row in the form: i_1 ... i_n o Here, i_k is the k'th input and o the output. In the below, we write o = D(x) where x=(i_1, ..., i_n) is an input data row. We want to assess the probability P(M|D) of the model M conditioned on the data D. In particular, we wish to maximize this, as it provides the fitness function for the model. According to Bayes' theorem, P(M|D) = P(D|M) * P(M) / P(D) Consider the log likelihood LL(M) of M knowing D. Since D is constant, we can ignore P(D), so: LL(M) = log(P(D|M)) + log(P(M)) Assume each output of M on row x has probability p of being wrong. So, P(D|M) = Prod_{x\in D} [p*(M(x) != D(x)) + (1-p)*(M(x) == D(x))] where D(x) is the observed result given input x. Then, log P(D|M) = Sum_{x\in D} log[p*(M(x) != D(x)) + (1-p)*(M(x) == D(x))] Let D = D_eq \cup D_ne where D_eq and D_ne are the sets D_eq = {x \in D | M(x) == D(x) } D_ne = {x \in D | M(x) != D(x) } Then log P(D|M) = Sum_{x\in D_ne} log(p) + Sum_{x\in D_eq} log(1-p) = |D_ne| log(p) + |D_eq| log(1-p) = |D_ne| log(p) + |D| log(1-p) - |D_ne| log(1-p) = |D_ne| log(p/(1-p)) + |D| log(1-p) Here, |D| is simply the size of the set D, etc. Assuming that p is small, i.e. much less than one, then, to second order in p: log(1-p) = -p - p^2/2 + O(p^3) So: log P(D|M) = |D_ne| log(p) - p (|D| - |D_ne|) + O(p^2) Next, assume P(M) is distributed according to Solomonoff's Universal Distribution, approximated by (for now) P(M) = |A|^-|M| = exp(-|M|*log(|A|)) where A is the alphabet of the model, |A| is the alphabet size, and |M| is the complexity of the model. Note that this distribution is identical to the Boltzmann distribution, for an inverse temperature of log(|A|). 
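The closed-form decomposition of log P(D|M) derived above can be checked numerically on a toy dataset (the rows below are purely illustrative):

```python
import math

p = 0.05                      # assumed probability of a wrong output
model_out = [0, 1, 1, 0, 1]   # M(x) on five rows
observed  = [0, 1, 0, 0, 1]   # D(x); the model disagrees on one row

# Direct sum: Sum_x log[p*(M(x) != D(x)) + (1-p)*(M(x) == D(x))]
direct = sum(math.log(p if m != d else 1.0 - p)
             for m, d in zip(model_out, observed))

# Closed form: |D_ne|*log(p/(1-p)) + |D|*log(1-p)
n_ne = sum(m != d for m, d in zip(model_out, observed))
closed = n_ne * math.log(p / (1.0 - p)) + len(observed) * math.log(1.0 - p)
```

The two quantities agree exactly; only the count of mismatched rows |D_ne| and the dataset size |D| matter, which is what makes the score cheap to compute.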
Putting it all together, the log-likelihood of M is: LL(M) = -|M|*log(|A|) + |D_ne| log(p/(1-p)) + |D| log(1-p) To get an expression usable for a scoring function, just bring out the |D_ne| by dividing by -log(p/(1-p)), to get score(M) = - [ LL(M) - |D| log(1-p) ] / log(p/(1-p)) = -|D_ne| + |M|*log|A| / log(p/(1-p)) = -|D_ne| - |M| |C_coef| Note that, since p < 0.5, log(p/(1-p)) is negative, and so the second term is negative. It can be understood as a complexity penalty. That is, we define the complexity penalty as complexity_penalty = |M| |C_coef| The complexity ratio, as set by the -z option, is given by complexity_ratio = 1 / |C_coef| By contrast, the -p option may be used to set p directly, as given in the formulas above. The value of |A| is computed internally, depending on the specific problem type (discrete vs. continuous, number of included/excluded operators, etc.) The complexity of each solution is also computed, using an ad-hoc complexity measure. Continuous case A similar argument to the above holds for the case of a continuously-valued observable. Let dP(..) be the notation for a probability density (or measure). As before, start with Bayes' theorem: dP(M|D) = dP(D|M) * P(M) / P(D) Since D is constant, one may ignore P(D), and write the log likelihood of M knowing D as: LL(M) = log(dP(D|M)) + log(P(M)) Assume the output of the model M on input x has a Gaussian distribution, of mean M(x) and variance V, so that dP(D|M), the probability density of the data D given the model M, is: dP(D|M) = Prod_{x\in D} (2*Pi*V)^(-1/2) exp(-(M(x)-D(x))^2/(2*V)) As before, assume a model distribution of P(M) = |A|^-|M| where |A| is the alphabet size and |M| the complexity of the model. 
After simplification, and dropping a constant term that does not depend on either the model complexity or the dataset itself (the dataset size is a constant), one then can deduce a scoring function: score(M) = -|M|*log(|A|)*2*V - Sum_{x\in D} (M(x)-D(x))^2 As before, |M|*log(|A|)*2*V can be interpreted as a scoring penalty. Alternately, one may interpret each row x as a feature; then the penalty term |M|*log(|A|)*2*V can be interpreted as an additional fea‐ ture that must be fit. Distributed processing MOSES provides two different styles of distributed processing for clus‐ ter computing systems. One style is to use MPI (as implemented in the OpenMPI/MPICH2 systems), the second is to use SSH. The first style is best suited for local area networks (LANs) and compute clusters. The second style allows operation over the open Internet, but is more prob‐ lematic, as it may require manual cleanup of log files and failed jobs. Because MPI is easy to install and manage, it is the recommended method for distributing moses operation across many machines. When moses is run in distributed fashion, one single node, the root node, maintains control over a pool of workers that execute on remote nodes. The root node maintains the set of candidate solutions, and assigns these to the workers for additional exploration, as the workers become free. The results are automatically collected by the root, and are automatically merged into the candidate population. When termina‐ tion criteria are met, processing will terminate on all nodes, and the root node will report the merged, best results. MPI Using MPI requires a moses binary with MPI support compiled in, with either the OpenMPI or the MPICH2 implementations. MPI support in MOSES is enabled with the --mpi=1 command-line flag. When this flag is spec‐ ified, moses may be run as usual in an MPI environment. 
Details will vary from one system configuration to another, but a typical usage might resemble the following: mpirun -n 15 --host‐ file mpd.hosts moses --mpi=1 -j 12 <other moses params> The above specifies that the run should be distributed over fifteen nodes, with the node names specified in the mpd.hosts file. The --mpi=1 flag indicates to moses that it is being run in an MPI environ‐ ment. The -j12 flag tells moses to use up to twelve threads on each node, whenever possible; this example assumes each node has 12 cores per CPU. To maximize CPU utilization, it seems best to specify two MPI instances per node. This is because not all parts of moses are parallelized, and some parts are subject to lock contention. Thus, running multiple instances per node seems to be an effective way to utilize all avail‐ able compute power on that node. It is almost always the case that moses RAM usage is relatively small, and so RAM availability is rarely a problem. The network utilization by moses is also very modest: the only network traffic is the reporting of candidate solutions, and so the network demands are typically in the range of 1 Megabit per second, and are thus easily supported on an Ethernet connection. The moses workload distributes in an 'embarrassingly parallel' fashion, and so there is no practical limit to scaling on small and medium compute clusters. When performing input data regression, be sure that the input data file is available on all nodes. This is most easily achieved by placing the input data file on a shared filesystem. Each instance of moses will write a log file. In order to avoid name collision on the log files, the process id (PID) will automatically be incorporated into the log-file name when the --mpi=1 option is specified. Log file creation can be disabled with the -lNONE option; however, this is not recom‐ mended, as it makes debugging and progress monitoring difficult. 
SSH The SSH style of distributed processing uses the ssh command to estab‐ lish communications and to control the pool of workers. Although moses provides job control when the system is running normally, it does not provide any mechanism for cleaning up after hung or failed jobs; this is outside the scope of the ssh implementation. The use of a job man‐ ager, such as LSF, is recommended. Remote machines are specified using the -j option, using the notation -j N:REMOTE_HOST. Here, N is the number of threads to use on the machine REMOTE_HOST. For instance, one can enter the options -j4 -j16:my_server.org (or -j16:user@my_server.org if one wishes to run the remote job under a different user name), meaning that 4 threads are allocated on the local machine and 16 threads are allocated on my_server.org. Password prompts will appear unless ssh-agent is being used. The moses executable must be on the remote machine(s), and located in a directory included in the PATH environment variable. Beware that a lot of log files are going to be generated on the remote machines. TODO Finish documenting these algo flags: --fs-scorer -M --diversity-pressure --diversity-exponent --diversity-normalize --diversity-dst --diversity-p-norm --diversity-dst2dp -R discretize target var These input flags: -G Interesting patterns flags: -J -K -U -X SEE ALSO The eval-table man page. More information is available at http://wiki.opencog.org/w/MOSES AUTHORS moses was written by Moshe Looks, Nil Geisweiller, and many others. This manual page is being written by Linas Vepstas. It is INCOMPLETE. 3.6.10 September 6, 2014 MOSES(1)
