Exporting MOSES Models Into the Atomspace
Theory
Often we need to pipe MOSES models into the AtomSpace to allow PLN to infer new knowledge.
A basic video tutorial on using MOSES, with Nil Geisweiller, is also available.
Thoughts on the Future of MOSES and the Atomspace
Currently, OpenCog’s MOSES “probabilistic evolutionary learning” component is implemented separately from the Atomspace (OpenCog’s core knowledge store), although programs learned via MOSES can be exported into the Atomspace. This works, but it is an imperfect design that persists for historical reasons. One option under consideration is described in "Reimplementing MOSES in the Atomspace Atop the Pattern Miner": the Pattern Miner infrastructure could be used, with only fairly modest modification, to implement a variant of MOSES that operates in the Atomspace. This would have various advantages, including enabling the hybridization of pattern mining and MOSES in various ways.
Practice
The Basic Mapping section is, as the name suggests, a basic hands-on guide to getting MOSES models into the Atomspace (including the files used). It is followed by a more detailed tutorial, 'Mapping MOSES models into the AtomSpace (Detailed)'.
Basic Mapping
There are several scripts to help export MOSES models into the Atomspace, though they may not all be required. First, one needs existing MOSES models to import into the Atomspace; a separate hands-on tutorial covers creating models in MOSES.
Files
All these files can be found in this git repo
export_models_and_fitness.sh
Used to convert models and their scores into Scheme code that can be loaded directly into the AtomSpace
export_models_and_fitness.sh (code) |
---|
#!/bin/bash
# Overview
# --------
# Little script to export in scheme format (readily dumpable into the
# AtomSpace) the models and their scores, given to a CSV, following
# Mike's format, 3 columns, the combo program, its recall (aka
# sensitivity) and its precision (aka positive predictive value).
#
# Usage
# -----
# Run it without argument to print the usage.
#
# Description
# -----------
# The model will be labeled FILENAME:moses_model_INDEX
#
# where FILENAME is the basename of the filename provided in argument,
# and INDEX is the row index of the model (starting from 0)
#
# The exported hypergraphs are
#
# 1. The model itself associated with its label (MODEL_PREDICATE_NAME)
#
# EquivalenceLink <1, 1>
# PredicateNode MODEL_PREDICATE_NAME
# MODEL
#
# 2. The label associated with its accuracy
#
# EvaluationLink <accuracy>
# PredicateNode "accuracy"
# ListLink
# PredicateNode PREDICATE_MODEL_NAME
# PredicateNode TARGET_FEATURE_NAME
#
# 3. The label associated with its balanced accuracy [REMOVED]
#
# EvaluationLink <balanced_accuracy>
# PredicateNode "balancedAccuracy"
# ListLink
# PredicateNode PREDICATE_MODEL_NAME
# PredicateNode TARGET_FEATURE_NAME
#
# 4. The label associated with its precision [REMOVED]
#
# ImplicationLink <precision>
# PredicateNode MODEL_PREDICATE_NAME
# PredicateNode TARGET_FEATURE_NAME
#
# 5. The label associated with its recall
#
# ImplicationLink <recall>
# PredicateNode TARGET_FEATURE_NAME
# PredicateNode MODEL_PREDICATE_NAME
set -u # raise error on unknown variable read
# set -x # debug trace
####################
# Source common.sh #
####################
PRG_PATH="$(readlink -f "$0")"
PRG_DIR="$(dirname "$PRG_PATH")"
. "$PRG_DIR/common.sh"
####################
# Program argument #
####################
if [[ $# == 0 || $# -gt 3 ]]; then
echo "Usage: $0 MODEL_CSV_FILE [-o OUTPUT_FILE]"
echo "Example: $0 chr10_moses.5x10.csv -o chr10_moses.5x10.scm"
exit 1
fi
readonly MODEL_CSV_FILE="$1"
readonly BASE_MODEL_CSV_FILE="$(basename "$MODEL_CSV_FILE")"
shift
OUTPUT_FILE="/dev/stdout"
while getopts "o:" opt; do
case $opt in
o) OUTPUT_FILE="$OPTARG"
;;
esac
done
#############
# Functions #
#############
# Given
#
# 1. a model predicate name
#
# 2. a combo model
#
# return a scheme code defining the equivalence between the model name
# and the model:
#
# EquivalenceLink <1, 1>
# PredicateNode MODEL_PREDICATE_NAME
# MODEL
model_name_def() {
local name="$1"
local model="$2"
cat <<EOF
(EquivalenceLink (stv 1.0 1.0)
(PredicateNode "${name}")
$model)
EOF
}
# Given
#
# 1. a model predicate name
#
# 2. a target feature name
#
# 3. an accuracy
#
# return a scheme code relating the model predicate with the accuracy:
#
# EvaluationLink <accuracy>
# PredicateNode "accuracy"
# ListLink
# PredicateNode PREDICATE_MODEL_NAME
# PredicateNode TARGET_FEATURE_NAME
model_accuracy_def() {
local name="$1"
local target="$2"
local accuracy="$3"
cat <<EOF
(EvaluationLink (stv $accuracy 1)
(PredicateNode "accuracy")
(ListLink
(PredicateNode "$name")
(PredicateNode "$target")))
EOF
}
# Like above but for balanced accuracy
model_balanced_accuracy_def() {
local name="$1"
local target="$2"
local accuracy="$3"
cat <<EOF
(EvaluationLink (stv $accuracy 1)
(PredicateNode "balancedAccuracy")
(ListLink
(PredicateNode "$name")
(PredicateNode "$target")))
EOF
}
# Given
#
# 1. a model predicate name
#
# 2. a target feature name
#
# 3. a precision
#
# return a scheme code relating the model predicate with its precision:
#
# ImplicationLink <precision>
# PredicateNode PREDICATE_MODEL_NAME
# PredicateNode TARGET_FEATURE_NAME
model_precision_def() {
local name="$1"
local target="$2"
local precision="$3"
cat <<EOF
(ImplicationLink (stv $precision 1)
(PredicateNode "$name")
(PredicateNode "$target"))
EOF
}
# Given
#
# 1. a model predicate name
#
# 2. a target feature name
#
# 3. a recall
#
# return a scheme code relating the model predicate with its recall:
#
# ImplicationLink <recall>
# PredicateNode TARGET_FEATURE_NAME
# PredicateNode PREDICATE_MODEL_NAME
model_recall_def() {
local name="$1"
local target="$2"
local recall="$3"
cat <<EOF
(ImplicationLink (stv $recall 1)
(PredicateNode "$target")
(PredicateNode "$name"))
EOF
}
########
# Main #
########
# Count the number of models and how to pad their unique numeric ID
rows=$(nrows "$MODEL_CSV_FILE")
# print() with parentheses works under both Python 2 and Python 3
npads=$(python -c "import math; print(int(math.log10($rows)) + 1)")
# Check that the header is correct (if not maybe the file format has
# changed)
header=$(head -n 1 "$MODEL_CSV_FILE")
expected_header='"","Sensitivity","Pos Pred Value"'
if [[ "$header" != "$expected_header" ]]; then
fatalError "Wrong header format: expect '$expected_header' but got '$header'"
fi
# Create a temporary pipe and save the scheme code
tmp_pipe=$(mktemp -u)
mkfifo "$tmp_pipe"
OLDIFS="$IFS"
IFS=","
i=0 # used to give unique names to models
while read -r combo recall precision; do
# Output model name predicate associated with model
model_name="${BASE_MODEL_CSV_FILE}:moses_model_$(pad $i $npads)"
scm_model="$(combo-fmt-converter -c "$combo" -f scheme)"
echo "$(model_name_def "$model_name" "$scm_model")"
# Output model precision
echo "$(model_precision_def "$model_name" aging $precision)"
# Output model recall
echo "$(model_recall_def "$model_name" aging $recall)"
((++i))
done < <(tail -n +2 "$MODEL_CSV_FILE") > "$OUTPUT_FILE"
IFS="$OLDIFS" |
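To make the output of the script concrete, here is a minimal sketch of the three hypergraphs emitted for a single CSV row. The helper `emit_model_atoms`, the gene names, and the score values are hypothetical and used only for illustration; the model body is assumed to have already been converted to Scheme (the real script delegates that step to `combo-fmt-converter`).

```bash
#!/bin/bash
# Sketch only: emit_model_atoms is a hypothetical helper, not part of the
# scripts above. Arguments: model name, model body (already in Scheme),
# target feature name, recall, precision.
emit_model_atoms() {
    local name="$1" body="$2" target="$3" recall="$4" precision="$5"
    cat <<EOF
(EquivalenceLink (stv 1.0 1.0)
    (PredicateNode "$name")
    $body)
(ImplicationLink (stv $precision 1)
    (PredicateNode "$name")
    (PredicateNode "$target"))
(ImplicationLink (stv $recall 1)
    (PredicateNode "$target")
    (PredicateNode "$name"))
EOF
}

# Illustrative call with made-up gene names and scores
emit_model_atoms "chr10_moses.5x10.csv:moses_model_0" \
    '(AndLink (PredicateNode "GENE_A") (NotLink (PredicateNode "GENE_B")))' \
    "aging" 0.8 0.75
```

Note the direction of the two ImplicationLinks: precision goes from the model to the target, recall from the target to the model.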
relate_features_and_genes.sh
Used to generate scheme code to relate MOSES features and their corresponding genes
relate_features_and_genes.sh (code) |
---|
#!/bin/bash
# Script that takes a feature CSV file and generates the corresponding
# hypergraphs relating them to GeneNodes. That is, for each feature of
# name <GENE_NAME> produce:
#
# EquivalenceLink <1, 1>
# PredicateNode <GENE_NAME>
# LambdaLink
# VariableNode $X
# EvaluationLink
# PredicateNode "overexpressed"
# ListLink
# GeneNode <GENE_NAME>
# $X
set -u
# set -x
####################
# Source common.sh #
####################
PRG_PATH="$(readlink -f "$0")"
PRG_DIR="$(dirname "$PRG_PATH")"
. "$PRG_DIR/common.sh"
####################
# Program argument #
####################
if [[ $# == 0 || $# -gt 3 ]]; then
echo "Usage: $0 FEATURE_CSV_FILE [-o OUTPUT_FILE]"
echo "Example: $0 oldvscontrolFeatures.csv -o oldvscontrol-features-and-genes.scm"
exit 1
fi
readonly FEATURE_CSV_FILE="$1"
shift
OUTPUT_FILE="/dev/stdout"
while getopts "o:" opt; do
case $opt in
o) OUTPUT_FILE="$OPTARG"
;;
esac
done
########
# Main #
########
# Check that the header is correct (if not maybe the file format has
# changed)
header=$(head -n 1 "$FEATURE_CSV_FILE")
expected_header='"feature","Freq","level"'
if [[ "$header" != "$expected_header" ]]; then
fatalError "Wrong header format: expect '$expected_header' but got '$header'"
fi
OLDIFS="$IFS"
IFS=","
while read -r feature freq level; do
cat <<EOF
(EquivalenceLink (stv 1.0 1.0)
(PredicateNode $feature)
(LambdaLink
(VariableNode "\$X")
(EvaluationLink
(PredicateNode "overexpressed")
(ListLink
(GeneNode $feature)
(VariableNode "\$X")))))
EOF
done < <(tail -n +2 "$FEATURE_CSV_FILE") > "$OUTPUT_FILE"
IFS="$OLDIFS" |
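As an illustration of the feature-to-gene mapping, the following sketch runs the same idea on a tiny in-line CSV. The function name `feature_to_scheme` and the sample rows are hypothetical; unlike the script above, this sketch strips the CSV quotes and re-adds them in the template, which is a design choice made here for readability.

```bash
#!/bin/bash
# Sketch only: feature_to_scheme is a hypothetical helper that prints the
# EquivalenceLink relating a feature PredicateNode to its GeneNode.
feature_to_scheme() {
    local feature="$1"
    cat <<EOF
(EquivalenceLink (stv 1.0 1.0)
    (PredicateNode "$feature")
    (LambdaLink
        (VariableNode "\$X")
        (EvaluationLink
            (PredicateNode "overexpressed")
            (ListLink
                (GeneNode "$feature")
                (VariableNode "\$X")))))
EOF
}

# Made-up sample input: the expected header followed by two feature rows
printf '%s\n' '"feature","Freq","level"' '"GENE_A",12,1' '"GENE_B",7,2' |
tail -n +2 |
while IFS=, read -r feature freq level; do
    feature="${feature//\"/}"   # drop the surrounding CSV quotes
    feature_to_scheme "$feature"
done
```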
test.sh
(Obsolete) A script to experiment with MOSES learning and PLN reasoning. You may need to configure settings.sh (e.g. to set your OpenCog path). Usage is as follows:
mkdir <MY_EXP>
cd <MY_EXP>
../scripts/test.sh ../scripts/settings.sh
test.sh (code) |
---|
#!/bin/bash
# Script test to attempt to load MOSES models in scheme format to the
# AtomSpace so that PLN can then reason on them.
#
# It performs the following
#
# 1. Launch an OpenCog server
#
# 2. Load background knowledge from a Scheme file (like feature
# definitions)
#
# 3. Split dataset into k-fold train and test sets
#
# 4. Run MOSES on some problem
#
# 5. Parse the output and pipe it in OpenCog
#
# 6. Use PLN to perform reasoning, etc.
set -u
# set -x
if [[ $# != 1 ]]; then
echo "Usage: $0 SETTINGS_FILE"
exit 1
fi
#############
# Constants #
#############
PRG_PATH="$(readlink -f "$0")"
PRG_DIR="$(dirname "$PRG_PATH")"
ROOT_DIR="$(dirname "$PRG_DIR")"
SET_PATH="$1"
SET_BASENAME="$(basename "$SET_PATH")"
#############
# Functions #
#############
# Given an error message, display that error on stderr and exit
fatalError() {
echo "[ERROR] $@" 1>&2
exit 1
}
warnEcho() {
echo "[WARN] $@"
}
infoEcho() {
echo "[INFO] $@"
}
# Convert human readable integer into machine full integer. For
# instance $(hr2i 100K) returns 100000, $(hr2i 10M) returns 10000000.
hr2i() {
local val=$1
local val=${val/M/000K}
local val=${val/K/000}
echo $val
}
# Pad $1 symbol with up to $2 0s
pad() {
local pad_expression="%0${2}d"
printf "$pad_expression" "$1"
}
# Split the data into train and test, renaming FILENAME.csv by
# FILENAME_train.csv and FILENAME_test.csv given
#
# 1. Dataset csv file with header
#
# 2. A ratio = train sample size / total size
#
# 3. A random seed
train_test_split() {
local DATAFILE="$1"
local RATIO="$2"
# Reset random seed
RANDOM="$3"
# Define train and test outputs
local DATAFILE_TRAIN=${DATAFILE//.csv/_train.csv}
local DATAFILE_TEST=${DATAFILE//.csv/_test.csv}
# Copy header into train and test files
head -n 1 "$DATAFILE" > "${DATAFILE_TRAIN}"
head -n 1 "$DATAFILE" > "${DATAFILE_TEST}"
# Subsample
while read line; do
if [[ $(bc <<< "$RATIO * 32767 >= $RANDOM") == 1 ]]; then
echo "$line" >> "${DATAFILE_TRAIN}"
else
echo "$line" >> "${DATAFILE_TEST}"
fi
done < <(tail -n +2 "$DATAFILE")
}
# Given
#
# 1. a model name
#
# 2. a combo model
#
# return a scheme code defining the equivalence between the model name
# and the model.
model_def() {
name="$1"
model="$2"
echo "(EquivalenceLink (stv 1.0 1.0) (PredicateNode \"${name}\") $model)"
}
########
# Main #
########
# 0. Copy in experiment dir and source settings
infoEcho "Copy $SET_PATH to current directory"
cp "$SET_PATH" .
. "$SET_BASENAME"
# 1. Launch an OpenCog server
infoEcho "Launch cogserver"
cd "$opencog_repo_path/scripts/"
./run_cogserver.sh "$build_dir_name" &
cd -
sleep 5
# 2. Load background knowledge
infoEcho "Load background knowledge"
if [[ "$scheme_file_path" =~ ^[^/] ]]; then # It is relative
scheme_file_path="$ROOT_DIR/$scheme_file_path"
fi
(echo "scm"; cat "$scheme_file_path") \
| "$opencog_repo_path/scripts/run_telnet_cogserver.sh"
# 3. Create train and test data
infoEcho "Create train and test data"
if [[ "$data_path" =~ ^[^/] ]]; then # It is relative
data_path="$ROOT_DIR/$data_path"
fi
cp "$data_path" .
data_basename="$(basename "$data_path")"
train_test_split "$data_basename" "$train_ratio" "$init_seed"
data_basename_train=${data_basename//.csv/_train.csv}
data_basename_test=${data_basename//.csv/_test.csv}
# 4. Run MOSES
infoEcho "Run MOSES"
moses_output_file=results.moses
. "$PRG_DIR/moses.sh"
# 5. Parse MOSES output and pipe it in OpenCog
infoEcho "Load MOSES models into the AtomSpace"
(echo "scm";
i=0
while read line; do
moses_model_name="moses_$(pad $i 3)"
echo "$(model_def "$moses_model_name" "$line")"
((++i))
done < "$moses_output_file"
) | "$opencog_repo_path/scripts/run_telnet_cogserver.sh"
# 6. Use PLN to perform reasoning, etc.
# TODO
# 7. Kill cogserver |
Mapping MOSES models into the AtomSpace (Detailed)
Here are more detailed suggestions about how to go about mapping models, their scores, and their features into AtomSpace hypergraphs.
Models
Models are exported in the following format:
EquivalenceLink <1, 1>
    PredicateNode <MODEL_PREDICATE_NAME>
    <MODEL_BODY>
Features
We need to relate GeneNodes, used in the GO description, to PredicateNodes, used in the MOSES models. For that I suggest using the predicate "overexpressed" as follows:
EquivalenceLink <1, 1>
    PredicateNode <GENE_NAME>
    LambdaLink
        VariableNode "$X"
        EvaluationLink
            PredicateNode "overexpressed"
            ListLink
                GeneNode <GENE_NAME>
                VariableNode "$X"
which says that PredicateNode <GENE_NAME> over individual $X is equivalent to "GeneNode <GENE_NAME> is overexpressed in individual $X".
Fitnesses
Here we will discuss two fitnesses (as used by Mike): accuracy (1 - score, in Mike's terminology) and precision. We will then discuss confidence.
Accuracy
We define an Accuracy predicate that takes a model and a dataset (or target feature) as arguments. The model, $M, is itself a predicate that evaluates to 1 (the confidence is set aside for now) when the individual $X is classified positively, and 0 when it is classified negatively.
Similarly, the target feature, $D, is also a predicate that evaluates to 1 when the individual $X has its target feature active, and 0 otherwise.
EquivalenceLink <1, 1>
    LambdaLink
        VariableList $M $D
        EvaluationLink
            PredicateNode "accuracy"
            ListLink
                $M
                $D
    LambdaLink
        VariableList $M $D
        AverageLink
            $X
            EquivalenceLink
                ExecutionOutputLink
                    GetStrength
                    EvaluationLink
                        $M
                        $X
                ExecutionOutputLink
                    GetStrength
                    EvaluationLink
                        $D
                        $X
It turns out that the truth value on the AverageLink matches the accuracy, given $M and $D. Indeed, the accuracy is the fraction of individuals for which the model is correct with respect to the dataset. With this representation, given the dataset and the model, PLN can directly build the Accuracy predicate.
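The averaging described above can be checked by hand on a small table. The sketch below assumes a hypothetical CSV layout (individual, model output, target value, each 0 or 1), computes the fraction of individuals on which model and target agree, and writes the result into the accuracy EvaluationLink; the model and target names are made up.

```bash
#!/bin/bash
# Sketch only: accuracy as the average agreement between model and target.
# Assumed (hypothetical) input columns: individual, M(x), D(x).
accuracy() {
    LC_ALL=C awk -F, 'NR > 1 { n++; if ($2 == $3) agree++ }
        END { if (n > 0) printf "%.2f\n", agree / n }' "$1"
}

data="$(mktemp)"
cat > "$data" <<EOF
individual,model,target
x1,1,1
x2,1,0
x3,0,0
x4,0,0
EOF
acc="$(accuracy "$data")"   # 3 of 4 individuals agree
rm -f "$data"

cat <<EOF
(EvaluationLink (stv $acc 1)
    (PredicateNode "accuracy")
    (ListLink
        (PredicateNode "moses_model_0")
        (PredicateNode "aging")))
EOF
```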
In the absence of a dataset, and given the accuracy of each model, one may directly write down the Accuracy predicate for each model and target feature:
EvaluationLink <model accuracy>
    PredicateNode "accuracy"
    ListLink
        PredicateNode <MODEL>
        PredicateNode <TARGET FEATURE>
Precision
The cool thing about precision is that it translates directly into an Implication strength - that is:
ImplicationLink <TV.s = model precision>
    PredicateNode <MODEL>
    PredicateNode <TARGET FEATURE>
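Indeed, according to PLN (assuming all individuals are equiprobable), the strength of the ImplicationLink is
\[ s = \frac{\sum_x \min(M(x), D(x))}{\sum_x M(x)} \]
where \(M\) corresponds to the predicate of a model, \(D\) to the predicate of the target feature, and \(x\) runs over the individuals of the dataset.
This indeed corresponds to the precision:
\[ \text{precision} = \frac{TP}{TP + FP} \]
as \(\sum_x M(x)\) is the number of positively classified individuals, \(TP + FP\), and \(\sum_x \min(M(x), D(x))\) is the number of correctly classified individuals among them, \(TP\).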
Recall
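Similarly, recall translates directly into an Implication strength:
ImplicationLink <TV.s = model recall>
    PredicateNode <TARGET FEATURE>
    PredicateNode <MODEL>
given that:
\[ \text{recall} = \frac{TP}{TP + FN} = \frac{\sum_x \min(D(x), M(x))}{\sum_x D(x)} \]
where \(M\) and \(D\) are the model and target-feature predicates, and \(x\) runs over the individuals of the dataset.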
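As a worked check of the two implication strengths, the sketch below computes the sum of min(M(x), D(x)) and divides it by the sum of M(x) for precision and by the sum of D(x) for recall. The CSV layout, the sample rows, and the model and target names are assumptions made for this illustration only.

```bash
#!/bin/bash
# Sketch only: precision and recall as PLN implication strengths.
# Assumed (hypothetical) input columns: individual, M(x), D(x), each 0 or 1.
precision_recall() {
    LC_ALL=C awk -F, 'NR > 1 { m += $2; d += $3; both += ($2 < $3 ? $2 : $3) }
        END { if (m > 0 && d > 0) printf "%.2f %.2f\n", both / m, both / d }' "$1"
}

data="$(mktemp)"
cat > "$data" <<EOF
individual,model,target
x1,1,1
x2,1,1
x3,1,0
x4,1,0
x5,0,1
EOF
# sum M = 4, sum D = 3, sum min(M, D) = 2
read -r precision recall <<< "$(precision_recall "$data")"
rm -f "$data"

cat <<EOF
(ImplicationLink (stv $precision 1)
    (PredicateNode "moses_model_0")
    (PredicateNode "aging"))
(ImplicationLink (stv $recall 1)
    (PredicateNode "aging")
    (PredicateNode "moses_model_0"))
EOF
```

Here the model fires on four individuals, two of which are true positives, so precision is 2/4; the target is active on three individuals, so recall is 2/3.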
Confidence
The confidence can be:
\[ c = \frac{n}{n + k} \]
where \(n\) is the number of individuals, and \(k\) is a parameter.
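A small numeric sketch may help here. It assumes the usual PLN count-to-confidence mapping, c = n / (n + k), with n the number of individuals and k the parameter mentioned above; the helper name `confidence`, the strength 0.75, and the values n = 200 and k = 800 are chosen only for illustration.

```bash
#!/bin/bash
# Sketch only: confidence from a count, assuming c = n / (n + k).
# confidence is a hypothetical helper; n and k are passed as arguments.
confidence() {
    LC_ALL=C awk -v n="$1" -v k="$2" 'BEGIN { printf "%.3f\n", n / (n + k) }'
}

conf="$(confidence 200 800)"   # 200 / (200 + 800)

# Use the computed confidence in place of the fixed 1 in the truth value
cat <<EOF
(ImplicationLink (stv 0.75 $conf)
    (PredicateNode "moses_model_0")
    (PredicateNode "aging"))
EOF
```

With a fixed k, the confidence grows toward 1 as the number of individuals increases, so models scored on larger datasets carry more weight in later inference.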
Quiz
Notes
Maintained by: Nil
Priority: Medium
This tutorial was adapted from https://github.com/opencog/agi-bio/blob/master/moses-scripts/README.md and https://github.com/opencog/agi-bio/blob/master/moses-scripts/export_models.md
Bunch of scripts to run MOSES, pipe the models into OpenCog and apply PLN to infer new knowledge: