Exporting MOSES Models Into the Atomspace

From OpenCog

Theory

Often we need to pipe MOSES models into the AtomSpace to allow PLN to infer new knowledge.

A basic video tutorial on using MOSES with Nil Geisweiller is also available.

Thoughts on the Future of MOSES and the Atomspace

Currently, OpenCog’s MOSES “probabilistic evolutionary learning” component is implemented separately from the Atomspace (OpenCog’s core knowledge store), although one can export programs learned via MOSES into the Atomspace. This is acceptable but is an imperfect design that persists for historical reasons. OpenCog is considering Reimplementing MOSES in the Atomspace Atop the Pattern Miner, in which case one could use the Pattern Miner infrastructure, with only fairly modest modification, to implement a variety of MOSES that operates in the Atomspace. This would have various advantages, including enabling the hybridization of pattern mining and MOSES in various ways.

Practice

The Basic Mapping section is, as the name suggests, a basic hands-on guide to getting MOSES models into the Atomspace (including the files used). It is followed by a more detailed tutorial called 'Mapping MOSES models into the AtomSpace (Detailed)'.

Basic Mapping

[Figure: MOSES models to Atomspace]
There are several scripts to help export MOSES models into the Atomspace, though they may not all be required.

First, one needs existing MOSES models to import into the Atomspace. A hands-on tutorial on creating models in MOSES is available separately.

Files

All these files can be found in the agi-bio git repository, under moses-scripts.

export_models_and_fitness.sh

Used to convert models and their scores into Scheme code readily loadable into the AtomSpace

export_models_and_fitness.sh (code)  
#!/bin/bash

# Overview
# --------
# Little script to export in scheme format (readily dumpable into the
# AtomSpace) the models and their scores, given a CSV following
# Mike's format: 3 columns, the combo program, its recall (aka
# sensitivity) and its precision (aka positive predictive value).
#
# Usage
# -----
# Run it without argument to print the usage.
#
# Description
# -----------
# The model will be labeled FILENAME:moses_model_INDEX
#
# where FILENAME is the basename of the filename provided in argument,
# and INDEX is the row index of the model (starting at 0)
#
# The exported hypergraphs are
#
# 1. The model itself associated with its label (MODEL_PREDICATE_NAME)
#
# EquivalenceLink <1, 1>
#     PredicateNode MODEL_PREDICATE_NAME
#     MODEL
#
# 2. The label associated with its accuracy
#
# EvaluationLink <accuracy>
#     PredicateNode "accuracy"
#     ListLink
#         PredicateNode PREDICATE_MODEL_NAME
#         PredicateNode TARGET_FEATURE_NAME
#
# 3. The label associated with its balanced accuracy [REMOVED]
#
# EvaluationLink <balanced_accuracy>
#     PredicateNode "balancedAccuracy"
#     ListLink
#         PredicateNode PREDICATE_MODEL_NAME
#         PredicateNode TARGET_FEATURE_NAME
#
# 4. The label associated with its precision [REMOVED]
#
# ImplicationLink <precision>
#     PredicateNode MODEL_PREDICATE_NAME
#     PredicateNode TARGET_FEATURE_NAME
#
# 5. The label associated with its recall
#
# ImplicationLink <recall>
#     PredicateNode TARGET_FEATURE_NAME
#     PredicateNode MODEL_PREDICATE_NAME

set -u                          # raise error on unknown variable read
# set -x                          # debug trace

####################
# Source common.sh #
####################
PRG_PATH="$(readlink -f "$0")"
PRG_DIR="$(dirname "$PRG_PATH")"
. "$PRG_DIR/common.sh"

####################
# Program argument #
####################
if [[ $# == 0 || $# -gt 3 ]]; then
    echo "Usage: $0 MODEL_CSV_FILE [-o OUTPUT_FILE]"
    echo "Example: $0 chr10_moses.5x10.csv -o chr10_moses.5x10.scm"
    exit 1
fi

readonly MODEL_CSV_FILE="$1"
readonly BASE_MODEL_CSV_FILE="$(basename "$MODEL_CSV_FILE")"
shift
OUTPUT_FILE="/dev/stdout"
while getopts "o:" opt; do
    case $opt in
        o) OUTPUT_FILE="$OPTARG"
            ;;
    esac
done

#############
# Functions #
#############

# Given
#
# 1. a model predicate name
#
# 2. a combo model
#
# return a scheme code defining the equivalence between the model name
# and the model:
#
# EquivalenceLink <1, 1>
#     PredicateNode MODEL_PREDICATE_NAME
#     MODEL
model_name_def() {
    local name="$1"
    local model="$2"
    cat <<EOF
(EquivalenceLink (stv 1.0 1.0)
    (PredicateNode "${name}")
    $model)
EOF
}

# Given
#
# 1. a model predicate name
#
# 2. a target feature name
#
# 3. an accuracy
#
# return a scheme code relating the model predicate with the accuracy:
#
# EvaluationLink <accuracy>
#     PredicateNode "accuracy"
#     ListLink
#         PredicateNode PREDICATE_MODEL_NAME
#         PredicateNode TARGET_FEATURE_NAME
model_accuracy_def() {
    local name="$1"
    local target="$2"
    local accuracy="$3"
    cat <<EOF
(EvaluationLink (stv $accuracy 1)
    (PredicateNode "accuracy")
    (ListLink
        (PredicateNode "$name")
        (PredicateNode "$target")))
EOF
}

# Like above but for balanced accuracy
model_balanced_accuracy_def() {
    local name="$1"
    local target="$2"
    local accuracy="$3"
    cat <<EOF
(EvaluationLink (stv $accuracy 1)
    (PredicateNode "balancedAccuracy")
    (ListLink
        (PredicateNode "$name")
        (PredicateNode "$target")))
EOF
}

# Given
#
# 1. a model predicate name
#
# 2. a target feature name
#
# 3. a precision
#
# return a scheme code relating the model predicate with its precision:
#
# ImplicationLink <precision>
#     PredicateNode PREDICATE_MODEL_NAME
#     PredicateNode TARGET_FEATURE_NAME
model_precision_def() {
    local name="$1"
    local target="$2"
    local precision="$3"
    cat <<EOF
(ImplicationLink (stv $precision 1)
    (PredicateNode "$name")
    (PredicateNode "$target"))
EOF
}

# Given
#
# 1. a model predicate name
#
# 2. a target feature name
#
# 3. a recall
#
# return a scheme code relating the model predicate with its recall:
#
# ImplicationLink <recall>
#     PredicateNode TARGET_FEATURE_NAME
#     PredicateNode PREDICATE_MODEL_NAME
model_recall_def() {
    local name="$1"
    local target="$2"
    local recall="$3"
    cat <<EOF
(ImplicationLink (stv $recall 1)
    (PredicateNode "$target")
    (PredicateNode "$name"))
EOF
}

########
# Main #
########

# Count the number of models and how to pad their unique numeric ID
rows=$(nrows "$MODEL_CSV_FILE")
npads=$(python -c "import math; print(int(math.log($rows, 10) + 1))")

# Check that the header is correct (if not maybe the file format has
# changed)
header=$(head -n 1 "$MODEL_CSV_FILE")
expected_header='"","Sensitivity","Pos Pred Value"'
if [[ "$header" != "$expected_header" ]]; then
    fatalError "Wrong header format: expect '$expected_header' but got '$header'"
fi

# Create a temporary pipe and save the scheme code
tmp_pipe=$(mktemp -u)
mkfifo "$tmp_pipe"

OLDIFS="$IFS"
IFS=","
i=0                             # used to give unique names to models
while read combo recall precision; do
    # Output model name predicate associated with model
    model_name="${BASE_MODEL_CSV_FILE}:moses_model_$(pad $i $npads)"
    scm_model="$(combo-fmt-converter -c "$combo" -f scheme)"
    echo "$(model_name_def "$model_name" "$scm_model")"

    # Output model precision
    echo "$(model_precision_def "$model_name" aging $precision)"

    # Output model recall
    echo "$(model_recall_def "$model_name" aging $recall)"

    ((++i))
done < <(tail -n +2 "$MODEL_CSV_FILE") > "$OUTPUT_FILE"
IFS="$OLDIFS"
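
To make the expected input concrete, here is a minimal sketch of how the three-column CSV is walked and how each model gets its label. The two data rows, the combo expressions in them, and the file name `sample.csv` are made up for illustration. The index is left unpadded and the call to combo-fmt-converter is omitted so that the snippet runs on its own.

```bash
#!/bin/bash
# Sketch only: walk a tiny in-line CSV in the 3-column layout the script
# expects and print the label and scores of each model. The two rows are
# made-up sample data, and "sample.csv" is a hypothetical file name.

csv='"","Sensitivity","Pos Pred Value"
"and($A $B)",0.6,0.8
"or($C !$D)",0.4,0.9'

label_models() {
    local base="$1" i=0 combo recall precision
    # Skip the header line, then split each remaining row on commas
    while IFS="," read -r combo recall precision; do
        printf '%s:moses_model_%d recall=%s precision=%s\n' \
            "$base" "$i" "$recall" "$precision"
        ((++i))
    done < <(tail -n +2 <<< "$csv")
}

label_models sample.csv
```

Each output line pairs a label such as sample.csv:moses_model_0 with the recall and precision read from the corresponding row.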

relate_features_and_genes.sh

Used to generate scheme code to relate MOSES features and their corresponding genes

relate_features_and_genes.sh (code)  
#!/bin/bash

# Script that takes a feature CSV file and generates the corresponding
# hypergraphs relating them to GeneNodes.  That is, for each feature of
# name <GENE_NAME>, produce:
#
# EquivalenceLink <1, 1>
#     PredicateNode <GENE_NAME>
#     LambdaLink
#         VariableNode $X
#         EvaluationLink
#             PredicateNode "overexpressed"
#             ListLink
#                 GeneNode <GENE_NAME>
#                 $X

set -u
# set -x

####################
# Source common.sh #
####################
PRG_PATH="$(readlink -f "$0")"
PRG_DIR="$(dirname "$PRG_PATH")"
. "$PRG_DIR/common.sh"

####################
# Program argument #
####################
if [[ $# == 0 || $# -gt 3 ]]; then
    echo "Usage: $0 FEATURE_CSV_FILE [-o OUTPUT_FILE]"
    echo "Example: $0 oldvscontrolFeatures.csv -o oldvscontrol-features-and-genes.scm"
    exit 1
fi

readonly FEATURE_CSV_FILE="$1"
shift
OUTPUT_FILE="/dev/stdout"
while getopts "o:" opt; do
    case $opt in
        o) OUTPUT_FILE="$OPTARG"
            ;;
    esac
done

########
# Main #
########

# Check that the header is correct (if not maybe the file format has
# changed)
header=$(head -n 1 "$FEATURE_CSV_FILE")
expected_header='"feature","Freq","level"'
if [[ "$header" != "$expected_header" ]]; then
    fatalError "Wrong header format: expect '$expected_header' but got '$header'"
fi

OLDIFS="$IFS"
IFS=","
while read feature freq level; do
    cat <<EOF
(EquivalenceLink
    (PredicateNode $feature)
    (LambdaLink
        (VariableNode "\$X")
        (EvaluationLink
            (PredicateNode "overexpressed")
            (ListLink
                (GeneNode $feature)
                (VariableNode "\$X")))))
EOF
done < <(tail -n +2 "$FEATURE_CSV_FILE") > "$OUTPUT_FILE"
IFS="$OLDIFS"
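
Since both scripts build Scheme text from here-documents, a cheap sanity check before loading the result into the AtomSpace is to compare the number of opening and closing parentheses. The helper below is only a sketch: `paren_balance` is a hypothetical name, not part of the repository, and it also counts parentheses that appear inside quoted strings.

```bash
#!/bin/bash
# Sketch only: paren_balance is a hypothetical helper, not part of the
# scripts above. It prints (#opening - #closing) parentheses, so 0 means
# the counts match. Parentheses inside quoted strings are counted too.
paren_balance() {
    local text="$1" opens closes
    opens=$(printf '%s' "$text" | tr -cd '(' | wc -c)
    closes=$(printf '%s' "$text" | tr -cd ')' | wc -c)
    echo $(( opens - closes ))
}

# A well-formed link, then the same link with its last parenthesis dropped
paren_balance '(ImplicationLink (PredicateNode "a") (PredicateNode "b"))'
paren_balance '(ImplicationLink (PredicateNode "a") (PredicateNode "b")'
```

To check a generated file, pass its contents to the helper, for example `paren_balance "$(cat output.scm)"`, where `output.scm` stands for whatever was given to `-o`.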

test.sh

(Obsolete) A script to experiment with MOSES learning and PLN reasoning. You may need to configure settings.sh (i.e. to set your OpenCog path). Usage is as follows:

mkdir <MY_EXP>
cd <MY_EXP>
../scripts/test.sh ../scripts/settings.sh

test.sh (code)
#!/bin/bash

# Script test to attempt to load MOSES models in scheme format to the
# AtomSpace so that PLN can then reason on them.
#
# It performs the following
#
# 1. Launch an OpenCog server
#
# 2. Load background knowledge from a Scheme file (like feature
# definitions)
#
# 3. Split dataset into k-fold train and test sets
#
# 4. Run MOSES on some problem
#
# 5. Parse the output and pipe it in OpenCog
#
# 6. Use PLN to perform reasoning, etc.

set -u
# set -x

if [[ $# != 1 ]]; then
    echo "Usage: $0 SETTINGS_FILE"
    exit 1
fi

#############
# Constants #
#############

PRG_PATH="$(readlink -f "$0")"
PRG_DIR="$(dirname "$PRG_PATH")"
ROOT_DIR="$(dirname "$PRG_DIR")"
SET_PATH="$1"
SET_BASENAME="$(basename "$SET_PATH")"

#############
# Functions #
#############

# Given an error message, display that error on stderr and exit
fatalError() {
    echo "[ERROR] $@" 1>&2
    exit 1
}

warnEcho() {
    echo "[WARN] $@"    
}

infoEcho() {
    echo "[INFO] $@"
}

# Convert human readable integer into machine full integer. For
# instance $(hr2i 100K) returns 100000, $(hr2i 10M) returns 10000000.
hr2i() {
    local val=$1
    local val=${val/M/000K}
    local val=${val/K/000}
    echo $val
}

# Pad $1 symbol with up to $2 0s
pad() {
    local pad_expression="%0${2}d"
    printf "$pad_expression" "$1"
}

# Split the data into train and test, renaming FILENAME.csv by
# FILENAME_train.csv and FILENAME_test.csv given
#
# 1. Dataset csv file with header
#
# 2. A ratio = train sample size / total size
#
# 3. A random seed
train_test_split() {
    local DATAFILE="$1"
    local RATIO="$2"

    # Reset random seed
    RANDOM="$3"

    # Define train and test outputs
    local DATAFILE_TRAIN=${DATAFILE//.csv/_train.csv}
    local DATAFILE_TEST=${DATAFILE//.csv/_test.csv}

    # Copy header into train and test files
    head -n 1 "$DATAFILE" > "${DATAFILE_TRAIN}"
    head -n 1 "$DATAFILE" > "${DATAFILE_TEST}"

    # Subsample
    while IFS= read -r line; do
        if [[ $(bc <<< "$RATIO * 32767 >= $RANDOM") == 1 ]]; then
            echo "$line" >> "${DATAFILE_TRAIN}"
        else
            echo "$line" >> "${DATAFILE_TEST}"
        fi
    done < <(tail -n +2 "$DATAFILE")
}

# Given
#
# 1. a model name
#
# 2. a combo model
#
# return a scheme code defining the equivalence between the model name
# and the model.
model_def() {
    local name="$1"
    local model="$2"
    echo "(EquivalenceLink (stv 1.0 1.0) (PredicateNode \"${name}\") $model)"
}

########
# Main #
########

# 0. Copy in experiment dir and source settings

infoEcho "Copy $SET_PATH to current directory"
cp "$SET_PATH" .
. "$SET_BASENAME"

# 1. Launch an OpenCog server

infoEcho "Launch cogserver"
cd "$opencog_repo_path/scripts/"
./run_cogserver.sh "$build_dir_name" &
cd -
sleep 5

# 2. Load background knowledge

infoEcho "Load background knowledge"
if [[ "$scheme_file_path" =~ ^[^/] ]]; then # It is relative
    scheme_file_path="$ROOT_DIR/$scheme_file_path"
fi

(echo "scm"; cat "$scheme_file_path") \
    | "$opencog_repo_path/scripts/run_telnet_cogserver.sh"

# 3. Create train and test data

infoEcho "Create train and test data"
if [[ "$data_path" =~ ^[^/] ]]; then # It is relative
    data_path="$ROOT_DIR/$data_path"
fi
cp "$data_path" .
data_basename="$(basename "$data_path")"
train_test_split "$data_basename" "$train_ratio" "$init_seed"
data_basename_train=${data_basename//.csv/_train.csv}
data_basename_test=${data_basename//.csv/_test.csv}

# 4. Run MOSES

infoEcho "Run MOSES"
moses_output_file=results.moses
. "$PRG_DIR/moses.sh"

# 5. Parse MOSES output and pipe it in OpenCog

infoEcho "Load MOSES models into the AtomSpace"
(echo "scm";
    i=0
    while read line; do
        moses_model_name="moses_$(pad $i 3)"
        echo "$(model_def "$moses_model_name" "$line")"
        ((++i))
    done < "$moses_output_file"
) | "$opencog_repo_path/scripts/run_telnet_cogserver.sh"

# 6. Use PLN to perform reasoning, etc.
# TODO

# 7. Kill cogserver
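
The random assignment performed by train_test_split can be tried in isolation. The sketch below applies the same comparison against `$RANDOM` to a count of rows rather than to a CSV file, and swaps the bc call for integer arithmetic on a percentage; the function name `split_counts` and that substitution are assumptions of this sketch, not part of test.sh.

```bash
#!/bin/bash
# Sketch of the random train/test assignment used by train_test_split,
# applied to N rows instead of a CSV file. split_counts is a hypothetical
# name, and the ratio is given as an integer percentage so that plain
# shell arithmetic can replace bc.
split_counts() {
    local n="$1" percent="$2" seed="$3"
    local train=0 held=0 i
    RANDOM="$seed"                      # reseed bash's generator
    for ((i = 0; i < n; i++)); do
        # RANDOM yields an integer between 0 and 32767
        if (( percent * 32767 >= RANDOM * 100 )); then
            train=$((train + 1))
        else
            held=$((held + 1))
        fi
    done
    echo "$train $held"
}

# Roughly 80% of 1000 rows should land in the training set
split_counts 1000 80 1
```

Because each row is assigned independently, the realized split only approximates the requested ratio, which matters most for small datasets.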

Mapping MOSES models into the AtomSpace (Detailed)

Here are more detailed suggestions about how to go about mapping models, their scores, and their features into AtomSpace hypergraphs.

Models

Models are exported in the following format:

EquivalenceLink <1, 1>
   PredicateNode <MODEL_PREDICATE_NAME>
   <MODEL_BODY>

Features

We need to relate GeneNodes, used in the GO description, and PredicateNodes, used in the MOSES models. For that I suggest using the predicate "overexpressed" as follows:

EquivalenceLink <1, 1>
    PredicateNode <GENE_NAME>
    LambdaLink
        VariableNode "$X"
        EvaluationLink
            PredicateNode "overexpressed"
            ListLink
                GeneNode <GENE_NAME>
                VariableNode "$X"

which says that PredicateNode <GENE_NAME> applied to individual $X is equivalent to "GeneNode <GENE_NAME> is overexpressed in individual $X".

Fitnesses

Here we will discuss two fitnesses (as used by Mike): accuracy (1 - score, in Mike's terminology) and precision. We will then discuss confidence.

Accuracy

We define an Accuracy predicate that takes a model and a dataset (or target feature) as arguments. The model, $M, is itself a predicate that evaluates to 1 (the confidence is left aside for now) when the individual $X is classified positively, and to 0 when it is classified negatively.

Similarly, the target feature, $D, is also a predicate that evaluates to 1 when the individual $X has its target feature active, and to 0 otherwise.

EquivalenceLink <1, 1>
    LambdaLink
        VariableList
            $M
            $D
        EvaluationLink
            PredicateNode "accuracy"
            ListLink
                $M
                $D
    LambdaLink
        VariableList
            $M
            $D
        AverageLink
            $X
            EquivalenceLink
                ExecutionOutputLink
                    GetStrength 
                    EvaluationLink
                        $M
                        $X
                ExecutionOutputLink
                    GetStrength 
                    EvaluationLink
                        $D
                        $X

It turns out that the truth value on the AverageLink is going to match the accuracy, given $M and $D. Indeed, the accuracy is the proportion of individuals for which the model is correct with respect to the dataset. With this representation, given the dataset and the model, PLN can directly build the Accuracy predicate.
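
As a small numerical illustration of this averaging, the sketch below computes the accuracy outside the AtomSpace, as the fraction of individuals on which the model and the target agree. The 0/1 vectors are made-up sample data and the function name `accuracy` is chosen for this example only.

```bash
#!/bin/bash
# Sketch: accuracy as the average agreement between model outputs M(x)
# and target values D(x), both given as space-separated 0/1 strings.
# The sample vectors are made up.
accuracy() {
    local -a m=($1) d=($2)
    local n=${#m[@]} agree=0 i
    for ((i = 0; i < n; i++)); do
        # Crisp equivalence: counts 1 when both sides agree, 0 otherwise
        [[ "${m[i]}" == "${d[i]}" ]] && agree=$((agree + 1))
    done
    awk -v a="$agree" -v n="$n" 'BEGIN { printf "%.2f\n", a / n }'
}

# The model agrees with the target on 3 of the 5 individuals
accuracy "1 1 0 1 0" "1 0 0 1 1"
```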

In the absence of dataset, and given the accuracy of each model, one may directly write down the Accuracy predicate for each model, and target feature:

EvaluationLink <model accuracy>
    PredicateNode "accuracy"
    ListLink
        PredicateNode <MODEL>
        PredicateNode <TARGET FEATURE>

Precision

The cool thing about precision is that it translates directly into an Implication strength - that is:

ImplicationLink <TV.s = model precision>
   PredicateNode <MODEL>
   PredicateNode <TARGET FEATURE>

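Indeed, according to PLN (assuming all individuals are equiprobable):

\[ TV.s = \frac{\sum_x \min(M(x), D(x))}{\sum_x M(x)} \]

where \(M\) corresponds to the predicate of a model, \(D\) to the predicate of the target feature, and \(x\) runs over the individuals of the dataset.

This indeed corresponds to the precision:

\[ \text{precision} = \frac{TP}{TP + FP} \]

as \(\sum_x M(x)\) is the number of positively classified individuals, \(TP + FP\), and \(\sum_x \min(M(x), D(x))\) the number of correctly classified individuals among them, \(TP\).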

Recall

Similarly recall is easily translated into an Implication strength - that is:

ImplicationLink <TV.s = model recall>
   PredicateNode <TARGET FEATURE>
   PredicateNode <MODEL>

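given that:

\[ \text{recall} = \frac{\sum_x \min(M(x), D(x))}{\sum_x D(x)} \]

where \(M\) is the model predicate, \(D\) the target feature predicate, and \(x\) runs over the individuals of the dataset.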
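
The two Implication strengths can be checked numerically on a toy dataset. The sketch below assumes the usual PLN formula for the strength of "A implies B", namely the sum over individuals of min(A(x), B(x)) divided by the sum of A(x). Precision and recall then come from the same function with its arguments swapped. The function name and the 0/1 vectors are made up for the example.

```bash
#!/bin/bash
# Sketch: strength of "A implies B" as sum_x min(A(x), B(x)) / sum_x A(x),
# for 0/1 vectors given as space-separated strings (sample data is made up).
implication_strength() {
    local -a a=($1) b=($2)
    local num=0 den=0 i
    for ((i = 0; i < ${#a[@]}; i++)); do
        den=$((den + a[i]))
        num=$((num + (a[i] < b[i] ? a[i] : b[i])))
    done
    if ((den == 0)); then
        echo "undefined"          # the antecedent is never true
        return 1
    fi
    awk -v n="$num" -v d="$den" 'BEGIN { printf "%.4f\n", n / d }'
}

model="1 1 1 1 0"     # M(x): model output per individual
target="1 0 0 1 1"    # D(x): target feature per individual

echo "precision: $(implication_strength "$model" "$target")"
echo "recall:    $(implication_strength "$target" "$model")"
```

Here the model fires on four individuals, two of which have the target feature, while three individuals have the target feature overall, so precision is 2/4 and recall is 2/3.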

Confidence

The confidence can be:

\[ c = \frac{N}{N + k} \]

where \(N\) is the number of individuals, and \(k\) is a parameter.
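
As a rough sketch, and assuming the confidence takes the common PLN form c = N / (N + k), the value can be computed as follows. The function name `confidence` is hypothetical and k = 800 is only an illustrative choice of the parameter.

```bash
#!/bin/bash
# Sketch, assuming c = N / (N + k): confidence from the number of
# individuals N and the parameter k. The value k = 800 used below is
# only an illustrative choice.
confidence() {
    local n="$1" k="$2"
    awk -v n="$n" -v k="$k" 'BEGIN { printf "%.4f\n", n / (n + k) }'
}

confidence 100 800    # few individuals: low confidence
confidence 800 800    # N equal to k: confidence of one half
```

The resulting number could take the place of the constant confidence `1` in the `(stv ...)` expressions emitted by the export script.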

Quiz

1. Why would you want to export MOSES Models Into Atomspace?

Because it's fun
God will bless your first born
You can pattern mine the MOSES models in Atomspace
OpenCog requires MOSES models to exist in Atomspace for PLN to work
OpenCog NLP will be crippled without the laws of MOSES
MOSES is far too metaphysical, and needs to be grounded in a bunch of atoms in order for it to make sense
Well.. at some stage MOSES will be built into Atomspace - so at that time you won't need to export MOSES models into Atomspace
Because god will punish you if you don't

2. Are there any 10 commandments of using MOSES models in the Atomspace?

True
False


Notes

Maintained by: Nil
Priority: Medium

This tutorial was adapted from https://github.com/opencog/agi-bio/blob/master/moses-scripts/README.md and https://github.com/opencog/agi-bio/blob/master/moses-scripts/export_models.md

Bunch of scripts to run MOSES, pipe the models into OpenCog and apply PLN to infer new knowledge: