OpenCog Based Pattern Mining and Inference on Bio Knowledge


This document collects together a few examples regarding pattern mining and inference on “bio knowledge bases” consisting of Atoms created from MOSES modeling of gene expression / SNP datasets, and ingestion into the Atomspace of standard bio-ontologies.

The handful of examples given here barely scratches the surface, and is intended only to be evocative.

This document is intended for “insiders” who understand something about OpenCog knowledge representation, MOSES based analysis of gene expression/SNP datasets, and the general character of biology ontologies. Others may be able to get the basic idea but may also miss something.


Representation of Gene Expression Values

A gene will be represented in the Atomspace by a GeneNode, such as GeneNode "HNRPK".

From the standpoint of analyzing gene expression data, the features we are often looking at are best thought of as predicates of the form "Gene HNRPK, in this individual, has expression > {the median expression of Gene HNRPK across individuals in a particular dataset D}"

One fairly clean and simple way to represent this in the Atomspace is to use an Atom such as

PredicateNode “Gene_HNRPK_overexpressed”

If we have a dataset such as “Nonagenarian_Smith_2013”, represented by

ConceptNode “Nonagenarian_Smith_2013”

then we could say

EvaluationLink
    PredicateNode “Gene_HNRPK_overexpressed”
    ConceptNode “Nonagenarian_Smith_2013”

meaning that the gene in question is overexpressed in the dataset specified. Since the names of Atoms are only for human utilization, we would also need Atoms specifying the relation between the PredicateNode and an associated GeneNode, e.g.

EquivalenceLink
    EvaluationLink PredicateNode “Gene_HNRPK_overexpressed” $X
    EvaluationLink PredicateNode “overexpressed” GeneNode “HNRPK” ConceptNode “Nonagenarian_Smith_2013”

and

EquivalenceLink
    EvaluationLink PredicateNode “overexpressed” $X $D
    GreaterThanLink
        ExecutionOutputLink SchemaNode “GeneExpression” $X $D
        ExecutionOutputLink SchemaNode “Median”
            ExecutionOutputLink SchemaNode “Map” SchemaNode “GeneExpression” SatisfyingSet MemberLink $X $D

Currently we have no reasoning in OpenCog that is smart enough to deal with the latter construct, but this is not necessarily terribly far off, so it’s a good practice to explicitly encode the semantics of one’s predicates in the Atomspace for future use.
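The median-threshold semantics described in this section can be illustrated outside the Atomspace with a small Python sketch. This is only a hedged illustration: the function name `overexpressed`, the nested-dictionary layout and the toy expression values are all assumptions made for the example, not part of OpenCog or MOSES.

```python
from statistics import median

def overexpressed(expression, gene):
    """Return {individual: bool}: is `gene` above its median in this dataset?

    `expression` is a hypothetical layout: gene -> individual -> expression level.
    """
    values = expression[gene]
    threshold = median(values.values())
    return {individual: level > threshold for individual, level in values.items()}

# Made-up toy dataset with four individuals.
toy_dataset = {
    "HNRPK": {"ind1": 2.0, "ind2": 5.0, "ind3": 3.5, "ind4": 9.0},
}

flags = overexpressed(toy_dataset, "HNRPK")
```

Each entry of `flags` plays the role of one Boolean feature value of the kind fed to MOSES.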

Representation of Boolean MOSES Models

A Boolean classification model, produced by MOSES, would look in the simplest case like

ANDLink
    PredicateNode “Gene_HNRPK_overexpressed”
    PredicateNode “Gene_FHK1_overexpressed”

Or in a more complex case something like

ANDLink
    PredicateNode “Gene_HNRPK_overexpressed”
    ORLink
        PredicateNode “Gene_FHK1_overexpressed”
        NotLink PredicateNode “Gene_TPT6_overexpressed”
    ORLink
        PredicateNode “Gene_BRCA1_overexpressed”
        ANDLink
            PredicateNode “Gene_POOP_overexpressed”
            NotLink PredicateNode “Gene_ALLAH17_overexpressed”

Basically, these are nested Boolean logic formulas whose terminals are statements that one gene or another is overexpressed. These formulas are intended to be evaluated on some particular dataset, at which point they yield a truth value relative to that dataset.

So one should understand clearly that these models are being represented as FUNCTIONS, because a logical combination of two functions is another function. In this case the functions (i.e. predicates such as Gene_FHK1_overexpressed) are most directly interpreted as mapping gene-expression datasets into Boolean outputs. On the other hand, the same functions could be overloaded and interpreted otherwise. If one interpreted Gene_FHK1_overexpressed as a number in [0,1] indicating the degree of overexpression (as calculated e.g. by OpenCog’s Quantitative Predicates module, which performs quantile normalization; or as quantile normalized externally), then one could evaluate the logic formulas output by MOSES using uncertain logic, which would be useful in many cases.

MOSES can be run in other modes, e.g. to generate arithmetic formulas combining real-valued expression values (or, more generally, to learn programs involving loops, conditionals and other programming-language constructs). However, the Boolean mode of running MOSES is particularly nice for interfacing MOSES with logic systems (like PLN), because logic systems are already instrumented to handle logical operators.
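To make the "models as functions" reading concrete, here is a minimal Python sketch that evaluates a nested Boolean model either crisply or on degrees in [0,1]. The tuple encoding, the `evaluate` function and the feature values are assumptions for illustration. The graded mode uses simple min/max/complement operators as a stand-in; it is not PLN's uncertain logic and not MOSES's own evaluator.

```python
def evaluate(model, features, graded=False):
    """Evaluate a nested ('and' | 'or' | 'not', ...) tuple tree.

    Leaves are feature names looked up in `features`. With graded=True the
    values are degrees in [0, 1] combined by min/max/complement (an assumed
    stand-in for an uncertain logic).
    """
    if isinstance(model, str):
        return features[model]
    op, *args = model
    vals = [evaluate(arg, features, graded) for arg in args]
    if op == "not":
        return 1.0 - vals[0] if graded else not vals[0]
    if op == "and":
        return min(vals) if graded else all(vals)
    if op == "or":
        return max(vals) if graded else any(vals)
    raise ValueError(f"unknown operator: {op}")

# A small model in the style of the examples above.
model = ("and", "Gene_HNRPK_overexpressed",
         ("or", "Gene_FHK1_overexpressed", ("not", "Gene_TPT6_overexpressed")))

# Made-up feature values for one individual.
crisp = {"Gene_HNRPK_overexpressed": True,
         "Gene_FHK1_overexpressed": False,
         "Gene_TPT6_overexpressed": False}
degrees = {"Gene_HNRPK_overexpressed": 0.9,
           "Gene_FHK1_overexpressed": 0.2,
           "Gene_TPT6_overexpressed": 0.3}

crisp_result = evaluate(model, crisp)
graded_result = evaluate(model, degrees, graded=True)
```

The same model tree is reused unchanged in both modes; only the interpretation of the terminals and operators differs.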

Knowledge from Bio Databases

Examples of knowledge input to the Atomspace from biology databases are as follows. The main thing we have at present is membership of genes in categories, e.g.

MemberLink GeneNode "HNRPK" ConceptNode "MSigDB_GeneSet: NUCLEOPLASM"

MemberLink GeneNode "XRCC6" ConceptNode: "GO:0048308"

The GO (Gene Ontology) and MSigDB are specific genomics-related knowledge bases.

We also have relationships between bio-ontology categories and terms, such as

EvaluationLink PredicateNode: "GO_synonym_exact" ListLink ConceptNode: "GO:0000001" ConceptNode: "mitochondrial inheritance"


ImplicationLink EvaluationLink PredicateNode: "GO_synonym_exact" ListLink $X $Y SimilarityLink $X $Y


And we have relationships between bio-ontology categories, e.g.

InheritanceLink ConceptNode: "GO:0000001" ConceptNode: "GO:0048308"

EvaluationLink PredicateNode: "RO_part_of" ListLink ConceptNode: "GO:0000001" ConceptNode: "GO:0000278"

(the latter indicates there is a part_of relation between the mitochondrion_inheritance category and the mitotic cell cycle category).

Probabilistic Relations Between Bio-Ontology Categories

The inheritance relations supplied with the GO and other bio-ontologies tend to be crisp. For inference and pattern mining purposes, however, it is desirable to create similar relationships with probabilistic truth values, e.g.

InheritanceLink <.3> ConceptNode: "GO:0000007" ConceptNode: "GO:0048308"

Many such relationships will exist in the GO, even within one branch of the GO (e.g. Biological Process, generally the most useful branch for the kind of work we’re doing). Different biological processes may have genes in common, since most genes are multifunctional.

Formation of these links is an example of what in PLN/OpenCog is called “direct evaluation”. It’s not exactly abstract inference, because the link strengths can be derived simply by looking at the genes known to belong to each of the categories and calculating ratios. I’m not sure what the best algorithm for computing these links is. The old (now obsolete/deprecated) PythonPLN had an ExtensionalInheritance rule that calculated

ExtensionalInheritance ConceptNode A ConceptNode B

based on direct evaluation of the members of A and B, i.e. by looking at all X such that MemberLink X A holds and checking which of them also satisfy MemberLink X B.

So, one approach would be to create a rule like this (or port the code for that rule, though there isn’t much code to port) for use with the new C++ PLN. The question is then when to invoke this rule. One option would be to write an InheritanceMiner MindAgent that performed the following steps:

Iterate through all Atoms meeting certain specified criteria (so perhaps it could be initialized with some sort of Iterator that goes through some set of Atoms). Call this list of Atoms S.

For every Atom A on the list S, when it’s visited during the iteration, find all other Atoms B in S that share at least one member with A (e.g. via following the MemberLinks pointing to A to get a set of nodes, then following the MemberLinks pointing from these nodes). Then, for each pair (A,B), apply the direct evaluation based ExtensionalInheritance rule.

In the application at hand, the Iterator in question would iterate through the ConceptNodes representing bio-ontology categories from the ontologies loaded in.
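The two steps above can be sketched in plain Python as follows. This is a rough illustration under stated assumptions rather than an OpenCog MindAgent: the function name and dictionary layout are hypothetical, the link strength is taken to be the crisp ratio |A ∩ B| / |A|, and no confidence value is computed.

```python
from collections import defaultdict

def mine_extensional_inheritance(members):
    """members: {category: set of member names}. Returns {(A, B): strength}."""
    # Invert the MemberLinks: member -> categories that contain it.
    categories_of = defaultdict(set)
    for category, items in members.items():
        for item in items:
            categories_of[item].add(category)

    strengths = {}
    for a, a_members in members.items():
        # Candidate B's are the categories sharing at least one member with A.
        candidates = set()
        for item in a_members:
            candidates |= categories_of[item]
        candidates.discard(a)
        for b in candidates:
            # Direct evaluation: fraction of A's members that are also in B.
            strengths[(a, b)] = len(a_members & members[b]) / len(a_members)
    return strengths

# Made-up gene memberships for three GO categories.
toy_members = {
    "GO:0000007": {"g1", "g2", "g3", "g4"},
    "GO:0048308": {"g1", "g5"},
    "GO:0000278": {"g9"},
}
links = mine_extensional_inheritance(toy_members)
```

Only pairs with at least one shared member are ever touched, which is the point of following MemberLinks instead of comparing every pair of categories.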

This process would calculate links like InheritanceLink <.3> ConceptNode: "GO:0000007" ConceptNode: "GO:0048308"

ExtensionalInheritance relationships are enough to obtain interesting, meaningful results on genomics data. However, more insight will be added to the analysis if we calculate and utilize IntensionalInheritance as well. This would be done in a manner structurally similar to what we've described above for ExtensionalInheritanceLinks, i.e. implement an IntensionalInheritance rule that calculates

IntensionalInheritance ConceptNode A ConceptNode B

based on evaluation of all the links coming out of A and B, according to the mathematics of intensional inheritance as specified on the OpenCog wiki site (corrected from the PLN book, which has some typos in the pertinent section) and as implemented previously in the PythonPLN. The InheritanceMiner agent outlined above could then be extended to handle IntensionalInheritanceLink mining as well.

Mining Surprising Patterns

We now give some simple examples of surprising patterns that might be mined from an Atomspace containing MOSES models learned from genetics datasets and gene category relationships obtained from bio-ontologies.

We will start from a simple intuitive example, then cast it into the format of the current Pattern Miner. Suppose we have

ConceptNode: "GO:0000001" <n=10>

ConceptNode: "GO:0000002" <n=100>

ConceptNode: "GO:0000003" <n=100>

(the n are count values, which indicate the number of items known to belong to the category). Then suppose we have, as background knowledge,

InheritanceLink <.9> ConceptNode: "GO:0000001" ConceptNode: "GO:0000002"

InheritanceLink <.1> ConceptNode: "GO:0000002" ConceptNode: "GO:0000001"

InheritanceLink <.1> ConceptNode: "GO:0000002" ConceptNode: "GO:0000003"

In other words, roughly: Category 1 lies almost entirely within Category 2 but is a fairly small part of it, and Category 2 is mostly not part of Category 3. If it were not known already, the first InheritanceLink given above would be a surprising pattern: the degree of membership of Category 1 within Category 2 is much higher than one would expect from random chance. Suppose next that

InheritanceLink <.6> ConceptNode: "GO:0000001" ConceptNode: "GO:0000003"

-- i.e. Category 1 is heavily overlapping with Category 3. This would be a surprising pattern, because it presents a probability much higher than one would expect from random chance, and also contradicts what one would assume from looking at the relation between Category 2 and Category 3.

(NOTE: One option suggested by examples like this is to associate a “surprisingness” value with an Atom truth value, as an optional addition to strength and confidence. This could be an InformationalTruthValue extending a SimpleTruthValue. Since surprisingness depends on context, these values would need to be updated relatively often, and would best be time-stamped with the time of last update. This would be useful in many contexts beyond biological pattern mining.)

The current Pattern Miner looks for patterns containing VariableNodes; so, it might find the above pattern via identifying a surprising pattern of the form

AndLink <.6> MemberLink $X ConceptNode: "GO:0000001" MemberLink $X ConceptNode: "GO:0000003"

Roughly speaking, in the current design for the Pattern Miner, the “I-Surprisingness” captures “probability higher or lower than would be expected from random chance” type surprisingness, whereas the “II-Surprisingness” measure also captures “contradicts what one would assume from other relationships in the Atomspace” type surprisingness.

Given a pattern of the above form, a simple inference step creates

SimilarityLink ConceptNode: "GO:0000001" ConceptNode: "GO:0000003"

and another step creates

InheritanceLink ConceptNode: "GO:0000001" ConceptNode: "GO:0000003"


An example of a more complex pattern that the Pattern Miner might find would be

AndLink
    MemberLink $X ConceptNode "MSigDB_GeneSet: NUCLEOPLASM"
    MemberLink $X ConceptNode “GO:0000005”
    EvaluationLink PredicateNode “overexpressed” $X ConceptNode: “Dataset:Smith_Wesson_1997”

If this combination (belonging to a specific MSigDB category AND a specific GO category, AND being overexpressed in a certain dataset) occurred surprisingly often in the Atomspace (and also surprisingly often relative to the other knowledge in the Atomspace), then the Pattern Miner would flag it as a surprising combination.
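The “higher or lower than would be expected from random chance” notion of surprisingness discussed in this section can be illustrated with a deliberately simplified Python sketch. It compares the observed probability of a conjunction with the probability expected if the two memberships were independent. This is an assumption-laden toy, not the Pattern Miner's actual I-Surprisingness formula; the function name and the universe size of 1000 genes are made up for the example.

```python
def independence_surprisingness(n_a, n_b, n_both, n_universe):
    """Observed P(A and B) minus the value expected under independence."""
    observed = n_both / n_universe
    expected = (n_a / n_universe) * (n_b / n_universe)
    return observed - expected

# Counts in the spirit of the intuitive example: 10 genes in GO:0000001,
# 100 genes in GO:0000003, and a strength of .6 read as 6 shared genes.
# The universe of 1000 genes is a hypothetical figure.
score = independence_surprisingness(n_a=10, n_b=100, n_both=6, n_universe=1000)

# With a single shared gene, the overlap is just what chance would predict.
baseline = independence_surprisingness(n_a=10, n_b=100, n_both=1, n_universe=1000)
```

A positive score means the conjunction occurs more often than chance would suggest; a measure in the style of II-Surprisingness would additionally compare against what other links in the Atomspace predict.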

A Few Basic PLN Inferences

Next we give a few examples of the kinds of conclusions PLN inference might draw when applied across an Atomspace filled with biological knowledge of the sorts outlined above.

Deduction

First a simple example of deduction. Suppose we know that

ImplicationLink <.5> Member $X ConceptNode: "GO:0000001" EvaluationLink <1,.9> PredicateNode “overexpressed” $X ConceptNode: “Dataset:Smith_Wesson_1997”

And suppose we know that

IntensionalSimilarityLink <.4,.9> ConceptNode: “Dataset:Smith_Wesson_1997” ConceptNode: “Dataset:Crosby_Stills_Nash_76”

(because these two datasets share lots of properties in general; maybe they both study aging in humans, and tend to highlight a lot of the same genes as important). Then a couple of PLN steps (a conversion step plus a deduction step) will conclude that

ImplicationLink <.7,.4> Member $X ConceptNode: "GO:0000001" EvaluationLink PredicateNode “overexpressed” $X ConceptNode: “Dataset:Crosby_Stills_Nash_76”

where the confidence of the conclusion is not very high due to the speculative nature of the inference (.4 is a plausible estimate of what one might get).

Abduction

Next, a simple example of (more speculative) abductive inference would be as follows. Suppose one found that

ImplicationLink <.5> Member $X ConceptNode: "GO:0000001" EvaluationLink <1,.9> PredicateNode “overexpressed” $X ConceptNode: “Dataset:Smith_Wesson_1997”

ImplicationLink <.4> Member $X ConceptNode: "GO:0000001" EvaluationLink <1,.9> PredicateNode “overexpressed” $X ConceptNode: “Dataset:Crosby_Stills_Nash_76”

ImplicationLink <.71> Member $X ConceptNode: "GO:0000001" EvaluationLink <1,.9> PredicateNode “overexpressed” $X ConceptNode: “Dataset:Bush_Gore_2000c”

ImplicationLink <.5> Member $X ConceptNode: "GO:0000005" EvaluationLink <1,.9> PredicateNode “overexpressed” $X ConceptNode: “Dataset:Smith_Wesson_1997”

ImplicationLink <.4> Member $X ConceptNode: "GO:0000005" EvaluationLink <1,.9> PredicateNode “overexpressed” $X ConceptNode: “Dataset:Crosby_Stills_Nash_76”

ImplicationLink <.71> Member $X ConceptNode: "GO:0000005" EvaluationLink <1,.9> PredicateNode “overexpressed” $X ConceptNode: “Dataset:Bush_Gore_2000c”

That is: GO categories 1 and 5 are overexpressed in the same 3 datasets. This is a weak body of evidence suggesting that GO categories 1 and 5 are related to one another and may have further similarities. PLN recognizes this in a couple of ways. First of all, abductive inference, after a few steps, would suggest that

ExtensionalSimilarityLink < s = .7,.2 or whatever> ConceptNode: “GO:0000001” ConceptNode: “GO:0000005”

Secondly, as an alternate route, the InheritanceMiner agent might recognize

IntensionalSimilarityLink <.7,.8 or whatever> ConceptNode: “GO:0000001” ConceptNode: “GO:0000005”

based on the presence of common properties. Which of these two routes will be more important depends on the specific numbers involved in the situation; but it seems likely that in this case the IntensionalSimilarity is going to be the more significant relationship, since the pure Bayesian similarity estimate (as done in the ExtensionalSimilarityLink truth value calculation) is going to be pretty small given the large number of genes overexpressed in a typical dataset.

On the other hand, suppose we had a predicate PredicateNode “expression_top_centile” indicating the expression value was in the top 1%, and suppose that in the above relationships used to exemplify abduction we used this in place of the “overexpressed” predicate. In this case, the ExtensionalSimilarity calculation might come out more meaningful, because the SatisfyingSets of predicates like

EvaluationLink <1,.9> PredicateNode “expression_top_centile” $X ConceptNode: “Dataset:Bush_Gore_2000c”

are much smaller than for predicates like

EvaluationLink <1,.9> PredicateNode “overexpressed” $X ConceptNode: “Dataset:Bush_Gore_2000c”
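The effect of SatisfyingSet size on the abductive estimate can be seen in a small numeric sketch. It uses the independence-based deduction strength formula and Bayes-rule inversion, chaining them into abduction. Everything here is illustrative: the function names are hypothetical, only strengths (no confidences) are handled, the result is the strength of an inheritance-style A→C relation rather than a symmetric similarity, and the term probabilities (a category covering 0.5% of genes, a predicate covering 50% versus 1% of genes) are made-up numbers.

```python
def inversion(s_ab, s_a, s_b):
    """Bayes rule: strength of B->A from the strength of A->B and the priors."""
    return s_ab * s_a / s_b

def deduction(s_ab, s_bc, s_b, s_c):
    """Independence-based deduction: strength of A->C from A->B and B->C."""
    return s_ab * s_bc + (1 - s_ab) * (s_c - s_b * s_bc) / (1 - s_b)

def abduction(s_ab, s_cb, s_b, s_c):
    """From A->B and C->B, estimate A->C by inverting C->B and then deducing."""
    s_bc = inversion(s_cb, s_c, s_b)
    return deduction(s_ab, s_bc, s_b, s_c)

# A, C: two GO categories; B: genes satisfying a predicate in some dataset.
prior_c = 0.005  # hypothetical fraction of all genes that belong to C

# "overexpressed" (above median) is satisfied by about half of all genes.
broad = abduction(s_ab=0.5, s_cb=0.5, s_b=0.5, s_c=prior_c)

# "expression_top_centile" is satisfied by about 1% of all genes.
narrow = abduction(s_ab=0.5, s_cb=0.5, s_b=0.01, s_c=prior_c)
```

With the broad predicate the estimate collapses to the prior of C, so sharing the property carries no information; with the narrow predicate the estimate rises well above that prior.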

Rearrangement, Concept Formation

Going back to the complex example of pattern mining given above, suppose the Pattern Miner found that

AndLink MemberLink $X ConceptNode "MSigDB_GeneSet: NUCLEOPLASM" MemberLink $X ConceptNode “GO:0000005” EvaluationLink PredicateNode “overexpressed” $X ConceptNode: “Dataset:Smith_Wesson_1997”

PLN can conjecturally rearrange this to

ImplicationLink
    ANDLink
        MemberLink $X ConceptNode "MSigDB_GeneSet: NUCLEOPLASM"
        MemberLink $X ConceptNode “GO:0000005”
    EvaluationLink PredicateNode “overexpressed” $X ConceptNode: “Dataset:Smith_Wesson_1997”

PLN will assign a hypothetical inferred truth value to this link based on logic rules, but once the link is there, a DirectEvaluation MindAgent (which doesn't currently exist, but should, and would be easy to code) can update this truth value based on direct evaluation of the Implication against the Atoms in the Atomspace. This can also get rearranged to

ImplicationLink
    MemberLink $X
        ANDLink
            ConceptNode "MSigDB_GeneSet: NUCLEOPLASM"
            ConceptNode “GO:0000005”
    EvaluationLink PredicateNode “overexpressed” $X ConceptNode: “Dataset:Smith_Wesson_1997”

Now if the combination ANDLink ConceptNode "MSigDB_GeneSet: NUCLEOPLASM" ConceptNode “GO:0000005”

occurs in multiple ImplicationLinks which have reasonably high surprisingness value, it may be valuable for a ConceptCreation MindAgent (not yet existent, but easy to create) to create a new ConceptNode initially seeded by this conjunction, i.e.

SimilarityLink <1,.8> ConceptNode: “Created_from_AndLink_ MSigDB_GeneSet: NUCLEOPLASM_GO:00005” ANDLink ConceptNode "MSigDB_GeneSet: NUCLEOPLASM" ConceptNode “GO:0000005”

As time passes and more inference happens, this newly created ConceptNode might get new members and properties and drift a bit from its origins. Inference is being used as a cue for creating new, potentially interesting concepts. (Of course, Boolean combination of existing concepts is only one among many potentially useful heuristics for new concept creation; conceptual blending is another option, for instance…)

Also (going back to the Implication resulting from rearrangement), given the information

ImplicationLink
    MemberLink $X
        ANDLink
            ConceptNode "MSigDB_GeneSet: NUCLEOPLASM"
            ConceptNode “GO:0000005”
    EvaluationLink PredicateNode “overexpressed” $X ConceptNode: “Dataset:Smith_Wesson_1997”

IntensionalSimilarityLink <.4,.9> ConceptNode: “Dataset:Smith_Wesson_1997” ConceptNode: “Dataset:Crosby_Stills_Nash_76”

PLN may conjecture

ImplicationLink
    MemberLink $X
        ANDLink
            ConceptNode "MSigDB_GeneSet: NUCLEOPLASM"
            ConceptNode “GO:0000005”
    EvaluationLink PredicateNode “overexpressed” $X ConceptNode: “Dataset:Crosby_Stills_Nash_76”

-- again, a hypothesis that can be checked via direct evaluation.

Causal Inference

Suppose we know

ImplicationLink MemberLink $X ANDLink ConceptNode "MSigDB_GeneSet: NUCLEOPLASM" ConceptNode “GO:0000005” EvaluationLink PredicateNode “overexpressed” $X ConceptNode: “Dataset:Smith_Wesson_1997”

and we also know

Implication <.2> EvaluationLink PredicateNode “overexpressed” $X ConceptNode: “Dataset:Smith_Wesson_1997” CausalImplicationLink EvaluationLink PredicateNode “overexpressed_in_organism” $X $Y EvaluationLink PredicateNode “long-lived” $Y

The latter implication presents the speculation that genes overexpressed in the dataset in question are causal for longevity (this could be input explicitly as metadata for the dataset when it’s inserted into the Atomspace, though a sufficiently smart system could also infer this from a research abstract associated with the dataset or some other general description of the experiment that generated the data). The above could be more abstractly formulated by defining

EquivalenceLink
    EvaluationLink PredicateNode: “expression_experiment_measures” $E $P
    Implication
        EvaluationLink PredicateNode “overexpressed” $G $E
        CausalImplicationLink
            EvaluationLink PredicateNode “overexpressed_in_organism” $G $Y
            EvaluationLink $P $Y

and then

EvaluationLink PredicateNode: “expression_experiment_measures” ConceptNode: “Dataset:Smith_Wesson_1997” PredicateNode: “long-lived”

In any case, this lets us guess that

CausalImplicationLink AndLink MemberLink $X ANDLink ConceptNode "MSigDB_GeneSet: NUCLEOPLASM" ConceptNode “GO:0000005” EvaluationLink PredicateNode “overexpressed_in_organism” $X $Y EvaluationLink PredicateNode “long-lived” $Y

On the other hand, from a pathway database we may load into the Atomspace the information that

CausalImplicationLink
    ThereExists $Z:
        AndLink
            MemberLink $Z ConceptNode “GO:0000009”
            EvaluationLink PredicateNode “overexpressed_in_organism” $Z
    ThereExists $X:
        AndLink
            MemberLink $X ConceptNode "MSigDB_GeneSet: NUCLEOPLASM"
            EvaluationLink PredicateNode “overexpressed_in_organism” $X

and

CausalImplicationLink
    ThereExists $Z:
        AndLink
            MemberLink $Z ConceptNode “GO:000011”
            EvaluationLink PredicateNode “overexpressed_in_organism” $Z
    ThereExists $X:
        AndLink
            MemberLink $X ConceptNode “GO:0000005”
            EvaluationLink PredicateNode “overexpressed_in_organism” $X

Putting these together, one obtains the conjectural conclusion that CausalImplicationLink AndLink MemberLink $X ANDLink ConceptNode “GO:0000009” ConceptNode “GO:000011” EvaluationLink PredicateNode “overexpressed_in_organism” $X $Y EvaluationLink PredicateNode “long-lived” $Y

Thus, by combining causal hypotheses from a pathway DB with causal hypotheses obtained from inference based on the results of expression analysis and bio-ontologies, one obtains new causal hypotheses, which can then be used to generate new experiments etc. The network of conjectured causal relations obtained in such a way can then be pruned by causal network methods such as those Matt Iklé and I are now experimenting with. And thus may artificial science be done ;-)