Measuring Surprisingness
This wiki page somewhat crudely outlines some ideas about evaluation of the "interesting surprisingness" of patterns in data (specifically focused on patterns in OpenCog's Atomspace). These ideas were developed by Shujing Ke and Ben Goertzel during August-September 2014, in the context of Shujing's development of a "frequent and surprising pattern miner" for OpenCog. (Thanks are also due to Linas Vepstas for useful discussion and suggestions; and Zar Goertzel for some bug detection on this wiki page.)
Contents
Motivation
The goal here is to formulate a practical, yet theoretically grounded way to measure the surprisingness of a pattern in a large dataset, esp. of a pattern in a large hypergraph or graph.
The Pattern Miner page gives more context and pretty pictures ... if you've looked at greedy data mining algorithms at all (e.g. Apriori, or especially FP-Growth), this will make sense to you as it's basically a "complicated graph/hypergraph version."
The discussion of surprisingness on this page is in that context -- it's about "what kind of patterns to look for" ... if not just frequent ones then what? Probabilistically surprising ones? Or something subtler? The rating function for an interesting pattern can't be too expensive to evaluate as it has to be applied over and over again in an algorithm that expensively and repeatedly sweeps through a whole huge graph...
Our initial exploration in this direction utilized the interaction information, but this proved troublesome for various reasons, and so we felt moved to develop our own formulas, inspired by the concepts underlying the interaction information.
(For one thing, when measuring a combination of elements, the interaction information tends to scale with the size of the element in the combination that has the highest count. This is not always desirable. It also tends to assign more information to combinations with higher counts overall. It could be viable to normalize the interaction information somehow, but this also leads to complexities. The counterintuitive nature of the interaction information, in spite of its mathematical elegance, has often been observed by others.)
Basic Concepts
The basic philosophical concept of "surprisingness" developed here is, roughly speaking (this hasn't been carefully wordsmithed):
- A predicate is X-surprising, to a particular cognitive system at a particular point in time, if its probability is a lot higher, or a lot lower, than would be predicted by the system via simple inferences 'of type X' from other related knowledge involving "coherent concepts" that the system knows a lot about
The above is a bit fuzzy and we will develop here two mathematical measures along these lines... we will develop here concepts of
- I-surprisingness, short for Independence-surprisingness ... by I-surprisingness we mean, roughly, "a combination that has a probability significantly different from what would derive via independence assumptions on its components"
- II-surprisingness, short for Independence and Inheritance Surprisingness, by which we mean, roughly: Surprisingness associated with a combination C, that can't be trivially inferred from the surprisingness of "some combination of coherent concepts that is a superset or subset combination of C" (a notion to be defined below)
We will look at both normalized and non-normalized versions of these surprisingness measures; most likely the normalized measures will be more frequently useful.
The notion of X-surprisingness could be expanded further via expanding the types of inference included. For instance, one could think about
- PLN-surprisingness, meaning: Surprisingness associated with a combination, that can't be rapidly inferred via PLN from any existing knowledge about coherent concepts in the system's knowledge-base
A problem with this latter definition is its computational cost; it's too expensive to be used as a rating function inside a pattern miner, given current technology. So the practical challenge is to find measures that embody the concept of "X-surprisingness" beyond I- and II- surprisingness, but are tractable to compute within an algorithm that greedily sweeps repeatedly across a large knowledge-store.
Basic Formalization of Surprisingness
To formalize these notions, let's start with the definition of independence-based surprisingness:
isurp(ABC) = I-Surprisingness(ABC) =max { P(ABC) - maxP , minP - P(ABC) } The max value of the probability of all its possible sub-component combinations : maxP = max{ P(AB) * P(C) , P(AC) * P(B), P(BC) * P(A), P(A) * P(B) * P(C) } The min value of the probability of all its possible sub-component combinations : minP = min{ P(AB) * P(C) , P(AC) * P(B), P(BC) * P(A), P(A) * P(B) * P(C) }
We are using the ternary case here for initial exemplification, but the same idea clearly applies for quaternary and higher combinations. (In fact, in the following, we will often use the binary case for working out examples by hand.)
This is a good start, but note that if P(ABC) is small then the I-surprisingness is going to come out small, so that this will usually attribute more I-surprisingness to combinations with a larger count, which is not always going to be desirable. So we can also look at a normalized version:
nisurp(ABC) = isurp(ABC) / P(ABC)
This tells how much independence-assumption-based surprise value the combination delivers, relative to its count (roughly speaking)....
Beyond I-Surprisingness
In order to move toward a subtler notion of surprisingness, beyond very simplistic considerations of probabilistic independence and its violation, let's first look at some basic math related to I-surprisingness as defined above..
What we want to explore here is the relationship between the I-surprisingness of a combination, and the I-surprisingness of its "superset combinations" and "subset combinations". E.g. for the combination "blonde Americans", a superset combination would be "blonde Westerners" and a subset combination would be "blonde Californians."
Let's look initially at the binary case for simplicity...
Given the combination AB, if we have
A = A1 U A1*, B = B1 U B1*
where A1* = A-A1, B1* = B-B1, we will call
- A1 B1 , a subset combination of AB
- A B, a superset combination of A1 B1
We can derive
isurp(A,B) = isurp(A1, B1) + isurp(A1*,B) + isurp(A1, B1*) + isurp(A1*, B1*)
since
P(A B) = P(A1 B1) + P(A1 B1*) + P(A1* B1) + P(A1* B1*)
and
P(A) P(B) = P(A1) P(B) + P(A1) P(B1*) + P(A1*) P(B) + P(A1*) P(B1*)
so that
P(A B) - P(A) P(B) = P(A1 B1) - P(A1) P(B1) + P(A1 B1*) - P(A1) P(B1*) + P(A1* B1) - P(A1*) P(B1) + P(A1* B1*) - P(A1*) P(B1*)
so that any one of the 4 component I-surprisingnesses being large, would make the overall I-surprisingness isurp(A,B) large…
This simple calculation shows that, if any single subset combination of AB is highly I-surprising, it would explain the combination AB being highly I-surprising. Because on average the other 3 subset combinations of AB would come out to have I-surprisingness zero (though with potential for variance due to noise, of course...)
On the other hand, consider the case where
P(A B) = P(A1 B1) + P(A1 B1*) + P(A1* B1) + P(A1* B1*) = X + X + X + X P(A) P(B) = P(A1) P(B) + P(A1) P(B1*) + P(A1*) P(B) + P(A1*) P(B1*) = Y Z + Y Z + Y Z + Y Z
Then
isurp(A,B) = 4X - 4YZ = 4(X-YZ) isurp(A1,B1) = isurp(A1*,B) = isurp(A1, B1*) = isurp(A1*, B1*) = X-YZ nisurp(A,B) = 1 - YZ/X nisurp(A1,B1) = 1 - YZ/X
In this case, the normalized I-surprisingness of (A,B) is the same as that of its various subset combinations, which means that looking at the I-surprisingness of the subset combinations basically gives no new and interesting information... one might as well just look at the I-surprisingness of (A,B) which is more abstract and encompasses more data.
Note that in this example the superset combination has more (non-normalized) I-surprisingness than the subset combinations. But the superset combination does not have more normalized I-surprisingness....
II-Surprisingness
Based on the above considerations, we suggest that, as well as looking at the simplistic probabilistic surprisingness of a combination, one also needs to ask: Can the I-surprisingness of the combination be trivially inferred from other available knowledge?
We can phrase this as: Are there coherent concepts known to the reasoning system at hand, from which the I-surprisingness of the given combination C can be simply extrapolated? If so, then C is not really 'interestingly surprising' to the reasoning system under consideration, even though it may be surprising in raw probabilistic terms.
To pursue this, we will introduce a coherence measure coh(X), which takes in any concept X and outputs a number between 0 and 1. This is a measure of whether X represents a coherent, sensible concept or not....
As a short-term working assumption, we could assume that coh(X)=1 for every concept X explicitly represented by a ConceptNode in an OpenCog Atomspace. But further on, when we have OpenCog systems learning a lot of their own concepts, this won't work anymore, and we will need to be more careful about this. The intuition, for example, is that
- the population of Africa is a coherent concept
- the set obtained by: arranging all African people in order by height, and taking every third person in this ordering ... is not coherent ... nor is "the population of the Central African Republic, plus all tortured people in the world"
Roughly speaking, a concept should be considered coherent if its members share a lot of important elements and properties; i.e. if the concept forms a reasonably high-quality cluster according to an inter-element similarity measure defined by a combination of intension and extension. It will be useful, later on, to have a process that assigns a coherence value to concepts in OpenCog, and e.g. increases the LTI value of concepts that are more coherent.
Given some measure of concept coherence, a viable way to measure interesting surprisingness, beyond I-surprisingness as defined above, would seem to be as follows:
iisurp(ABC) = II-Surprisingness(ABC) =max { P(ABC) - imaxP , iminP - P(ABC) } \\ The below is the max value of: the probability of all ABC's possible sub-component combinations, \\ or, the probability simplistically estimated for it from a coherent \\ superset or subset combination : imaxP = max {maxInd, maxSuper, maxSub} \\ The below indicates the biggest estimate of P(ABC) one can get via \\ reasoning based on an independence assumption from the subcomponents of ABC maxInd = max { P(AB) * P(C) , P(AC) * P(B), P(BC) * P(A), P(A) * P(B) * P(C) } \\ The below indicates the biggest estimate of P(ABC) one can get via \\ reasoning of the form: A is a subset of the coherent concept A1, and \\ P(A1 B C) is known, so we can just assume that A behaves the same way \\ as the rest of A1 ... (or similar reasoning regarding B1, C1) maxSuper = max{ max{g(coh(A1)) *P(A1 B C) P(A)/P(A1) | A_1 is a superset of A}, max{g(coh(B1)) * P(A B1 C) P(B)/P(B1) | B_1 is a superset of B}, max{g(coh(C1)) * P(A B C1) P(C)/P(BC1) | C_1 is a superset of C} } \\ The below indicates the biggest estimate of P(ABC) one can get via \\ reasoning of the form: A, B and C are supersets of the coherent concepts \\ A1, A2 and A3; and, the members of ABC are all actually members of A1 B1 C1 maxSub = max{ g(coh(A1)) * g(coh(B1)) * g(coh(C1)) * P(A1 B1 C1) | A1 is a subset of A, and B1 is a subset of B, and C1 is a subset of C) } \\ Here g() is a function so that g(1) = 1, g(0)= something big, and g(x) \\ decreases monotonically as x goes from 0 to 1. The idea is that ABC \\ will seem more surprising if maxSub is smaller. So if the coherence \\ is 1, then we want it to be easier for maxSub to be small. But if \\ the coherence is near 0, we want it to be harder for maxSub to be small. \\ An extreme but perhaps viable choice would be g(x) = 1/x . \\ The below is the min value of: the probability of all ABC's possible sub-component combinations, \\ or, the probability simplistically estimated for it from a coherent \\ superset or subset combination : iminP = max {minInd, minSuper, minSub} \\ The below indicates the smallest estimate of P(ABC) one can get via \\ reasoning based on an independence assumption from the subcomponents of ABC minInd = min { P(AB) * P(C) , P(AC) * P(B), P(BC) * P(A), P(A) * P(B) * P(C) } \\ The below indicates the smallest estimate of P(ABC) one can get via \\ reasoning of the form: A is a subset of the coherent concept A1, and \\ P(A1 B C) is known, so we can just assume that A behaves the same way \\ as the rest of A1 ... (or similar reasoning regarding B1, C1) minSuper = min{ min{h(coh(A1)) *P(A1 B C) P(A)/P(A1) | A1 is a superset of A}, min{h(coh(A1)) *P(A B1 C) P(B)/P(B1) | B1 is a superset of B}, min{h(coh(C1)) *P(A B C1) P(C)/P(BC1) | C1 is a superset of C} } \\ Here h() is a function so that h(0)=0 and h(1) is larger, \\ and h(x) increases monotonically for x in [0,1]. h(x) = x would \\ be one viable option. The idea is that ABC will seem more surprising \\ if minSuper is bigger. So if the coherence is 1, then we want minSuper \\ to be allowed to be big. But if the coherence is near 0, then we want \\ this to shrink minSuper. \\ The below indicates the smallest estimate of P(ABC) one can get via \\ reasoning of the form: A, B and C are supersets of the coherent concepts \\ A1, A2 and A3; and, the small size of ABC relative to what one would \\ estimate based on independence assumptions from its components, is due \\ to A1 B1 C1 having a small size relative to what would estimate based \\ on independence assumptions from its components minSub = min{ h((A1)) * h(coh(B1)) * h(coh(C1)) * [ P(A1 B1 C1) + minInd_{A*, B*, C*} ] | A1 is a subset of A, and B1 is a subset of B, and C1 is a subset of C) } where A* = A - A1, etc.; and minInd_{A*, B*, C*} indicates minInd evaluated at A*, B*, C* rather than A, B, C. The point of
We can then of course define
niisurp(ABC) = iisurp(ABC) / P(ABC)
One sees from the above that maxSuper, maxSub, minSuper and minSub represent simplistic reasoning processes for estimating the probability P(ABC) based on existing knowledge. One could extend the proposed approach by incorporating more and more complex reasoning processes into it; but this would of course make the calculations more and more expensive. The terms included here represent reasoning processes sufficiently simplistic that they may viably be carried on on a large scale in the course of pattern mining.
A Very Simple Example
A simple example for exploring surprisingness measures was proposed by Linas Vepstas:
- everyone has either blue eyes or brown eyes with P(blue)=P(brown)=1/2
- everyone has either black hair or blonde hair, with P(black)=P(blonde)=1/2
- everyone who has blue eyes also has blonde hair (and thus, everyone who has brown eyes has black hair). * a third variable could be european and chinese. (lets pretend that P(chinese)=P(euro)=1/2)
This may be formalized as
P(blue) = P(brown) = P(black) = P(blonde) = P(euro) = P(chinese) = .5 P( blue & black) = P(brown & blonde) = 0 P(blue & blonde) = P(brown & black) = .5 P(euro & blonde) = P(euro & brown) = P(euro & black) = P(euro & blue) = P(chinese & blonde) = P(chinese & brown) = P(chinese & black) = P(chinese & blue) = P( euro & blue & blonde) = P( chinese & blue & blonde) = P( euro & brown & black) = P( chinese & brown & black) =.25
In this case the only surprising patterns will be:
blue & black, brown & blonde, blue & blonde, brown & black
E.g.
nisurp(blue & blonde) = (.5 - .25) / .5 = .125 nisurp(euro & blue & blonde) = (.25 - maxP ) / .25 =0, since maxP = max{ P(euro) P(blue & blonde), P(blue) P(euro & blonde), P(blonde) P(euro & blue) } = max{ .25, .125, .125} = .25
and similar calculations hold for the other superset combinations of blue & blonde.
For the II-surprisingness, continuing with the {euro blue blonde} combination,
maxSuper = max{ P(human & blue & blonde)P(euro)/P(human), P(euro & human & blonde)P(blue)/P(human), P(euro & blue & human) P(blonde) / P(human)} = .25 minSuper = .125 maxSub = no useful given data
and we get
iisurp(euro blue & blonde) = .25 - .25 =0 niisurp(euro blue & blonde) = 0
This example is basically too small to use to explore the "interesting surprisingness" aspect, via which II-surprisingness differs from I-surprisingness.
A Slightly Less Simple Example
Now let's consider a slightly more complex example that stresses the definition of II-surprisingness a little more.
Consider a situation wherein:
- relative to all people on Earth, having family members seized by roaming men with guns and chopped into pieces is very surprising
- relative to all people in Africa, having family members seized by roaming men with guns and chopped into pieces is still pretty surprising
- relative to people in the Central African Republic, it's not so surprising (cf
To avoid having a long variable name like had_family_members_seized_and_chopped, let's summarize that situation using the term "horror".
Then, if we have a pattern like
A: Evaluation lives_in $var 1 Africa B: Inheritance $var1 horror
then AB is I-surprising, but with
C: Evaluation lives_in $var 1 Central_African_Repubic B: Inheritance $var1 horror
CB is more II-surprising. Further, assuming the brutality is distributed fairly evenly across the CAR, we would also find DB, EB and FB I-surprising, where
D: Evaluation lives_in $var 1 Bangui B: Inheritance $var1 horror ... E: Evaluation lives_in $var 1 Goroumo B: Inheritance $var1 horror ... F: Evaluation lives_in $var 1 Bozoum B: Inheritance $var1 horror
(these are all cities in the CAR...)
However, it's not actually interesting to observe city by city that, for each city in the CAR, people have had family members seized and chopped up.... Rather, the interesting pattern is actually CB. The pattern AB is not as interesting either -- and is actually a bit misleading.
By a probabilistic calculation, AB, CB, DB, EB and FB would all come out genuinely surprising...
However, the I-surprisingness of AB, DB, EB and FB are all derived from the I-surprisingness of CB --- so they are actually not that interesting....
The definition of II-surprisingness takes account of this, and tells us that CB has more II-surprisingness than these other combinations.
To understand this better, let's look at a concrete quantitative version of the example. In this example we will assume that all concepts involved have coherence of 1.
P(Africa) = .2 P(horror & Africa) = .01 P(Africa) * P(horror) = .004 surp(Africa, horror) = P(Africa & horror) - P(Africa) P(horror) = .006 P(CAR) = .05 P(horror) = .02 // note this means all the horror in Africa is in the CAR P(CAR & horror) = .01 P(CAR) * P(horror) = .001 surp(CAR, horror) = P(CAR & horror) - P(CAR) P(horror) = .009 P(Bangui) = .002 P(Bangui & horror) = .0004 // note this means the percentage of horror in // Bangui is the same as the percentage of horror in the CAR overall P(Bangui) * P(horror) = .00004 surp(Bangui, horror) = P(Bangui & horror) - P(Bangui) P(horror) = .00036
Given this set-up, we have
nisurp(Africa, horror) = [ P(Africa & horror) - P(Africa) P(horror) ] / P(Africa & horror) = .6 nisurp(CAR, horror) = [ P(CAR & horror) - P(CAR) P(horror) ] / P(CAR & horror) = .9 nisurp(Bangui, horror) = [ P(Bangui & horror) - P(Bangui) P(horror) ] / P(Bangui & horror) = .9
Next we can compute
maxSuper(CAR, horror) = P(Africa horror) P(CAR) / P(Africa) = .01 * .05 / .2 = .0025 maxSub(CAR, horror)= P(Bangui horror) = .0004 minSuper(CAR, horror)= P(Africa horror) P(CAR) / P(Africa) = .01 * .05 / .2 = .0025 minSub(CAR, horror) = P(Bangui horror) + ( P(CAR-Bangui horror) - P(CAR-Bangui)P(horror) ) = .0004 + ( .0096 - .048 *.02)= .0004 maxSuper(Africa, horror)= no given data maxSub(Africa, horror)= P(CAR horror) = .01 minSuper(Africa, horror)= no given data minSub(Africa, horror)= P(CAR horror) + ( P(Africa-CAR horror) - P(Africa-CAR)P(horror) ) = .01 + ( 0 - .15 *.02) = .007 maxSuper(Bangui, horror)= P(CAR horror) P(Bangui) / P(CAR) = .01 * .002/.05 = .0004 maxSub(Bangui, horror)= no given data minSuper(Bangui, horror)= P(Africa horror) P(Bangui) / P(Africa) = .01 * .002/.2 = .0001 minSub(Bangui, horror)= no given data
Thus we calculate
imaxP(Africa, horror) = .01 iminP(Africa, horror) = .004 imaxP(CAR, horror) = .0025 iminP(CAR, horror) = .0004 imaxP(Bangui, horror) = .0004 iminP(Bangui, horror) = .00004
and we find
iisurp(Africa,horror) = .01 - .01 = 0 iisurp(CAR, horror) = .01 - .001 = .0099 iisurp(Bangui, horror) = .0004 - .0004 = 0
and
niisurp(Africa,horror) = 0 niisurp(CAR, horror) = .0099 / .01 = 0.99 niisurp(Bangui, horror) = 0
... which agrees with the intuitive treatment of the example!
From Simple Examples to the Atomspace
In reality, to express the patterns from this example in the Atomspace, we'd have something like
H: Inheritance $var1 human A: Evaluation lives_in $var1 CAR B: Inheritance $var1 experienced_horror
and the surprising pattern would be HAB ... but the math still works that way...