QuantitativePredicate


Basic Idea

I think we may want to introduce a particular notion of QuantitativePredicate, to simplify the treatment of predicates referring to quantitative features of entities (most simply, e.g. Height, Weight, Age, etc.).

Taking Height as an example, to model height in the Unity3D game world, we could have a GameWorldHeight schema, so that e.g.

ExecutionLink GameWorldHeight tree_77 5

would mean that tree_77 in the game world has a height of 5 units. This is shorthand for

ExecutionLink 
   SchemaNode: GameWorldHeight
   ConceptNode: tree_77 
   NumberNode: 5

(Note that the NumberNode named "5" has the number as its name, not its truth value.)

(We could also have an (InheritanceLink GameWorldHeight Height) link, to indicate that GameWorldHeight is an instance of the general concept of Height...)

But we could also have a GameWorldHeightPredicate defined so that, say

EvaluationLink GameWorldHeightPredicate tree_77 <.7>

The truth value of the GameWorldHeightPredicate would be determined by quantile normalization from the observed values of the GameWorldHeight SchemaNode (i.e. quantile normalization from the observed distribution of GameWorldHeight, onto the uniform distribution over the interval [0,1]) ...
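As a concrete (and deliberately simple) illustration of what quantile normalization means here, the sketch below computes the strength directly from the empirical CDF of previously observed values. It is a minimal Python sketch over assumed data; the function name quantile_normalize and the sample heights are hypothetical, and this is not an existing OpenCog API. The SVD/QVD machinery described later on this page is essentially a memory-bounded approximation of this idea.

# Minimal sketch of quantile normalization (hypothetical names, not an OpenCog API).
def quantile_normalize(value, observed_values):
    """Map a raw schema output to [0,1] via the empirical CDF of past observations."""
    if not observed_values:
        return 0.5  # no data yet; return a neutral default
    below_or_equal = sum(1 for v in observed_values if v <= value)
    return below_or_equal / len(observed_values)

# Hypothetical sample of previously observed GameWorldHeight values:
heights = [1, 2, 2, 3, 4, 5, 6, 8, 10, 12]
print(quantile_normalize(5, heights))   # 0.6 -- 60% of observed heights are <= 5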

We would then have a QuantitativePredicateLink, defined so that e.g.

QuantitativePredicateLink GameWorldHeight GameWorldHeightPredicate

There would have to be some QuantitativePredicateUpdater process that updates the quantile normalizations associated with such links periodically.

Next, we could introduce explicit SchemaNodes such as

HeightChange
SizeChange

etc.

So e.g. the

ExecutionLink HeightChange (tree_77 time_1 time_2) 4

would mean that the output of the schema HeightChange, applied to tree_77 between times time_1 and time_2, is 4; i.e., that the height of tree_77 changed by 4 units between time_1 and time_2.

We could then have an associated predicate, HeightChangePredicate, connected to HeightChange via a QuantitativePredicateLink...

Finally, this sets us up to have the system react automatically to surprising changes in quantitative predicates. We can have a process that, for any important predicate, assigns a high STI to instances that give the predicate statistically unlikely values...

So if we have

ExecutionLink HeightChange tree_77 time_1 time_2 4

which is quantile-normalized to

EvaluationLink HeightChangePredicate tree_77 time_1 time_2 <.9>

i.e.

EvaluationLink <.9>
   PredicateNode: HeightChangePredicate 
   ConceptNode: tree_77 
   TimeNode: time_1 
   TimeNode: time_2 

and a height change of 4 (corresponding to truth value .9) is very unusual in the system's memory, then this EvaluationLink would get a high STI...

This requires a probability distribution approximation to be retained for HeightChangePredicate (e.g. a list of values approximating a probability distribution), but it seems reasonable for such a distribution to be maintained for any predicate whose truth value is determined by a QuantitativePredicateLink...
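A rough sketch of what such a "surprise detector" might look like is given below, assuming the retained distribution is available as a list of (value, count) pairs as described in the More Details section. The function names (surprise, maybe_boost_sti), the 25% nearness criterion and the threshold/boost constants are all hypothetical illustrative choices, not part of OpenCog's attention allocation code.

# Hedged sketch of boosting STI for statistically unlikely values (hypothetical names).
def surprise(value, svd_pairs):
    """svd_pairs: list of (v_i, n_i) exemplar/count pairs; returns a surprise score in [0,1]."""
    total = sum(n for _, n in svd_pairs)
    if total == 0:
        return 0.0
    # fraction of remembered observations whose exemplar lies within 25% of the new value
    near = sum(n for v, n in svd_pairs if abs(v - value) <= 0.25 * abs(value))
    return 1.0 - near / total

def maybe_boost_sti(current_sti, value, svd_pairs, threshold=0.95, boost=100):
    """Give the corresponding EvaluationLink extra STI if the value is very unusual."""
    return current_sti + boost if surprise(value, svd_pairs) > threshold else current_sti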

More Details

New Entities

As noted above, we need to introduce a few new Node types,

  • QuantitativeSchemaNode (QSN -- a subclass of SchemaNode)
  • QuantitativePredicateNode (QPN -- a subclass of PredicateNode)

plus the new link types

  • QuantitativePredicateLink (for linking between a QPN and the corresponding QSN)
  • For representing SchemaValueDistributions (SVDs)
    • SchemaValueRecordLink and SchemaValueListLink
  • For representing Quantized Value Distributions (QVDs)
    • QuantizedValueDistributionLink and QuantizedValueListLink

The idea is:

  • Each QuantitativeSchemaNode is associated with an SVD
  • Each QuantitativePredicateNode is associated with a QVD

Since Atoms don't contain large objects in the current OpenCog design, the suggestion is to implement the SVD and QVD mathematical/conceptual structures using additional Atom types (see below).

SchemaValueDistributions and their Implementation in Atoms

We now describe the general idea of a SchemaValueDistribution, and indicate how this kind of object can be implemented using Atoms, via SchemaValueRecordLinks and SchemaValueListLinks.

SchemaValueDistributions

A SchemaValueDistribution, conceptually, is an entity that contains an approximate representation of the set of observed values that a given schema has been known to take. For instance, for the QSN GameWorldHeight, it represents the set of observed values that GameWorldHeight has been observed to take, i.e. the set of v that have occurred in links of the form

GameWorldHeight X v

for various X.

One likely-suitable approximate representation would be an ordered list of K pairs

(value, count)

The first K values v_1, …, v_K observed for the schema would be entered into the table as entries

(v_i, 1)

Then, when a new value v is observed for the schema, what happens is:

  • the v_i which is *closest* to v is selected
  • the substitution (v_i, n_i) ==> (v_i_new, n_i + 1) is made

where

v_i_new = [v_i * n_i / (n_i + 1)] + [v / (n_i + 1)]

That is: v_i is moved a little bit in the direction of v.

The result of this algorithm is that the K pairs (v_i, n_i) contain an approximate representation of the observed distribution of values of the schema.

(Note, this is much like a 1-dimensional, 1-layer version of the clustering inside DeSTIN.)

K=20 would be a reasonable initial value, I guess. K=100 would be OK too.
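A minimal Python sketch of this update heuristic follows, assuming the SVD is held as a list of [v_i, n_i] pairs. The function name update_svd is hypothetical; this is illustrative pseudocode rather than an existing OpenCog module.

# Sketch of the SVD update heuristic described above (hypothetical names).
def update_svd(svd, value, K=20):
    """svd: list of [v_i, n_i] pairs, modified in place and returned."""
    if len(svd) < K:
        svd.append([value, 1])   # the first K observed values are entered directly
        return svd
    # select the exemplar v_i closest to the new value
    i = min(range(len(svd)), key=lambda j: abs(svd[j][0] - value))
    v_i, n_i = svd[i]
    # move v_i a little bit in the direction of value, and increment its count
    svd[i][0] = v_i * n_i / (n_i + 1) + value / (n_i + 1)
    svd[i][1] = n_i + 1
    return svd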

Implementing SchemaValueDistributions Using Atoms

For implementation simplicity, we suggest implementing SchemaValueDistribution objects as links, using a new link type called a SchemaValueListLink.

As opposed to introducing a new kind of software object for SVDs, this strategy makes life simpler in various ways, e.g. there are already tools for loading and persisting and querying links in the Atomspace.

A NumberNode is a node whose name is a number.

A SchemaValueListLink contains 2K targets which are NumberNodes, representing the K pairs (v_i, n_i) mentioned above.

A SchemaValueRecordLink connects a QSN to a corresponding SchemaValueListLink, i.e.

SchemaValueRecordLink
    QuantitativeSchemaNode: QS
    SchemaValueListLink: SVL

where the SVL contains a list of the form (v_1, n_1, v_2, n_2, ..., v_K, n_K), using the notation indicated above.

QuantizedValueDistributions

A QuantizedValueDistribution, associated with a Quantitative Predicate, contains a value N, and then a list of values w_0, …, w_N, with w_i indicating the value v such that a fraction i/N of the observed values of the corresponding Quantitative Schema are less than v.

A function

quantize(QuantitativeSchemaNode QSN)

can be written which transforms the SVD associated with a QSN into a QVD, via quantile normalization. That is: the entry (v_i, n_i) in the SVD, should be taken as a proxy for n_i different observations with value v_i. Then, all of these (n_1 + … + n_K) observations should be viewed implicitly as a sorted list in ascending order, and the quantile boundary values w_i should be determined.

We could start with N=20 quantiles, for instance.
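A hedged Python sketch of such a quantize function is given below, operating directly on the (v_i, n_i) pairs rather than on their Atom representation. The function name and the choice of averaging the two observations adjacent to each bin boundary are illustrative assumptions (the same convention is used in the worked example later on this page).

# Sketch of quantile normalization from an SVD to a QVD (hypothetical names).
def quantize(svd, N=20):
    """svd: list of (v_i, n_i) pairs; returns quantile boundary values w_0 ... w_N."""
    # treat each (v_i, n_i) entry as n_i implicit observations with value v_i
    observations = sorted(v for v, n in svd for _ in range(n))
    total = len(observations)
    assert total >= N, "need at least N remembered observations"
    w = [observations[0]]                      # w_0 = smallest remembered value
    for i in range(1, N):
        idx = i * total // N
        # boundary between adjacent bins: average the two neighbouring observations
        w.append((observations[idx - 1] + observations[idx]) / 2)
    w.append(observations[-1])                 # w_N = largest remembered value
    return w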

Implementation of QVDs in terms of Links

In implementation terms, a QuantizedValueDistribution is simply a list of NumberNodes; which NumberNodes appear in the list may change over time, as the quantile boundaries are re-estimated. We may define a QuantizedValueDistributionLink, and then say

QuantizedValueDistributionLink
   QuantitativePredicateNode: QPN
   QuantizedValueListLink: quantized_value_distribution

The QuantizedValueListLink, in this case, lists NumberNodes corresponding to the values w_i mentioned above; i.e. it directly contains the quantized value distribution. The number of entries in the QuantizedValueListLink is N+1, where N is the number of quantiles.

Using these Distributions

How should these constructs be used?

Basically: When a new link of the form

ExecutionLink
   QuantitativeSchemaNode: Q
   A
   v

is created, then the SVD associated with Q should be updated; and if there is any link of the form

QuantitativePredicateLink Q P

then the QVD associated with P should also be updated.

This process can likely be optimized, but it seems better to get it in place in a simple form first, and then optimize it later if it's actually a speed bottleneck.
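In rough pseudo-Python, the flow might look like the sketch below; the Atomspace lookups are elided, and svd_table, qvd_table and links are hypothetical stand-ins for however the SVDs, QVDs and QuantitativePredicateLinks are actually retrieved.

# Rough sketch of the update flow (hypothetical names; Atomspace access not shown).
def on_new_execution_link(schema_Q, entity_A, value_v, svd_table, qvd_table, links):
    update_svd(svd_table[schema_Q], value_v)        # update_svd: see the earlier sketch
    if schema_Q in links:                           # a QuantitativePredicateLink Q P exists
        predicate_P = links[schema_Q]
        qvd_table[predicate_P] = quantize(svd_table[schema_Q])   # quantize: earlier sketch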

Then, when a QuantitativePredicate P is evaluated via

EvaluationLink P X <s>

the truth value strength s can be determined by looking at the QVD corresponding to P, and determining which quantile contains the value v for which (ExecutionLink Q X v) holds, where Q is the QuantitativeSchemaNode corresponding to P. One can then set s equal to, perhaps, a weighted average of the strengths corresponding to the two w_i that v lies between (with the w_i that v lies closer to getting the higher weight).
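A minimal sketch of this strength computation, as piecewise-linear interpolation between the quantile boundaries, might look as follows (hypothetical function name; the handling of values outside the recorded range anticipates the GDP example below):

# Sketch of computing a strength s from a QVD (hypothetical names).
def strength_from_qvd(value, w):
    """w: quantile boundary values w_0 ... w_N; returns a strength in [0, 1]."""
    N = len(w) - 1
    if value <= w[0]:
        return 0.0
    if value >= w[N]:
        return 1.0
    for i in range(1, N + 1):
        if value <= w[i]:
            if w[i] == w[i - 1]:
                return i / N
            # weighted average of the strengths (i-1)/N and i/N, weighted by
            # how close the value is to w[i-1] and w[i] respectively
            t = (value - w[i - 1]) / (w[i] - w[i - 1])
            return ((i - 1) + t) / N
    return 1.0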

Testing

This process could be tested by loading some tables of quantitative values into OpenCog, e.g. the GDP of countries as reported at

http://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)

In this case, one would create ConceptNodes for the different countries, and

QuantitativeSchemaNode: GDP_Schema

QuantitativePredicateNode: GDP_Predicate

QuantitativePredicateLink GDP_Schema GDP_Predicate

One would have, for instance

ExecutionLink
  GDP_Schema
  United_States
  14991300000000

and presumably

EvaluationLink <.98>
  GDP_Predicate
  United_States

or something like that (the .98 being a weighted average of the quantile boundaries 1 and .95, if we have N=20).

OTOH we'd have

ExecutionLink
  GDP_Schema
  Estonia
  22399000000

and something like

EvaluationLink <around .5>
  GDP_Predicate
  Estonia

representing the fact that Estonia's GDP is of middle rank (it's 101 out of 192 on Wikipedia's list).

Of course, this kind of rank ordering is only one way to convert schema values to predicate values, and not always the best way. We will probably end up playing with other ways of doing it too. But the goal here is to implement one initial approach and get it working…

A More Detailed Example

To make these ideas clearer, let's flesh out a particular example further.

Consider the case of GDP_Schema. Let K=5 for illustration purposes, though this is smaller than I'd recommend for real usage.

Suppose the first 5 GDP_Schema observations received by the Atomspace are

ExecutionLink
  GDP_Schema
  Estonia
  22399000000

ExecutionLink
  GDP_Schema
  Iran
  521835000000

ExecutionLink
  GDP_Schema
  Colombia
  333185000000

ExecutionLink
  GDP_Schema
  Brazil
  2476651000000

ExecutionLink
  GDP_Schema
  Oman
  76460000000

Then initially we will have a SchemaValueDistribution of the form

v_1 = 22399000000
n_1 = 1
v_2 = 76460000000
n_2 = 1
v_3 = 333185000000
n_3 = 1
v_4 = 521835000000
n_4 = 1
v_5 = 2476651000000
n_5 = 1

associated with GDP_Schema.

Suppose next the system observes

ExecutionLink
  GDP_Schema
  Morocco
  96130000000

Then we have v = 96130000000, and we need to use this information to modify the value distribution.

Of the existing v_i values, v is closest to v_2. So, according to the heuristic algorithm outlined above, v_2 will be replaced with a weighted average of the previous v_2 and v. In this case, since the existing v_2 is based on only one observation, the weights in the average will be equal. We'll have

v_2_new = .5 * v_2_old + .5 * v = 86295000000

and

n_2_new = n_2_old + 1 = 2


Now the SchemaValueDistribution for GDP_Schema looks like

v_1 = 22399000000
n_1 = 1
v_2 = 86295000000
n_2 = 2
v_3 = 333185000000
n_3 = 1
v_4 = 521835000000
n_4 = 1
v_5 = 2476651000000
n_5 = 1

so we would have

SchemaValueRecordLink
    SchemaNode: GDP_Schema
    SchemaValueListLink
         NumberNode: 22399000000
         NumberNode: 1
         NumberNode: 86295000000
         NumberNode: 2
         NumberNode: 333185000000
         NumberNode: 1
         NumberNode: 521835000000
         NumberNode: 1
         NumberNode: 2476651000000
         NumberNode: 1


That is: this SVD consists of: One observation at around the given value v_1, two observations at around the given value v_2, and one observation at around each of the given values v_3, v_4, v_5.
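As a quick sanity check, the update_svd sketch given earlier reproduces this step (again, hypothetical illustrative code, not an OpenCog API):

# Reproducing the Morocco update with the earlier update_svd sketch.
svd = [[22399000000, 1], [76460000000, 1], [333185000000, 1],
       [521835000000, 1], [2476651000000, 1]]
update_svd(svd, 96130000000, K=5)
print(svd[1])   # [86295000000.0, 2]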

After some more observations (say, a total of 20), the SVD for GDP_Schema might look something like

v_1 = 8299000000
n_1 = 3
v_2 = 96285000000
n_2 = 5
v_3 = 234873000000
n_3 = 2
v_4 = 432654000000
n_4 = 4
v_5 = 3217651000000
n_5 = 6

Note that, even if there were 20000 observations, the SVD would still contain only 10 numbers. That's the point of the SVD representation proposed. It's more compact than having an SVD simply contain a list of all values ever observed for the schema in question. Further, note that it will generally contain information from many specific observations that have long been forgotten from the Atomspace.

Now, how would this SVD get mapped into a QVD? Suppose the number of quantiles, N, equals 4 (again just for simplicity, normally I'd recommend N=10 at least).

We can view the SVD as consisting of 20 observations of the form

8299000000 
8299000000 
8299000000
96285000000	 
96285000000	 
96285000000	 
96285000000	 
96285000000	
234873000000 
234873000000
432654000000 
432654000000 
432654000000 
432654000000
3217651000000 
3217651000000 
3217651000000 
3217651000000 
3217651000000 
3217651000000

(of course it's not necessary to explicitly represent it this verbosely in the software). The actual 20 observations made probably all had different values, but this list is a reasonable approximation, obtained by replacing each observation value with the exemplar value that represents it.

We can then divide these values into N=4 equally-sized bins, i.e.

BIN 0:
8299000000 
8299000000 
8299000000
96285000000	 
96285000000	 

BIN 1:
96285000000	 
96285000000	 
96285000000	
234873000000 
234873000000

BIN 2:
432654000000 
432654000000 
432654000000 
432654000000
3217651000000 

BIN 3:
3217651000000 
3217651000000 
3217651000000 
3217651000000 
3217651000000

We can then set

w_0 = 8299000000 , corresponding to: strength = 0
w_1 = 96285000000, corresponding to: strength = .25
w_2 = 333763500000, corresponding to: strength = .5
w_3 = 3217651000000, corresponding to: strength = .75
w_4 = 3217651000000, corresponding to: strength = 1

Note that for w_2 I've averaged the top value of BIN 1 and the bottom value of BIN 2. This is a reasonable heuristic to use, though just taking the bottom value of BIN 2 or the top value of BIN 1 wouldn't be terribly bad either. (In the other cases, for this particular example dataset, the nearby top and bottom bin values happened to be identical.)

This list of w_i is the QuantizedValueDistribution to be associated with the QuantitativePredicateNode GDP_Predicate.

So we can now say

QuantizedValueDistributionLink
    QuantitativePredicateNode: GDP_Predicate
    QuantizedValueListLink
        NumberNode: 8299000000
        NumberNode: 96285000000
        NumberNode: 333763500000
        NumberNode: 3217651000000
        NumberNode: 3217651000000
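As a sanity check, the quantize sketch given earlier reproduces these boundary values from the K=5 SVD above (hypothetical code, not an OpenCog API):

# Reproducing the N=4 QVD from the example SVD via the earlier quantize sketch.
svd = [(8299000000, 3), (96285000000, 5), (234873000000, 2),
       (432654000000, 4), (3217651000000, 6)]
print(quantize(svd, N=4))
# [8299000000, 96285000000.0, 333763500000.0, 3217651000000.0, 3217651000000]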

Given this, we can assign truth values to links of the form

EvaluationLink GDP_Predicate X

for any X.

For instance, suppose the system observes

ExecutionLink
  GDP_Schema
  United_States
  14991300000000

ExecutionLink
  GDP_Schema
  Nigeria
  245229000000

ExecutionLink
  GDP_Schema
  Tonga
  453000000

Then this would yield

EvaluationLink <1>
  GDP_Predicate
  United_States

EvaluationLink <.4>
  GDP_Predicate
  Nigeria

EvaluationLink <0>
  GDP_Predicate
  Tonga

Here the first link gets a strength value of 1 because the corresponding schema output value (14991300000000) is above the top of the recorded QVD. Similarly the third link gets a strength value of 0 because the corresponding schema output value is below the bottom of the recorded QVD.

On the other hand, the middle link gets a strength value of .4 because of the following calculation:

(333763500000 - 245229000000) / (333763500000 - 96285000000) = .37

(245229000000-96285000000 ) / (333763500000 - 96285000000) = .63

.63 * .5 + .37 * .25 = .4075, i.e. approximately .4

In other words, the schema output value here ( 245229000000) lies between w_1 (corresponding to .25) and w_2 (corresponding to .5), and thus gets a strength value that is a weighted average of .25 and .5. The weights are determined by how close the output value is to w_1 and w_2 respectively.
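Using the strength_from_qvd sketch from earlier, the same interpolation, done without the intermediate rounding of the weights, gives approximately .41 (which the text above rounds to .4):

# The GDP examples, via the earlier strength_from_qvd sketch.
w = [8299000000, 96285000000, 333763500000, 3217651000000, 3217651000000]
print(round(strength_from_qvd(245229000000, w), 2))   # 0.41 (Nigeria; rounded to .4 above)
print(strength_from_qvd(14991300000000, w))           # 1.0  (United_States)
print(strength_from_qvd(453000000, w))                # 0.0  (Tonga)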

It may seem wrong to assign the US a GDP strength of 1 and Tonga a GDP strength of 0. But recall that the truth values are actually supposed to be interpreted as probability distributions. Only if the confidence value of these links were 1 would the semantics actually be that the US has GDP strength exactly 1 and Tonga has GDP strength exactly 0. With confidence less than 1, the GDP strength of the US is modeled as a distribution concentrated near 1, with density increasing as one approaches 1; and the GDP strength of Tonga is modeled as a distribution concentrated near 0, with density decreasing as one moves away from 0.

What should the confidence values be, here? The simplest thing is to assign a count value of n=20, since 20 observations underlie the SVD used to generate the QVD. If we have a personality parameter of k=10, then this gives a confidence c = 20 / (20+10) = .66... so we have

EvaluationLink <1,.66>
  GDP_Predicate
  United_States

EvaluationLink <.4,.66>
  GDP_Predicate
  Nigeria

EvaluationLink <0,.66>
  GDP_Predicate
  Tonga

This completes the example.