# Basic Idea

I think we may want to introduce a particular notion of QuantitativePredicate, to simplify treatment of predicates referring to quantitative features of entities (most simply, e.g. Height, Weight, Age, etc.)

Taking Height as an example, to model height in the Unity3D game world, we could have a GameWorldHeight schema, so that e.g.

```ExecutionLink GameWorldHeight tree_77 5
```

would mean that tree_77 in the game world has a height of 5 units. This shorthand means

```ExecutionLink
SchemaNode: GameWorldHeight
ConceptNode: tree_77
NumberNode: 5
```

(Note that the NumberNode named "5" has the number as its name, not its truth value.)

(we could also have an (InheritanceLink GameWorldHeight Height) link, to illustrate that the GameWorldHeight is an instance of the general concept of Height...)

But we could also have a GameWorldHeightPredicate defined so that, say

```EvaluationLink GameWorldHeightPredicate tree_77 <.7>
```

The truth value of the GameWorldHeightPredicate would be determined by quantile normalization from the observed values of the GameWorldHeight SchemaNode (i.e. quantile normalization from the observed distribution of GameWorldHeight, onto the uniform distribution over the interval [0,1]) ...

We would then have a QuantitativePredicateLink, defined so that e.g.

```QuantitativePredicateLink GameWorldHeight GameWorldHeightPredicate
```

There would have to be some QuantitativePredicateUpdater process that updates the quantile normalizations associated with such links periodically.

Next, we could introduce explicit SchemaNodes such as

```HeightChange
SizeChange
```

etc.

So e.g. the

```ExecutionLink HeightChange (tree_77 time_1 time_2) 4
```

would mean that the output of schema HeightChange applied to tree_77 between time time_1 and time time_2, is 4...... This would mean that the height of tree_77 changed by amount 4 between time_1 and time_2 ....

We could then have an associated predicate, HeightChangePredicate, connected to HeightChange via a QuantitativePredicateLink...

Finally, this sets us up to have the system react automatically to surprising changes in quantitative predicates. We can have a process that, for any important predicate, assigns a high STI to instances that give the predicate statistically unlikely values...

So if we have

```ExecutionLink HeightChange tree_77 time_1 time_2 4
```

which is quantile-normalized to

```EvaluationLink HeightChangePredicate tree_77 time_1 time_2 <.9>
```

i.e.

```EvaluationLink <.9>
PredicateNode: HeightChangePredicate
ConceptNode: tree_77
TimeNode: time_1
TimeNode: time_2
```

and a height change of 4 (corresponding to truth value .9) is very unusual in the system's memory, then this EvaluationLink would get a high STI...

This requires a probability distribution approximation to be retained for HeightChangePredicate (e.g. a list of values approximating a probability distribution), but it seems reasonable for such a distribution to get maintained for any predicate whose truth value is determined by a QuantitativePredicateLink...

# More Details

## New Entities

As noted above, we need to introduce a few new Node types,

• QuantitativeSchemaNode (QSN -- a subclass of SchemaNode)
• QuantitativePredicateNode (QPN -- a subclass of PredicateNode)

• For representing SchemaValueDisributions (SVDs)
• For representing Quantized Value Distributions (QVDs)

The idea is:

• Each QuantitativeSchemaNode is associated with an SVD
• Each QuantitativePredicateNode is associated with a QVD

As Atoms don't contain big objects in the current OpenCog design, what is suggested is to implement the SVD and QVD mathematical/conceptual structures using additional Atom types (see below).

## SchemaValueDistributions and their Implementation in Atoms

We now describe the general idea of a SchemaValueDistribution, and an indication of how to implement this kind of object using Atoms via SchemaValueListLinks and SchemaValueNearestLinks.

### SchemaValueDistributions

A SchemaValueDistribution, conceptually, is an entity that contains an approximate representation of the set of observed values that a given schema has been known to take. For instance, for the QSN GameWorldHeight, it represents the set of observed values that GameWorldHeight has been observed to take, i.e. the set of v that have occurred in links of the form

```GameWorldHeight X v
```

for various x.

One likely-suitable approximate representation would be using an ordered list of K pairs

```(value, count)
```

The first K values v_1, …, v_K observed for the schema would be entered into the table as entries

```(v_i, 1)
```

Then, when a new value v is observed for the schema, what happens is:

• the v_i which is *closest* to v is selected
• the substitution (v_i, n_i) ==> (v_i_new, n_i +1 ) is made

where

```v_i_new = [v_i * n_i / (n_i +1)] + v / (n_i +1)
```

That is: v_i is moved a little bit in the direction of v

The result of this algorithm is that the K pairs (v_i, n_i) contain an approximate representation of the observed distribution of values of the schema.

(Note, this is much like a 1 dimensional, 1 layer version of the clustering inside DeSTIN.)

K=20 would be a reasonable initial value, I guess. K=100 would be OK too.

### Implementing SchemaValueDistributions Using Atoms

For implementation simplicity, we suggest implementing SchemaValueDistribution objects as links, using a new link called a SchemaValueListLink

As opposed to introducing a new kind of software object for SVDs, this strategy makes life simpler in various ways, e.g. there are already tools for loading and persisting and querying links in the Atomspace.

A NumberNode is a node whose name is a number

A SchemaValueListLink contains 2K targets which are NumberNodes, representing the K values v_i, n_i mentioned above.

```SchemaValueNearestLink
QuantitativeSchemaNode: QS
```

where the SVL contains a list of the form (v_1, n_1, v_2, n_2, ..., v_K, n_K), using the notation indicated above.

## QuantizedValueDistributions

A QuantizedValueDistribution, associated with a Quantitative Predicate, contains a value N, and then a list of values w_0, … w_N (which sum to 1), with w_i indicating the value v so that i/N percent of the values of the corresponding Quantitative Schema are less than v.

A function

```quantize(QuantitativeSchemaNode QSN)
```

can be written which transforms the SVD associated with a QSN into a QVD, via quantile normalization. That is: the entry (v_i, n_i) in the SVD, should be taken as a proxy for n_i different observations with value v_i. Then, all of these (n_1 + … + n_K) observations should be viewed implicitly as a sorted list in ascending order, and the quantile boundary values w_i should be determined.

### Implementation of QVDs in terms of Links

A QuantizedValueDistribution is simply a list of NumberNodes, which may have different values at different points in time. We may define a QuantizedValueDistributionLink, and then say

```QuantizedValueDistributionLink
QuantitativePredicateNode: QPN
```

The SchemaValueListLink, in this case, lists NumberNodes corresponding to the values w_i mentioned above; i.e. it directly contains the quantized value distribution. The number of entries in the SchemaValueListLink is the number of quantiles.

## Using these Distributions

How should these constructs be used?

Basically: When a new link of the form

```ExecutionLink
QuantitativeSchemaNode: Q
A
v
```

is created, then the SVD associated with Q should be updated; and if there is any link of the form

```QuantitativePredicateLink Q P
```

then the QVD associated with P should also be updated.

This process can likely be optimized, but it seems better to get it in place in a simple form first, and then optimize it later if it's actually a speed bottleneck.

Then, when a QuantitativePredicate P is evaluated via

```EvaluationLink P X <s>
```

the truth value strength s can be determined via looking at the QVD corresponding to P and determining what quantile the value v so that (EvaluationLink Q X v) lies in, for the QuantitativeSchemaNode Q corresponding to P. One can then set s equal to, perhaps, a weighted average of the truth values corresponding to the two w_i that v lies between. (The w_i whose corresponding value v is closer to, would get higher weight.)

## Testing

This process could be tested by loading in some tables of quantitative values into OpenCog, e.g. GDP of countries as reported in

In this case, one would create ConceptNodes for the different countries, and

```QuantitativeSchemaNode: GDP_Schema

QuantitativePredicateNode: GDP_Predicate

```

One would have, for instance

```ExecutionLink
GDP_Schema
United_States
14991300000000
```

and presumably

```EvaluationLink <.98>
GDP_Predicate
United_States
```

or something like that (the .99 being a weighted average of the quantile boundaries 1 and .95, if we have N=20.)

OTOH we'd have

```ExecutionLink
GDP_Schema
Estonia
22399000000
```

and something like

```EvaluationLink <around .5>
GDP_Predicate
Estonia
```

representing the fact that Estonia's GDP is middle rank (it's 101 out of 192 on Wikipedia's list)

Of course, this kind of rank ordering is only one way to convert schema values to predicate values, and not always the best way. We will probably end up playing with other ways of doing it too. But the goal here is to implement one initial approach and get it working…

## A More Detailed Example

To make these ideas clearer, let's flesh out a particular example further.

Consider the case of GDP_Schema. Let K=5 for illustration purposes, though this is smaller than I'd recommend for real usage.

Suppose the first 5 GDP_Schema evaluations received by the Atomspace are

```ExecutionLink
GDP_Schema
Estonia
22399000000

GDP_Schema
Iran
521835000000

GDP_Schema
Colombia
333185000000

GDP_Schema
Brazil
2476651000000

GDP_Schema
Oman
76460000000

```

Then initially we will have a SchemaValueDistribution of the form

```v_1 = 22399000000
n_1 =1
v_2 = 76460000000
n_2 =1
v_3=  333185000000
n_3 =1
v_4=  521835000000
n_4 =1
v_5=  2476651000000
n_5 =1
```

associated with GDP_Schema

Suppose next the system observes

```ExecutionLink
GDP_Schema
Morocco
96130000000
```

Then we have v =96130000000, and we need to use this information to modify the value distribution.

Of the existing v_i values, v is closest to v_2. So, according to the heuristic algorithm outlined above, v_2 will be replaced with a weighted average of {the previous v_2 with v}. In this case, since the existing v_2 is based on only one observation, the weights in the average will be equal. We'll have

```v_2_new = .5 * v_2_old + .5*v = 86295000000

and

n_2_new = n_2_old +1 = 2
```

Now the SchemaValueDistribution for GDP_Schema looks like

```v_1 = 22399000000
n_1 =1
v_2 = 86295000000
n_2 =2
v_3=  333185000000
n_3 =1
v_4=  521835000000
n_4 =1
v_5=  2476651000000
n_5 =1
```

so we would have

```SchemaValueRecordLink
SchemaNode: GDP_Schema
NumberNode: 22399000000
NumberNode: 1
NumberNode: 86295000000
NumberNode: 2
NumberNode: 333185000000
NumberNode: 1
NumberNode: 521835000000
NumberNode: 1
NumberNode: 2476651000000
NumberNode: 1
```

That is: this SVD consists of: One observation at around the given value v_1, two observations at around the given value v_2, and one observation at around each of the given values v_3, v_4, v_5.

After some more observations (say, a total of 20), the SVD for GDP_Schema might look something like

```v_1 =8299000000
n_1 =3
v_2 = 96285000000
n_2 =5
v_3=  234873000000
n_3 =2
v_4=  432654000000
n_4 =4
v_5=  3217651000000
n_5 =6
```

Note that, even if there were 20000 observations, the SVD would still contain only 10 numbers. That's the point of the SVD representation proposed. It's more compact than having an SVD simply contain a list of all values ever observed for the schema in question. Further, note that it will generally contain information from many specific observations that have long been forgotten from the Atomspace.

Now, how would this SVD get mapped into a QVD? Suppose the number of quantiles, N, equals 4 (again just for simplicity, normally I'd recommend N=10 at least).

We can view the SVD as consisting of 20 observations of the form

```8299000000
8299000000
8299000000
96285000000
96285000000
96285000000
96285000000
96285000000
234873000000
234873000000
432654000000
432654000000
432654000000
432654000000
3217651000000
3217651000000
3217651000000
3217651000000
3217651000000
3217651000000
```

(of course it's not necessary to explicitly represent it this verbosely in the software). The actual 20 observations made probably all had different values, but this list is a reasonable approximation, obtained by replacing each observation value with the exemplar value that represents it.

We can then divide these values into N=4 equally-sized bins, i.e.

```BIN 0:
8299000000
8299000000
8299000000
96285000000
96285000000

BIN 1:
96285000000
96285000000
96285000000
234873000000
234873000000

BIN 2:
432654000000
432654000000
432654000000
432654000000
3217651000000

BIN 3:
3217651000000
3217651000000
3217651000000
3217651000000
3217651000000
```

We can then set

```w_0 = 8299000000 , corresponding to: strength = 0
w_1 = 96285000000, corresponding to: strength = .25
w_2 = 333763500000, corresponding to: strength = .5
w_3 = 3217651000000, corresponding to: strength = .75
w_4 = 3217651000000, corresponding to: strength = 1
```

Note that for w_2 I've averaged the top value of BIN 1 and the bottom value of BIN 2. This is a reasonable heuristic to use, though just taking the bottom value of BIN 2 or the top value of BIN 1 wouldn't be terribly bad either. (In the other cases, for this particular example dataset, the nearby top and bottom bin values happened to be identical.)

This list of w_i is the PredicateValueDistribution to be associated with the QuantitativePredicateNode GDP_Predicate.

So we can now say

```QuantizedValueDistributionLink
QuantitativePredicateNode: GDP_Predicate
NumberNode: 8299000000
NumberNode: 96285000000
NumberNode: 333763500000
NumberNode: 3217651000000
NumberNode: 3217651000000
```

Given this, we can assign truth values to links of the form

```EvaluationLink GDP_Predicate X
```

for any X.

For instance, suppose the system observes

```ExecutionLink
GDP_Schema
United_States
14991300000000

GDP_Schema
Nigeria
245229000000

GDP_Schema
Tonga
453000000
```

Then this would yield

```EvaluationLink <1>
GDP_Predicate
United_States

GDP_Predicate
Nigeria

GDP_Predicate
Tonga
```

Here the first link gets a strength value of 1 because the corresponding schema output value (14991300000000) is above the top of the recorded QVD. Similarly the third link gets a strength value of 0 because the corresponding schema output value is below the bottom of the recorded QVD.

On the other hand, the middle link gets a strength value of .4 because of the following calculation:

```(333763500000 - 245229000000) / (333763500000 - 96285000000) = .37

(245229000000-96285000000 ) / (333763500000 - 96285000000) = .63

.63 * .5 + .37 * .25 = .4
```

In other words, the schema output value here ( 245229000000) lies between w_1 (corresponding to .25) and w_2 (corresponding to .5), and thus gets a strength value that is a weighted average of .25 and .5. The weights are determined by how close the output value is to w_1 and w_2 respectively.

It may seem wrong to assign the US a GDP strength of 1 and Tonga a GDP strength of 0. But recall that the truth values are actually supposed to be interpreted as probability distributions. Only if the confidence value of these links was 1, would the semantics actually be that the US has GDP strength exactly 1 and Tonga has GDP strength exactly 0. With confidence less than 1, the GDP strength of the US is modeled as a distribution centered close to 1 and monotonically increasing as 1 is increased toward; and the GDP strength of Tonga is modeled as a distribution centered close to 0 and monotonically decreasing as 0 is increased from.

What should the confidence values be, here? The simplest thing is to assign a count value of n=20, since 20 observations underlie the SVD used to generate the QVD. If we have a personality parameter of k=10, then this gives a confidence c = 20 / (20+10) = .66... so we have

```EvaluationLink <1,.66>
GDP_Predicate
United_States