# PLNBook

Probabilistic Logic Networks:
A New Conceptual, Mathematical and Computational Framework for Uncertain Inference

Ben Goertzel, Matthew Iklé, Izabela Freire Goertzel, Ari Heljakka

Note: Only the first 5 chapters have been wikified. Find the full free pdf here.

# Chapter 1: Introduction

Abstract In this chapter we provide an overview of probabilistic logic networks (PLN), including our motivations for developing PLN and the guiding principles underlying PLN. We discuss foundational choices we made, introduce PLN knowledge representation, and briefly introduce inference rules and truth-values. We also place PLN in context with other approaches to uncertain inference.

## Motivations

This book presents Probabilistic Logic Networks (PLN), a systematic and pragmatic framework for computationally carrying out uncertain reasoning— reasoning about uncertain data, and/or reasoning involving uncertain conclusions. We begin with a few comments about why we believe this is such an interesting and important domain of investigation.

First of all, we hold to a philosophical perspective in which “reasoning”– properly understood– plays a central role in cognitive activity. We realize that other perspectives exist; in particular, logical reasoning is sometimes construed as a special kind of cognition that humans carry out only occasionally, as a deviation from their usual (intuitive, emotional, pragmatic, sensorimotor, etc.) modes of thought. However, we consider this alternative view to be valid only according to a very limited definition of “logic.” Construed properly, we suggest, logical reasoning may be understood as the basic framework underlying all forms of cognition, including those conventionally thought of as illogical and irrational. The key to this kind of extended understanding of logic, we argue, is the formulation of an appropriately general theory of uncertain reasoning – where what is meant by the latter phrase is: reasoning based on uncertain knowledge, and/or reasoning leading to uncertain conclusions (whether from certain or uncertain knowledge). Moving from certain to uncertain reasoning opens up a Pandora’s box of possibilities, including the ability to encompass within logic things such as induction, abduction, analogy and speculation, and reasoning about time and causality.

While not necessarily pertinent to the technical details of PLN, it is perhaps worth noting that the authors’ main focus in exploring uncertain inference has been its pertinence to our broader work on artificial general intelligence (AGI). As elaborated in (Goertzel and Pennachin 2007; Goertzel and Wang 2007; Wang et al 2008), AGI refers to the construction of intelligent systems that can carry out a variety of complex goals in complex environments, based on a rich contextual understanding of themselves, their tasks and their environments. AGI was the original motivating goal of theAI research field, but at the moment it is one among multiple streams of AI research, living alongside other subfields focused on more narrow and specialized problem-solving. One viable approach to achieving powerful AGI, we believe, is to create integrative software systems with uncertain inference at their core. Specifically, PLN has been developed within the context of a larger artificial intelligence project, the Novamente Cognition Engine or NCE (Goertzel 2006), which seeks to achieve general forms of cognition by integrating PLN with several other processes. Recently, the NCE has spawned an open-source sister project called OpenCog, as well (Hart and Goertzel 2008). In the final two chapters we will briefly discuss the implementation of PLN within the NCE, and give a few relevant details of the NCE architecture. However, the vast majority of the discussion of PLN here is independent of the utilization of PLN as a component of the NCE. PLN stands as a conceptual and mathematical construct in its own right, with potential usefulness in a wide variety of AI and AGI applications.

We also feel that the mathematical and conceptual aspects of PLN have the potential to be useful outside the AI context, both as purely mathematical content and as guidance for understanding the nature of probabilistic inference in humans and other natural intelligences. These aspects are not emphasized here but we may address them more thoroughly in future works.

Of course, there is nothing new about the idea that uncertain inference is broadly important and relevant to AI and other domains. Over the past few decades a number of lines of research have been pursued, aimed at formalizing uncertain inference in a manner capable of application across the broad scope of varieties of cognition. PLN incorporates ideas derived from many of these other lines of inquiry, including standard ones like Bayesian probability theory (Jaynes, 2003), fuzzy logic (Zadeh 1989), and less standard ones like the theory of imprecise probabilities (Walley 1991), term logic (Sommers and Englebretsen 2000), Pei Wang’s Non-Axiomatic Reasoning System (NARS) (Wang 1996), and algorithmic information theory (Chaitin 1987). For various reasons, which will come out as the book proceeds, we have found each of these prior attempts (and other ones, from which we have not seen fit to appropriate ideas, some of which we will mention below) unsatisfactory as a holistic approach to uncertain inference or as a guide to the creation of an uncertain inference component for use in integrative AGI systems.

Among the general high-level requirements underlying the development of PLN have been the following:

• To enable uncertainty-savvy versions of all known varieties of logical reasoning; including, for instance, higher-order reasoning involving quantifiers, higher-order functions, and so forth.
• To reduce to crisp “theorem prover” style behavior in the limiting case where uncertainty tends to zero.
• To encompass inductive and abductive as well as deductive reasoning.
• To agree with probability theory in those reasoning cases where probability theory, in its current state of development, provides solutions within reasonable calculational effort based on assumptions that are plausible in the context of real-world embodied software systems.
• To gracefully incorporate heuristics not explicitly based on probability theory, in cases where probability theory, at its current state of development, does not provide adequate pragmatic solutions.
• To provide “scalable” reasoning, in the sense of being able to carry out inferences involving at least billions of premises. Of course, when the number of premises is fewer, more intensive and accurate reasoning may be carried out.
• To easily accept input from, and send input to, natural language processing software systems.

The practical application of PLN is still at an early stage. Based on our evidence so far, however, we have found PLN to fulfill the above requirements adequately well, and our intuition remains that it will be found to do so in general. We stress, however, that PLN is an evolving framework, consisting of a conceptual core fleshed out by a heterogeneous combination of components. As PLN applications continue to be developed, it seems likely that various PLN components will be further refined and perhaps some of them replaced entirely. We have found the current component parts of PLN acceptable for our work so far, but we have also frequently been aware of more sophisticated alternative approaches to various sub-problems (some drawn from the literature, and some our own inventions), and have avoided pursuing many of such due to a desire for initial simplicity.

The overall structure of PLN theory is relatively simple, and may be described as follows. First, PLN involves some important choices regarding knowledge representation, which lead to specific “schematic forms” for logical inference rules. The knowledge representation may be thought of as a definition of a set of “logical term types” and “logical relationship types,” leading to a novel way of graphically modeling bodies of knowledge. It is this graphical interpretation of PLN knowledge representation that led to the “network” part of the name “Probabilistic Logic Networks.” It is worth noting that the networks used to recognize knowledge in PLN are weighted directed hypergraphs (Bollobas 1998) much more general than, for example, the binary directed acyclic graphs used in Bayesian network theory (Pearl 1988).

Next, PLN involves specific mathematical formulas for calculating the probability value of the conclusion of an inference rule based on the probability values of the premises plus (in some cases) appropriate background assumptions. It also involves a particular approach to estimating the confidence values with which these probability values are held (weight of evidence, or second-order uncertainty). Finally, the implementation of PLN in software requires important choices regarding the structural representation of inference rules, and also regarding “inference control” – the strategies required to decide what inferences to do in what order, in each particular practical situation.

### Why Probability Theory?

In the next few sections of this Introduction we review the conceptual foundations of PLN in a little more detail, beginning with the question: Why choose probability theory as a foundation for the “uncertain” part of uncertain inference?

We note that while probability theory is the foundation of PLN, not all aspects of PLN are based strictly on probability theory. The mathematics of probability theory (and its interconnection with other aspects of mathematics) has not yet been developed to the point where it is feasible to use explicitly probabilistic methods to handle every aspect of the uncertain inference process. Some researchers have reacted to this situation by disregarding probability theory altogether and introducing different conceptual foundations for uncertain inference, such as Dempster Shafer theory (Dempster 1968; Shafer 1976), Pei Wang’s Non-Axiomatic Reasoning System (Wang 1996), possibility theory (Zadeh 1978) and fuzzy set theory (Zadeh 1965). Others have reacted by working within a rigidly probabilistic framework, but limiting the scope of their work fairly severely based on the limitations of the available probabilistic mathematics, avoiding venturing into the more ambiguous domain of creating heuristics oriented toward making probabilistic inference more scalable and pragmatically applicable (this, for instance, is how we would characterize the mainstream work in probabilistic logic as summarized in Hailperin 1996; more comments on this below). Finally, a third reaction – and the one PLN embodies – is to create reasoning systems based on a probabilistic foundation and then layer non-probabilistic ideas on top of this foundation when this is the most convenient way to arrive at useful practical results.

Our faith in probability theory as the ultimately “right” way to quantify uncertainty lies not only in the demonstrated practical applications of probability theory to date, but also in Cox’s (1961) axiomatic development of probability theory and ensuing refinements (Hardy 2002), and associated mathematical arguments due to de Finetti (1937) and others. These theorists have shown that if one makes some very basic assumptions about the nature of uncertainty quantification, the rules of elementary probability theory emerge as if by magic. In this section we briefly review these ideas, as they form a key part of the conceptual foundation of the PLN framework.

Cox’s original demonstration involved describing a number of properties that should commonsensically hold for any quantification of the “plausibility” of a proposition, and then showing that these properties imply that plausibility must be a scaled version of conventional probability. The properties he specified are, in particular.[1]

  1 The plausibility of a proposition determines the plausibility of the proposition’s
negation; either decreases as the other increases. Because “a double negative is an
affirmative,” this becomes a functional equation
${\displaystyle f(f(x))=X}$
saying that the function f that maps the probability of a proposition to the probability of the
proposition’s negation is an involution; i.e., it is its own inverse.

  2 The plausibility of the conjunction [A</nowiki> & B] of two propositions A, B, depends only on
the plausibility of B and that of A given that B is true. (From this Cox eventually infers that
multiplication of probabilities is associative, and then that it may as well be ordinary
multiplication of real numbers.) Because of the associative nature of the “and” operation in
propositional logic, this becomes a functional equation saying that the function g such that
${\displaystyle P(A\ {\text{and}}\ B)=g(P(A),P(B\mid A))}$

  3 Suppose [A & B] is equivalent to [C & D]. If we acquire new information A and then acquire
further new information B, and update all probabilities each time, the updated probabilities
will be the same as if we had first acquired new information C and then a</nowiki>cquired
further new information D. In view of the fact that multiplication of probabilities can be
taken to be ordinary multiplication of real numbers, this becomes a functional equation
${\displaystyle yf\left({\frac {f(z)}{y}}\right)=zf\left({\frac {f(y)}{z}}\right)}$
Where f is as above


Cox’s theorem states, in essence, that any measure of plausibility that possesses the above three properties must be a rescaling of standard probability.

While it is impressive that so much (the machinery of probability theory) can be derived from so little (Cox’s very commonsensical assumptions), mathematician Michael Hardy (2002) has expressed the opinion that in fact Cox’s axioms are too strong, and has provided significantly weaker conditions that lead to the same end result as Cox’s three properties. Hardy’s conditions are more abstract and difficult to state without introducing a lot of mathematical mechanism, but essentially he studies mappings from propositions into ordered “plausibility” values, and he shows that if any such mapping obeys the properties of

1. If x implies y then f(x) < f(y)

2. If f(x) < f(y) then f(¬x) > f(¬y), where ¬ represents “not”

3. If f(x|z) <= f(y|z) and f(x|¬z) <= f(y|¬z) then f(x) < f(y)

4. For all x, y either f(x) ${\displaystyle \leq }$ f(y) or f(y) ${\displaystyle \leq }$ f(x)

then it maps propositions into scaled probability values. Note that this property list mixes up absolute probabilities f() with conditional probabilities f(|), but this is not a problem because Hardy considers f(x) as equivalent to f(x|U) where U is the assumed universe of discourse.

Hardy expresses regret that his fourth property is required; however, Youssef’s (1994) work related to Cox’s axioms suggests that it is probably there in his mathematics for a very good conceptual reason. Youssef has shown that it is feasible to drop Cox’s assumption that uncertainty must be quantified using real numbers, but retain Cox’s other assumptions. He shows it is possible, consistent with Cox’s other assumptions, to quantify uncertainty using “numbers” drawn from the complex, quaternion, or octonion number fields. Further, he argues that complex-valued “probabilities” are the right way to model quantum-level phenomena that have not been collapsed (decohered) into classical phenomena. We believe his line of argument is correct and quite possibly profound, yet it does not seem to cast doubt on the position of standard real-valued probability theory as the correct mathematics for reasoning about ordinary, decohered physical systems. If one wishes to reason about the uncertainty existing in pure, pre-decoherence quantum systems or other exotic states of being, then arguably these probability theories defined over different base fields than the real numbers may be applicable.

Next, while we are avid probabilists, we must distinguish ourselves from the most ardent advocates of the “Bayesian” approach to probabilistic inference. We understand the weakness of the traditional approach to statistics with its reliance on often unmotivated assumptions regarding the functional forms of probability distributions. On the other hand, we don’t feel that the solution is to replace these assumptions with other, often unmotivated assumptions about prior probability distributions. Bayes’ rule is an important part of probability theory, but the way that the Bayesian-statistical approach applies it is not always the most useful way. A major example of the shortcomings of the standard Bayesian approach lies in the domain of confidence assessment, an important aspect of PLN already mentioned above. As Wang (2001) has argued in detail, the standard Bayesian approach does not offer any generally viable way to assess or reason about the “second-order uncertainty” involved in a given uncertainty value. Walley (1991) sought to redress this problem via a subtler approach that avoids assuming a single prior distribution, and makes a weaker assumption involving drawing a prior from a parametrized family of possible prior distributions; others have followed up his work in interesting ways (Weichselberger 2003), but this line of research has not yet been developed to the point of yielding robustly applicable mathematics. Within PLN, we introduce a spectrum of approaches to confidence assessment ranging from indefinite probabilities (essentially a hybridization of Walley’s imprecise probabilities with Bayesian credible intervals) to frankly non-probabilistic heuristics inspired partly by Wang’s work. By utilizing this wide range of approaches, PLN can more gracefully assess confidence in diverse settings, providing pragmatic solutions where the Walley-type approach (in spite of its purer probabilism) currently fails.

Though Cox’s theorem and related results argue convincingly that probability theory is the correct approach to reasoning under uncertainty, the particular ways of applying probability theory that have emerged in the contemporary scientific community (such as the “Bayesian approach”) all rely on specific assumptions beyond those embodied in the axioms of probability theory. Some of these assumptions are explicit mathematical ones, and others are implicit assumptions about how to proceed in setting up a given problem in probabilistic terms; for instance, how to translate an intuitive understanding and/or a collection of quantitative data into mathematical probabilistic form.

Notes

1. The following list of properties is paraphrased from the Wikipedia entry for “Cox’s Theorem.”

## PLN in the Context of Traditional Approaches to Probabilistic Logic

So, supposing one buys the notion that logic, adequately broadly construed, is essential (perhaps even central) to cognition; that appropriate integration of uncertainty into logic is an important aspect of construing logic in an adequately broad way; and also that probability theory is the correct foundation for treatment of uncertainty, what then? There is already a fairly well fleshed-out theory of probabilistic logic, so why does one need a substantial body of new theory such as Probabilistic Logic Networks?

The problem is that the traditional theories in the area of probabilistic logic don’t directly provide a set of tools one can use to structure a broadly-applicable, powerful software system for probabilistic inferencing. They provide a number of interesting and important theorems and ideas, but are not sufficiently pragmatic in orientation, and also fail to cover some cognitively key aspects of uncertain inference such as intensional inference.

Halpern’s (2003) book provides a clearly written, reasonably thorough overview of recent theories in probabilistic logic. The early chapters of Hailperin (1996) gives some complementary historical and theoretical background. Alongside other approaches such as possibility theory, Halpern gives an excellent summary of what in PLN terms would be called “first-order extensional probabilistic logic” – the interpretation and manipulation of simple logic formulas involving absolute and conditional probabilities among sets. Shortcomings of this work from a pragmatic AI perspective include:

• No guidance is provided as to which heuristic independence assumptions are most cognitively natural to introduce in order to deal with (the usual) situations where adequate data regarding dependencies is unavailable. Rather, exact probabilistic logic formulas are introduced, into which one can, if one wishes, articulate independence assumptions and then derive their consequences.
• Adequate methods for handling “second order uncertainty” are not presented, but this is critical for dealing with real-world inference situations where available data is incomplete and/or erroneous. Hailperin (1996) deals with this by looking at interval probabilities, but this turns out to rarely be useful in practice because the intervals corresponding to inference results are generally far too wide. Walley’s (1991) imprecise probabilities are more powerful but have a similar weakness, and we will discuss them in more detail in Chapter 4; they also have not been integrated into any sort of powerful, general, probabilistic logic framework, though integrating them into PLN if one wished to do so would not be problematic, as will become clear.
• Various sorts of truth-values are considered, including single values, intervals, and whole probability distributions, but the problem of finding the right way to summarize a probability distribution for logical inference without utilizing too much memory or sacrificing too much information has not been adequately resolved (and this is what we have tried to resolve with the “indefinite probabilities” utilized in PLN).
• The general probabilistic handling of intensional, temporal, and causal inference is not addressed. Of course, these topics are handled in various specialized theories; e.g., Pearl’s causal networks (2000), but there is no general theory of probabilistic intensional, temporal, or causal logic; yet the majority of commonsense logical inference involves these types of reasoning.
• The existing approaches to intermixing probabilistic truth-values with existential and universal quantification are conceptually flawed and often do not yield pragmatically useful results.

All in all, in terms of Halpern’s general formalism for what we call first-order extensional logic, what PLN constitutes is

• A specific compacted representation of sets of probability distributions (the indefinite truth-value)
• A specific way of deploying heuristic independence assumptions; e.g., within the PLN deduction and revision rules
• A way of counting the amount of evidence used in an inference (which is used in the revision rule, which itself uses amount of evidence together with heuristic independence assumptions)

But much of the value of PLN lies in the ease with which it extends beyond first-order extensional logic. Due to the nature of the conceptual and mathematical formalism involved, the same essential inference rules and formulas used for first-order extensional logic are extended far more broadly, to deal with intensional, temporal, and causal logic, and to deal with abstract higher-order inference involving complex predicates, higher-order functions, and universal, existential, and fuzzy quantifiers.

### Why Term Logic?

One of the major ways in which PLN differs from traditional approaches to probabilistic logic (and one of the secrets of PLN’s power) is its reliance on a formulation of logic called “term logic.” The use of term logic is essential, for instance, to PLN’s introduction of cognitively natural independence assumptions and to PLN’s easy extension of first-order extensional inference rules to more general and abstract domains.

Predicate logic and term logic are two different but related forms of logic, each of which can be used both for crisp and uncertain logic. Predicate logic is the most familiar kind, where the basic entity under consideration is the “predicate,” a function that maps argument variables into Boolean truth-values. The argument variables are quantified universally or existentially.

On the other hand, in term logic, which dates back at least to Aristotle and his notion of the syllogism, the basic element is a subject-Predicate statement, denotable

${\displaystyle A\rightarrow B}$

where ${\displaystyle \rightarrow }$ denotes a notion of inheritance or specialization. Logical inferences take the form of “syllogistic rules,” which give patterns for combining statements with matching terms. (We don’t use the ${\displaystyle \rightarrow }$ notation much in PLN, because it’s not sufficiently precise for PLN purposes, since PLN introduces many varieties of inheritance; but we will use the ${\displaystyle \rightarrow }$ notation in this section because here we are speaking about inheritance in term logic in general rather than about PLN in particular.) Examples are the deduction, induction, and abduction rules:

• A${\displaystyle \rightarrow }$B
• B${\displaystyle \rightarrow }$C
• ${\displaystyle \vdash }$
• A${\displaystyle \rightarrow }$C
• A${\displaystyle \rightarrow }$B
• A${\displaystyle \rightarrow }$C
• ${\displaystyle \vdash }$
• B${\displaystyle \rightarrow }$C
• A${\displaystyle \rightarrow }$C
• B${\displaystyle \rightarrow }$C
• ${\displaystyle \vdash }$
• A${\displaystyle \rightarrow }$B

When we get to defining the truth-value formulas corresponding to these inference rules, we will observe that deduction is infallible in the case of absolutely certain premises, but uncertain in the case of probabilistic premises; while abduction and induction are always fallible, even given certain premises. In fact we will derive abduction and induction from the combination of deduction with a simple rule called inversion

${\displaystyle A\rightarrow B}$

${\displaystyle |-}$

${\displaystyle B\rightarrow A}$

whose truth-value formula derives from Bayes rule.

Predicate logic has proved to deal more easily with deduction than with induction, abduction, and other uncertain, fallible inference rules. On the other hand, term logic can deal quite elegantly and simply with all forms of inference. Furthermore, the predicate logic formulation of deduction proves less amenable to “probabilization” than the term logic formulation. It is for these reasons, among others, that the foundation of PLN is drawn from term logic rather than from predicate logic. PLN begins with a term logic foundation, and then adds on elements of probabilistic and combinatory logic, as well as some aspects of predicate logic, to form a complete inference system, tailored for easy integration with software components embodying other (not explicitly logical) aspects of intelligence.

Sommers and Engelbretsen (2000) have given an excellent defense of the value of term logic for crisp logical inference, demonstrating that many pragmatic inferences are far simpler in term logic formalism than they are in predicate logic formalism. On the other hand, the pioneer in the domain of uncertain term logic is Pei Wang (Wang 1996), to whose NARS uncertain term logic based reasoning system PLN owes a considerable debt. To frame the issue in terms of our above discussion of PLN’s relation to traditional probabilistic logic approaches, we may say we have found that the formulation of appropriate heuristics to guide probabilistic inference in cases where adequate dependency information is not available, and appropriate methods to extend first-order extensional inference rules and formulas to handle other sorts of inference, are both significantly easier in a term logic rather than predicate logic context. In these respects, the use of term logic in PLN is roughly a probabilization of the use of term logic in NARS; but of course, there are many deep conceptual and mathematical differences between PLN and NARS, so that the correspondence between the two theories in the end is more historical and theory-structural, rather than a precise correspondence on the level of content.

## PLN Knowledge Representation and Inference Rules

In the next few sections of this Introduction, we review the main topics covered in the book, giving an assemblage of hints as to the material to come. First, Chapter 2 describes the knowledge representation underlying PLN, without yet saying anything specific about the management of numbers quantifying uncertainties. A few tricky issues occur here, meriting conceptual discussion. Even though PLN knowledge representation is not to do with uncertain inference per se, we have found that without getting the knowledge representation right, it is very difficult to define uncertain inference rules in an intuitive way. The biggest influence on PLN’s knowledge representation has been Wang’s NARS framework, but there are also some significant deviations from Wang’s approach.

PLN knowledge representation is conveniently understood according to two dichotomies: extensional vs. intensional, and first-order vs. higher-order. The former is a conceptual (philosophical/cognitive) distinction between logical relationships that treat concepts according to their members versus those that treat concepts according to their properties. In PLN extensional knowledge is treated as more basic, and intensional knowledge is defined in terms of extensional knowledge via the addition of a specific mathematics of intension (somewhat related to information theory). This is different from the standard probabilistic approach, which contains no specific methods for handling intension and also differs from, e.g., Wang’s approach in which intension and extension are treated as completely symmetric, with neither of them being more basic or derivable from the other.

The first-order versus higher-order distinction, on the other hand, is essentially a mathematical one. First-order, extensional PLN is a variant of standard term logic, as originally introduced by Aristotle in his Logic and recently elaborated by theorists such as Wang (1996) and Sommers and Engelbretsen (2000). First-order PLN involves logical relationships between terms representing concepts, such as

Inheritance cat animal
ExtensionalInheritance Pixel_444 Contour_7565


(where the notation is used that R A B denotes a logical relationship of type R between arguments A and B). A typical first-order PLN inference rule is the standard term-logic deduction rule

A${\displaystyle \rightarrow }$B
B${\displaystyle \rightarrow }$C
|-
A${\displaystyle \rightarrow }$C


which in PLN looks like

ExtensionalInheritance A B
ExtensionalInheritance B C
|-
ExtensionalInheritance A C


As well as purely logical relationships, first-order PLN also includes a fuzzy set membership relationship, and specifically addresses the relationship between fuzzy set membership and logical inheritance, which is closely tied to the PLN concept of intension. In the following text we will sometimes use the acronym FOI to refer to PLN First Order Inference.

Higher-order PLN, on the other hand (sometimes called HOI, for Higher Order Inference), has to do with functions and their arguments. Much of higher-order PLN is structurally parallel to first-order PLN; for instance, implication between statements is largely parallel to inheritance between terms. However, a key difference is that most of higher-order PLN involves either variables or higher-order functions (functions taking functions as their arguments). So for instance one might have

ExtensionalImplication
Inheritance $X cat Evaluation eat ($X, mice)


(using the notation that

R
A
B


denotes the logical relationship R applied to the arguments A and B). Here Evaluation is a relationship that holds between a predicate and its argument-list, so that, e.g.,

Evaluation eat (Sylvester, mice)


means that the list (Sylvester, mice) lies within the set of ordered pairs characterizing the eat relationship. The parallel of the first-order extensional deduction rule given above would be a rule

ExtensionalImplication A B
ExtensionalImplication B C
|-
ExtensionalImplication A C


where the difference is that in the higher-order inference case the tokens A, B, and C denote either variable-bearing expressions or higher-order functions. Some higher-order inference rules involve universal or existential quantifiers as well. While first-order PLN adheres closely to the term logic framework, higher-order PLN is better described as a mix of term logic, predicate logic, and combinatory logic. The knowledge representation is kept flexible, as this seems to lead to the simplest and most straightforward set of inference rules.

## Truth-value Formulas

We have cited above the conceptual reasons why we have made PLN a probabilistic inference framework, rather than using one of the other approaches to uncertainty quantification available in the literature. However, though we believe in the value of probabilities we do not believe that the conventional way of using probabilities to represent the truth-values of propositions is adequate for pragmatic computational purposes. One of the less conventional aspects of PLN is the quantification of uncertainty using truth-values that contain at least two components, and usually more (in distinction from the typical truth-value used in probability theory, which is a single number: a probability). Our approach here is related to earlier multi-component truth-value approaches due to Keynes (2004), Wang (2006), Walley (1991), and others, but is unique in its particulars.

The simplest kind of PLN truth-value, called a SimpleTruthValue, consists of a pair of numbers <s,w> called a strength and a confidence. The strength value is a probability; the confidence value is a measure of the amount of certainty attached to the strength value. Confidence values are normalized into [0,1].

For instance <.6,1> means a probability of .6 known with absolute certainty. <.6,.2> means a probability of .6 known with a very low degree of certainty. <.6,0> means a probability of .6 known with a zero degree of certainty, which indicates a meaningless strength value, and is equivalent to <x,0> for any other probability value x.

Another type of truth-value, more commonly used as the default within PLN, is the IndefiniteTruthValue. We introduce the mathematical and philosophical foundations of IndefiniteTruthValues in Chapter 4. Essentially a hybridization of Walley’s imprecise probabilities and Bayesian credible intervals, indefinite probabilities quantify truth-values in terms of four numbers <L, U, b, k>: an interval [L,U], a credibility level b, and an integer k called the “lookahead.” IndefiniteTruthValues provide a natural and general method for calculating the “weight-of-evidence” underlying the conclusions of uncertain inferences. We ardently believe that this approach to uncertainty quantification may be adequate to serve as an ingredient of powerful artificial general intelligence.

Beyond the SimpleTruthValues and IndefiniteTruthValues mentioned above, more advanced types of PLN truth-value also exist, principally “distributional truth-values” in which the strength value is replaced by a matrix approximation to an entire probability. Note that this, then, provides for three different granularities of approximations to an entire probability distribution. A distribution can be most simply approximated by a single number, somewhat better approximated by a probability interval, and even better approximated by an entire matrix.

Chapter 5 takes the various inference rules defined in Chapter 2, and associates a “strength value formula” with each of them (a formula determining the strength of the conclusion based on the strengths of the premises). For example, the deduction rule mentioned above is associated with two strength formulas, one based on an independence assumption and the other based on a different “concept geometry based assumption. The independence-assumption-based deduction strength formula looks like

B <sB> C <sC>
ExtensionalInheritance A B <SAB>
ExtensionalInheritance B C <SBC>
|-
ExtensionalInheritance A C <SAC>
SAC = SAB SBC + (1-SAB)( SC SB SBC ) / (1- SB )


This particular rule is a straightforward consequence of elementary probability theory. Some of the other formulas are equally straightforward, but some are subtler and require heuristic reasoning beyond standard probabilistic tools like independence assumptions. Since simple truth-values are the simplest and least informative of our truth-value types; they provide quick, but less accurate, assessments of the resulting strength and confidence values.

We reconsider these strength formulas again in Chapter 6, extending the rules to IndefiniteTruthValues. We also illustrate in detail how indefinite truth-values provide a natural approach to measuring weight-of-evidence. IndefiniteTruthValues can be thought of as approximations to entire distributions, and so provide an intermediate level of accuracy of strength and confidence.

As shown in Chapter 7, PLN inference formulas may also be modified to handle entire distributional truth-values. Distributional truth-values provide more information than the other truth-value types. As a result, they may also be used to yield even more accurate assessments of strength and confidence.

The sensitivity to error of several inference rule formulas for various parameter values is explored in Chapter 8. There we provide a fairly detailed mathematical and graphical examination of error magnification. We also study the possibility of deterministic chaos arising from PLN inference.

We introduce higher-order inference (HOI) in Chapter 10, where we describe the basic HOI rules and strength formulas for both simple truth-values and indefinite truth-values. We consider both crisp and fuzzy quantifiers, using indefinite probabilities, in Chapter 11; treat intensional inference in Chapter 12; and inference control in Chapter 13. Finally, we tackle the topics of temporal and causal inference in Chapter 14.

## Implementing and Applying PLN

The goal underlying the theoretical development of PLN has been the creation of practical software systems carrying out complex, useful inferences based on uncertain knowledge and drawing uncertain conclusions. Toward that end we have implemented most of the PLN theory described in this book as will briefly be described in Chapter 13, and used this implementation to carry out simple inference experiments involving integration with external software components such as a natural language comprehension engine and a 3D simulation world.

Chapter 14 reviews some extensions made to basic PLN in the context of these practical applications, which enable PLN to handle reasoning about temporal and causal implications. Causal inference in particular turns out to be conceptually interesting, and the approach we have conceived relates causality and intension in a satisfying way.

By far the most difficult aspect of designing a PLN implementation is inference control, which we discuss in Chapter 13. This is really a foundational conceptual issue rather than an implementational matter per se. The PLN framework just tells you what inferences can be drawn; it doesn’t tell you what order to draw them in, in which contexts. Our PLN implementation utilizes the standard modalities of forward-chaining and backward-chaining inference control. However, the vivid presence of uncertainty throughout the PLN system makes these algorithms more challenging to use than in a standard crisp inference context. Put simply, the search trees expand unacceptably fast, so one is almost immediately faced with the need to use clever, experience-based heuristics to perform pruning.

The issue of inference control leads into deep cognitive science issues that we briefly mention here but do not fully explore, because that would lead too far afield from the focus of the book, which is PLN in itself. One key conceptual point that we seek to communicate, however, is that uncertain inference rules and formulas, on their own, do not compose a comprehensive approach to artificial intelligence. To achieve the latter, sophisticated inference control is also required, and controlling uncertain inference is difficult – in practice, we have found, requiring ideas that go beyond the domain of uncertain inference itself. In principle, one could take a purely probability-theoretic approach to inference control – choosing inference steps based on the ones that are most likely to yield successful conclusions based on probabilistic integration of all the available evidence. However, in practice this does not seem feasible given the current state of development of applied probability theory. Instead, in our work with PLN so far, we have taken a heuristic and integrative approach, using other non-explicitly-probabilistic algorithms to help prune the search trees implicit in PLN inference control.

As for applications, we have applied PLN to the output of a natural language processing subsystem, using it to combine premises extracted from different biomedical research abstracts to form conclusions embodying medical knowledge not contained in any of the component abstracts. We have also used PLN to learn rules controlling the behavior of a humanoid agent in a 3D simulation world; for instance, PLN learns to play “fetch” based on simple reinforcement learning stimuli.

Our current research involves extending PLN’s performance in both these areas, and bringing the two areas together by using PLN to help the Novamente Cognition Engine carry out complex simulation-world tasks involving a combination of physical activity and linguistic communication. Quite probably this ongoing research will involve various improvements to be made to the PLN framework itself. Our goal in articulating PLN has not been to present an ultimate itself. Our goal in articulating PLN has not been to present an ultimate and final approach to uncertain inference, but rather to present a workable approach that is suitable for carrying out uncertain inference comprehensively and reasonably well in practical contexts. As probability theory and allied branches of mathematics develop, and as more experience is gained applying PLN in practical contexts, we expect the theory to evolve and improve.

## Relationship of PLN to Other Approaches to Uncertain Inference

Finally, having sketched the broad contours of PLN theory and related it to more traditional approaches to probabilistic logic, we now briefly discuss the relationship between PLN and other approaches to logical inference. First, the debt of PLN to various standard frameworks for crisp logical inference is clear. PLN’s knowledge representation, as will be made clear in Chapter 2, is an opportunistically assembled amalgam of formalisms chosen from term logic, predicate logic and combinatory logic. Rather than seeking a pure and minimal formalism, we have thought more like programming language designers and sought a logical formalism that allows maximally compact and comprehensible representation of a wide variety of useful logical structures.

Regarding uncertainty, as noted above, as well as explicit approaches to the problem of unifying probability and logic the scientific literature contains a number of other relevant ideas, including different ways to quantify uncertainty and to manipulate uncertainty once quantified. There are non-probabilistic methods like fuzzy logic, possibility theory, and NARS. And there is a variety of probabilistic approaches to knowledge representation and reasoning that fall short of being full-on “probabilistic logics,” including the currently popular Bayes nets, which will be discussed in more depth below, and Walley’s theory of imprecise probabilities (Walley 1991), which has led to a significant literature (ISIPTA 2001, 2003, 2005, 2007), and has had a significant inspirational value in the design of PLN’s approach to confidence estimation, as will be reviewed in detail in Chapters 4, 6, and 10.

Overall, regarding the representation of uncertainty, PLN owes the most to Pei Wang’s NARS approach and Walley’s theory of imprecise probabilities. Fuzzy set theory ideas are also utilized in the specific context of the PLN Member relationship. However, we have not found any of these prior approaches to uncertainty quantification to be fully adequate, and so the PLN approach draws from them ample inspiration but not very many mathematical details.

We now review the relationship of PLN to a few specific approaches to uncertainty quantification and probabilistic inference in a little more detail. In all cases the comments given here are high-level and preliminary, and the ideas discussed will be much clearer to the reader after they have read the later chapters of this book and understand PLN more fully.

### PLN and Fuzzy Set Theory

Fuzzy set theory has proved a pragmatically useful approach to quantifying many kinds of relationships (Zadeh 1965, 1978), but we believe that its utility is fundamentally limited. Ultimately, we suggest, the fuzzy set membership degree is not a way of quantifying uncertainty – it is quantifying something else: it is quantifying partial membership. Fuzzy set membership is used in PLN as the semantics of the truth-values of special logical relationship types called Member relationships. These fuzzy Member relationships may be used within PLN inference, but they are not considered the same as logical relationships such as Inheritance or Similarity relationships whose truth-values quantify degrees of uncertainty.

Some (though nowhere near all) of the fuzzy set literature appears to us to be semantically confused regarding the difference between uncertainty and partial membership. In PLN we clearly distinguish between

• Jim belongs to degree .6 to the fuzzy set of tall people. (MemberLink semantics)
• Jim shares .6 of the properties shared by people belonging to the set of tall people (where the different properties may be weighted). (IntensionalInheritanceLink semantics)
• Jim has a .6 chance of being judged as belonging to the set of tall people, once more information about Jim is obtained (where this may be weighted as to the degree of membership that is expected to be estimated once the additional information is obtained). (IntensionalInheritanceLink, aka Subset Link, semantics)
• Jim has an overall .6 amount of tallness, defined as a weighted average of extensional and intensional information. (Inheritance Link semantics)

We suggest that the fuzzy, MemberLink semantics is not that often useful, but do recognize there are cases where it is valuable; e.g., if one wishes to declare that a stepparent and stepchild are family members with fuzzy degree .8 rather than 1.

In terms of the above discussion of the foundations of probability theory we note that partial membership assignments need not obey Cox’s axioms and need not be probabilities – which is fine, as they are doing something different, but also limits the facility with which they can be manipulated. In PLN, intensional probabilities are used for many of the purposes commonly associated with fuzzy membership values, and this has the advantage of keeping more things within a probabilistic framework.

### PLN and NARS

Pei Wang’s NARS approach has already been discussed above and will pop up again here and there throughout the text; furthermore, Appendix A1 presents a comparison of some of the first-order PLN truth-value formulas with corresponding NARS formulas. As already noted, there is a long historical relationship between PLN and NARS; PLN began as part of a collaboration with NARS’s creator Pei Wang as an attempt to create a probabilistic analogue to NARS. PLN long ago diverged from its roots in NARS and has grown in a very different direction, but there remain many similarities. Beneath all the detailed similarities and differences, however, there is a deep and significant difference between the two, which is semantic: PLN’s semantics is probabilistic, whereas NARS’s semantics is intentionally and definitively not.

PLN and NARS have a similar division into first-order versus higher-order inference, and have first-order components that are strictly based on term logic. However, PLN’s higher-order inference introduces predicate and combinatory logic ideas, whereas NARS’s higher-order inference is also purely term logic based. Both PLN and NARS include induction, deduction, and abduction in their first-order components, with identical graphical structures; in PLN, however, induction and abduction are derived from deduction via Bayes rule, whereas in NARS they have their own completely independent truth-value functions. Both PLN and NARS utilize multi-component truth-values, but the semantics of each component is subtly different, as will be reviewed in appropriate points in the text to follow.

### PLN and Bayes Nets

Bayes nets are perhaps the most popular contemporary approach to uncertain inference. Because of this, we here offer a few more detailed comments on the general relationship between PLN and Bayes nets. Of course, the actual relationship is somewhat subtle and will be clear to the reader only after completing the exposition of PLN.

Traditional Bayesian nets assume a tree structure for events, which is unrealistic in general, but in recent years there has been a batch of work on “loopy Bayesian networks” in which standard Bayesian net information propagation is applied to potentially cyclic graphs of conditional probability. Some interesting alternatives to the loopy Bayesian approach have also been proposed, including one that uses a more advanced optimization algorithm within the Bayesian net framework.

Bayes nets don’t really contain anything comparable to the generality of PLN higher-order inference. However, in the grand scheme of things, first-order PLN is not all that tremendously different from loopy Bayesian nets and related schemes. In both cases one is dealing with graphs whose relationships denote conditional probabilities, and in both cases one is using a kind of iterative relaxation method to arrive at a meaningful overall network state.

If one took a forest of loopy Bayes nets with imprecise probabilities, and then added some formalism to interface it with fuzzy, predicate, and combinatory logic, then one might wind up with something reasonably similar to PLN. We have not taken such an approach but have rather followed the path that seemed to us more natural, which was to explicitly shape a probabilistic inference framework based on the requirements that we found important for our work on integrative AI.

There are many ways of embodying probability theory in a set of data structures and algorithms. Bayes nets are just one approach. PLN is another approach and has been designed for a different purpose: to allow basic probabilistic inference to interact with other kinds of inference such as intensional inference, fuzzy inference, and higher-order inference using quantifiers, variables, and combinators. We have found that for the purpose of interfacing basic probabilistic inference with these other sorts of inference, the PLN approach is a lot more convenient than Bayes nets or other more conventional approaches.

Another key conceptual difference has to do with a PLN parameter called the “context.” In terms of probability theory, one can think of a context as a universe of discourse. Rather than attempting to determine a (possibly non-existent) universal probability distribution that has desired properties within each local domain, PLN creates local probability distributions based on local contexts. The context parameter can be set to Universal (everything the system has ever seen), Local (only the information directly involved in a given inference), or many levels in between.

Yet another major conceptual difference is that PLN handles multivariable truth-values. Its minimal truth-value object has two components: strength and weight of evidence. Alternatively, it can use probability distributions (or discrete approximations thereof) as truth-values. This makes a large difference in the handling of various realistic inference situations. For instance, the treatment of “weight of evidence” in PLN is not a purely mathematical issue, but reflects a basic conceptual issue, which is that (unlike most probabilistic methods) PLN does not assume that all probabilities are estimated from the same sample space. It makes this assumption provisionally in some cases, but it doesn’t make it axiomatically and comprehensively.

With the context set to Universal, and with attention restricted to the strength component of truth-values, what we have in PLN-FOI is – speaking conceptually rather than mathematically – a different way of doing the same thing that loopy Bayes networks (BN) and its competitors are trying to do. PLN, loopy BN, and other related methods are all viewable as optimization algorithms trying to relax into a condition giving the “correct probability distribution,” and at some risk of settling into local optima instead. But the ability to use more flexible truth-values, and to use local contexts as appropriate, makes a very substantial difference in practice. This is the kind of difference that becomes extremely apparent when one seeks to integrate probabilistic inference with other cognitive processes. And it’s the kind of difference that is important when trying to extend one’s reasoning system from simple inferences to extremely general higher-order inference – an extension that has succeeded within PLN, but has not been successfully carried out within these other frameworks.

## Toward Pragmatic Probabilistic Inference

Perhaps the best way to sum up the differences between PLN and prior approaches to (crisp or uncertain) inference is to refer back to the list of requirements given toward the start of this Introduction. These requirements are basically oriented toward the need for an approach to uncertain inference that is adequate to serve as the core of a general-purpose cognition process – an approach that can handle any kind of inference effectively, efficiently, and in an uncertainty-savvy way.

Existing approaches to crisp inference are not satisfactory for the purposes of general, pragmatic, real-world cognition, because they don’t handle uncertainty efficiently and gracefully. Of course, one can represent uncertainties in predicate logic – one can represent anything in predicate logic – but representing them in a way that leads to usefully rapid and convenient inference incorporating uncertainties intelligently is another matter.

On the other hand, prior approaches to uncertain inference have universally failed the test of comprehensiveness. Some approaches, such as Bayes nets and fuzzy set theory, are good at what they do but carry out only very limited functions compared to what is necessary to fulfill the inference needs of a general-purpose cognitive engine. Others, such as imprecise probability theory, are elegant and rigorous but are so complex that the mathematics needed to apply them in practical situations has not yet been resolved. Others, such as NARS and Dempster-Shafer theory, appear to us to have fundamental conceptual flaws in spite of their interesting properties. And still others, such as traditional probabilistic logic as summarized by Halpern and Hailperin, fail to provide techniques able to deal with the scale, incompleteness, and erroneousness typifying real-world inference situations.

In sum, we do not propose PLN as an ultimate and perfect uncertain inference framework, only as an adequate one – but we do suggest that, in its adequacy, PLN distinguishes itself from the alternatives currently available. As noted above, we suspect that the particulars of the PLN framework will evolve considerably as PLN is utilized for more and more pragmatic inference tasks, both on its own and within integrative AI systems.

# Chapter 2: Knowledge Representation

Abstract In chapter 2, we review the basic formalism of PLN knowledge representation in a way that is relatively independent of the particularities of PLN truth-value manipulation. Much of this material has nothing explicitly to do with probability theory or uncertainty management; it merely describes a set of conventions for representing logical knowledge. However, we also define some of the elements of PLN truth-value calculation here, insofar as is necessary to define the essential meanings of some of the basic PLN constructs.

## Basic Terminology and Notation

The basic players in PLN knowledge representation are entities called terms and relationships (atomic formulae). The term Atom will refer to any element of the set containing both terms and relationships. The hierarchy of PLN Atoms begins with a finite set S of elementary terms. (In an AI context, these may be taken as referring to atomic perceptions or actions, and mathematical structures.) The set of ordered and unordered subsets of S is then constructed, and its elements are also considered as terms. Relationships are then defined as tuples of terms, and higher-order relationships are defined as predicates or functions acting on terms or relationships.

Atoms are associated with various data items, including

• Labels indicating type; e.g., a term may be a Concept term or a Number term; a relationship may be an Inheritance relationship or a Member relationship
• Packages of numbers representing “truth-value” (more on that later)
• In some cases, Atom-type-specific data (e.g., Number terms are associated with numbers; Word terms are associated with character strings)

We will sometimes refer to uncertain truth-values here in a completely abstract way, via notation such as <t>. However, we will also use some specific truth-value types in a concrete way:

• “strength” truth-values, which consist of single numbers; e.g., <s> or <.8>. Usually strength values denote probabilities but this is not always the case. The letter s will be habitually used to denote strength values.
• SimpleTruthValues, which consist of pairs of numbers. These pairs come in two forms:
1. the default, <s,w>, where s is a strength and w is a “weight of evidence” – the latter being a number in [0,1] that tells you, qualitatively, how much you should believe the strength estimate. The letter w will habitually be used to denote weight of evidence values.
2. <s,N>, where N is a “count” – a positive number telling you, qualitatively, the total amount of evidence that was evaluated in order to assess s. There is a heuristic formula interrelating w and N, w=N/(N+k) where k is an adjustable parameter. The letter N will habitually be used to denote count. If the count version rather than the weight of evidence version is being used, this will be explicitly indicated, as the former version is the default.
• IndefiniteTruthValues, which quantify truth-values in terms of four numbers ${\displaystyle <[''L'',''U''],''b'',''k''>}$, an interval ${\displaystyle [''L'',''U'']}$, a credibility level b, and an integer k called the lookahead. While the semantics of IndefiniteTruthValues are fairly complex, roughly speaking they quantify the idea that after k more observations there is a probability b that the conclusion of the inference will appear to lie in the final interval${\displaystyle [''L'',''U'']}$. The value of the integer k will often be considered a system-wide constant. In this case, IndefiniteTruthValues will be characterized more simply via the three numbers${\displaystyle <[''L'',''U''],''b''>}$.
• DistributionalTruthValues, which are discretized approximations to entire probability distributions. When using DistributionalTruthValues, PLN deduction reduces simply to matrix multiplication, and PLN inversion reduces to matrix inversion.[1]

The semantics of these truth-values will be reviewed in more depth in later chapters, but the basic gist may be intuitable from the above brief comments. PLN inference rules are associated with particular types of terms and relationships; for example, the deduction rule mentioned in the Introduction is associated with ExtensionalInheritance and Inheritance relationships. At the highest level we may divide the set of PLN relationships into the following categories, each of which corresponds to a set of different particular relationship types:

• Fuzzy membership (the Member relationship)
• First-order logical relationships
• Higher-order logical relationships
• Containers (lists and sets)
• Function execution (the ExecutionOutput relationship)

To denote a relationship of type R, between Atoms A and B, with truth-value t, we write

R A B <t>

If A and B have long names, we may use the alternate notation

R <t>
A
B


which lends itself to visually comprehensible nesting; e.g.,

R <t>
A
R1
C
D


Similarly, to denote a term A with truth-value t, we write

A <t>

For example, to say that A inherits from B with probability .8, we write

Inheritance A B <.8>

To say that A inherits from B with IndefiniteTruthValue represented by <[.8,.9], .95>, we write

Inheritance A B <[.8,.9],.95>

(roughly, as noted above, the [.8, .9] interval represents an interval probability and the .95 represents a credibility level).

We will also sometimes use object-field notation for truth-value elements, obtaining, for example, the strength value object associated with an Atom

(Inheritance A B).strength = [.8,.9]

or the entire truth-value, using .tv

(Inheritance A B).tv = <[.8,.9], .9, 20>.

Finally, we will sometimes use a semi-natural-language notation, which will be introduced a little later on, when we first get into constructs of sufficient complexity to require such a notation.

Notes

1. We have so far developed two flavors of DistributionalTruthValues, namely StepFunctionTruthValues and PolynomialTruthValues.

## Context

PLN TruthValues are defined relative to a Context. The default Context is the entire universe, but this is not usually a very useful Context to consider. For instance, many terms may be thought of as denoting categories; in this case, the strength of a term in a Context denotes the probability that an arbitrary entity in the Context is a member of the category denoted by the term.

Contextual relationships are denoted by Context relationships, introduced in Chapter 10 The semantics of

Context
C
R A B <t>


is simply

R (A AND C) (B AND C) <t>

Most of the discussion in following chapters will be carried out without explicit discussion of the role of context, and yet due to the above equivalence the conclusions will be usable in the context of contextual inference.

## Fuzzy Set Membership

As a necessary preliminary for discussing the PLN logical relationships, we now turn to the Member relationship. The relationship

Member A B <t>


spans two terms where the target B cannot be an atomic term or a relationship, but must be a term denoting a set (be it a set of atomic terms, a set of composite terms, a set of relationships, etc.). In essence, the Member relationship of PLN is the familiar “fuzzy set membership” (Zadeh 1989). For instance, we may say

Member Ben Goertzel_family <1>
Member Tochtli Goertzel_family <.5>


(Tochtli is the dog of a Goertzel family member.) When a Member relationship has a value between 0 and 1, as in the latter example, it is interpreted as a fuzzy value rather than a probabilistic value. PLN is compatible with many different algebras for combining fuzzy truth-values, including the standard min and max operators according to which if

Member Ben A <r>
Member Ben B <s>


then

Member Ben (A OR B) <max(r,s)>
Member Ben (A AND B) <min(r,s)>


When to use fuzzy set membership versus probabilistic inheritance is a somewhat subtle issue that will be discussed later on. For instance, the fuzzy set community is fond of constructs such as

Member Ben tall <.75>


which indicates that Ben is somewhat tall. But, while this is a correct PLN construct, it is also interesting in PLN to say

IntensionalInheritance Ben tall <.75>


which states that (roughly speaking) Ben shares .75 of the characteristic properties of tall things. This representation allows some useful inferences that the Member relationship does not; for instance, inheritance relationships are probabilistically transitive whereas Member relationships come without any comparably useful uncertain transitivity algebra. (As a parenthetical note, both of these are actually bad examples; they should really be

Context
People
Member Ben tall <.75>
Context
People
IntensionalInheritance Ben tall <.75>


because Ben’s .75 tallness, however you define it, is not meaningful in comparison to the standard of the universe, but only in comparison to the standard of humans.)

We extend this notion of fuzzy set membership to other truth-value types as well. For instance, using IndefiniteTruthValues

MemberLink Astro Jetson_family <[.8,1],.95,2>


would mean that after 2 more observations of Astro the assessed fuzzy membership value for

MemberLink Astro Jetson_family


would lie within [.8,1] with confidence .95.

## First-Order Logical Relationships

In this section, we begin our review of the PLN first-order logical relationship types, which are the following:

• Relationships representing first-order conditional probabilities:
• Subset (extensional)
• Inheritance (mixed)
• IntensionalInheritance (intensional)
• Relationships representing symmetrized first-order conditional probabilities:
• ExtensionalSimilarity (extensional)
• Similarity (mixed)
• IntensionalSimilarity (intensional)
• Relationships representing higher-order conditional probabilities:
• ExtensionalImplication
• Implication (mixed)
• IntensionalImplication
• Relationships representing symmetrized higher-order conditional probabilities:
• ExtensionalEquivalence
• Equivalence (mixed)
• IntensionalInheritance

The semantics of the higher-order logical relationships will be described briefly in Section 2.6 of this chapter and in more depth later on. The truth-value formulas for inference on these higher-order relationships are the same as those for the corresponding first-order relationships. PLN-HOI (higher-order inference) also involves a number of other relationships, such as Boolean operators (AND, OR and NOT), the SatisfyingSet operator, and an infinite spectrum of quantifiers spanning the range from ForAll to ThereExists.

### The Semantics of Inheritance

We now explain in detail the semantics of the key PLN relationship type, Inheritance. Since inheritance in PLN represents the synthesis of extensional and intensional information, we will begin by considering extensional and intensional inheritance in their pure forms.

#### Subset Relationships

Firstly, a Subset relationship represents a probabilistic subset relationship; i.e., purely extensional inheritance. If we have

Subset A B <s>

(where s, a strength value, is a single number in [0,1]) where A and B are two terms denoting sets, this means

${\displaystyle P(B|A)=s}$

or more precisely

${\displaystyle P(X\;{\text{in}}\;B|X\;{\text{in}}\;A)=s}$

If “in” is defined in terms of crisp Member relationships (with strength in each case either 0 or 1) this means

${\displaystyle P({\text{Member}}\;X\;B<1>|{\text{Member}}\;X\;A<1>)=s}$

On the other hand, if “in” is defined in terms of fuzzy Member relationships then one must define s as

${\displaystyle {\frac {\sum \limits _{X}f(({\text{Member}}\;X\;B).{\text{strenght}},\;({\text{Member}}\;X\;A).{\text{strength}})}{\sum \limits _{X}({\text{Member}}\;X\;A).{\text{strenght}}}}}$

where f(x,y) denotes the fuzzy set intersection function. Two options in this regard are

${\displaystyle f(x,y)=min(x,y)}$

${\displaystyle f(x,y)=x*y}$

In our current practical work with PLN we’re using the min function.

As before, we treat other truth-value types in an analogous manner. For example, we interpret

Subset A B <[L,U] 0.9, 20>


as ${\displaystyle P(x\;{\text{in}}\;B|\;x\;{\text{in}}\;A)\in [L,\;U]}$with confidence 0.9 after 20 more observations, where “in” is defined in terms of either crisp or fuzzy Member relationships as above.

#### Intensional Inheritance

The Subset relationship is what we call an extensional relationship— it relates two sets according to their members. PLN also deals with intensional relationships— relationships that relate sets according to the patterns that are associated with them. The mathematics of intensionality will be given in a later chapter, but here we will review the conceptual fundamentals.

First, we review the general notions of intension and extension. These have been defined in various ways by various logical philosophers, but the essential concepts are simple. The distinction is very similar to that between a word’s denotation and its connotation. For instance, consider the concept “bachelor.” The extension of “bachelor” is typically taken to be all and only the bachelors in the world (a very large set). In practical terms, it means all bachelors that are known to a given reasoning system, or specifically hypothesized by that system. On the other hand, the intension of “bachelor” is the set of properties of “bachelor,” including principally the property of being a man, and the property of being unmarried.

Some theorists would have it that the intension of “bachelor” consists solely of these two properties, which are “necessary and sufficient conditions” for bachelorhood; PLN’s notion of intension is more flexible, and it may include necessary and sufficient conditions but also other properties, such as the fact that most bachelors have legs, that they frequently eat in restaurants, etc. These other properties allow us to understand how the concept of “bachelor” might be stretched in

some contexts; for instance, if one read the sentence “Jane Smith was more of a bachelor than any of the men in her apartment building,” one could make a lot more sense of it using the concept “bachelors’” full PLN intension than one could make using only the necessary-and-sufficient-condition intension.

The essential idea underlying PLN’s treatment of intension is to associate both fish and whale with sets of properties, which are formalized as sets of patterns – fishPAT and whalePAT, the sets of patterns associated with fish and whales. We then interpret

IntensionalInheritance whale fish <.7>


as

Subset whalePAT fishPAT <.7>


We then define Inheritance proper as the disjunction of intensional and extensional (subset) inheritance; i.e.,

Inheritance A B <tv>


is defined as

OR <tv>
Subset A B
IntensionalInheritance A B


The nature of reasoning on Inheritance and IntensionalInheritance relationships will be reviewed in Chapter 12; prior to that we will use Subset and related extensional relationships in most of our examples.

Why do we think intensional relationships are worth introducing into PLN? This is a cognitive science rather than a mathematical question. We hypothesize that most human inference is done not using subset relationships, but rather using composite Inheritance relationships.

And, consistent with this claim, we suggest that in most cases the natural language relation “is a” should be interpreted as an Inheritance relation between individuals and sets of individuals, or between sets of individuals – not as a Subset relationship. For instance,

“Stripedog is a cat”


as conventionally interpreted is a combination extensional/intensional statement, as is

“Cats are animals.”


This statement means not only that examples of cats are examples of animals, but also that patterns in cats tend to be patterns in animals.

The idea that inheritance and implication in human language and cognition mix up intension and extension is not an original one— for example, it has been argued for extensively and convincingly by Pei Wang in his writings on NARS. However, embodying this conceptual insight Wang has outlined a different mathematics thatwe find awkward because it manages uncertainty in a non-probabilistic way. His approach seems to us to contradict the common sense embodied in Cox’s axioms, and also to lead to counterintuitive results in many practical cases. On the other hand, our approach is consistent with probability theory but introduces measures of association and pattern-intensity as additional concepts, and integrates them into the overall probabilistic framework of PLN.

Philosophically, one may ask why a pattern-based approach to intensional inference makes sense. Why, in accordance with Cox’s axioms, isn’t straightforward probability theory enough? The problem is – to wax semi-poetic for a moment – that the universe we live in is a special place, and accurately reasoning about it requires making special assumptions that are very difficult and computationally expensive to explicitly encode into probability theory. One special aspect of our world is what Charles Peirce referred to as “the tendency to take habits” (Peirce,1931-1958): the fact that “patterns tend to spread”; i.e., if two things are somehow related to each other, the odds are that there are a bunch of other patterns relating the two things. To encode this tendency observed by Peirce in probabilistic reasoning one must calculate P(A|B) in each case based on looking at the number of other conditional probabilities that are related to it via various patterns. But this is exactly what intensional inference, as defined in PLN, does. This philosophical explanation may seem somewhat abstruse – until one realizes how closely it ties in with human commonsense inference and with the notion of inheritance as utilized in natural language. Much more is said on this topic in Goertzel (2006).

#### Symmetric Logical Relationships

Inheritance is an asymmetric relationship; one may also define a corresponding symmetric relationship. Specifically one may conceptualize three such relationships, corresponding to Subset, IntensionalInheritance, and Inheritance:

ExtensionalSimilarity A B <tv>
tv.s = purely extensional estimate of
P(x ${\displaystyle \in }$ B & x ${\displaystyle \in }$ A | x ${\displaystyle \in }$ A OR x ${\displaystyle \in }$ B)

IntensionalSimilarity A B <tv>
tv.s = purely intensional estimate of
P(x ${\displaystyle \in }$ B & x ${\displaystyle \in }$ A | x ${\displaystyle \in }$ A OR x ${\displaystyle \in }$ B)

Similarity A B <tv>
tv.s = intensional/extensional estimate of
P(x ${\displaystyle \in }$ B ${\displaystyle \in }$ A | x ${\displaystyle \in }$ A OR x ${\displaystyle \in }$ B)


In each of these conceptual formulas, ${\displaystyle \in }$ denotes respectively for each case SubSet, IntensionalInheritance and (mixed) Inheritance

Elementary probability theory allows us to create truth-value formulas for these symmetric logical relationships from the truth-value formulas for the corresponding asymmetric ones. Therefore we will not say much about these symmetric relationships in this book; yet in practical commonsense reasoning they are very common.

## Term Truth-Values

We have discussed the truth-values of first-order logical relationships. Now we turn to a related topic, the truth-values of PLN terms. Compared to relationship truth-values, term truth-values are mathematically simpler but conceptually no less subtle.

Most simply, the truth-value of an entity A may be interpreted as the truth value of a certain Subset relationship:

A <tv>

Means

Subset Universe A <tv>


That is, the A.tv denotes the percentage of the “universe” that falls into category A. This is simple enough mathematically, but the question is: what is this “universe”? It doesn’t have to be the actual physical universe, it can actually be any set, considered as the “universal set” (in the sense of probability theory) for a collection of inferences. In effect, then, what we’ve called the Universe is really a kind of “implicit context.” This interpretation will become clear in Section 2.2 and Chapter 10 when we explicitly discuss contextual inference.

Sometimes one specifically wants to do inference within a narrow local context. Other times, one wants to do inference relative to the universe as a whole, and it’s in this case that things get tricky. In fact, this is one of the main issues that caused Pei Wang, the creator of the non-probabilistic NARS system that partially inspired PLN, to declare probability theory an unsound foundation for modeling human inference or designing computational inference systems. His NARS inference framework is not probability-theory based and hence does not require the positing of a universal set U.

Our attitude is not to abandon probability theory because of its U-dependence, but rather to explicitly acknowledge that probabilistic inference is context-dependent, and to acknowledge that context selection for inference is an important aspect of cognition. When a mind wants to apply probabilistic reasoning, nothing tells it a priori how to set this particular parameter (the size of the universal set, |U|), which makes a big difference in the results that reasoning gives. Rather, we believe, the context for an inference must generally be determined by non-inferential cognitive processes, aided by appropriate inference rules.

There are two extreme cases: Universal and Local context. In the Universal case, pragmatically speaking, the set U is set equal to everything the system has ever seen or heard of. (This may of course be construed in various different ways in practical AI systems.) In the Local case, the set U is set equal to the union of the premises involved in a given inference, with nothing else acknowledged at all. According to the algebra of the PLN deduction rule, as will be elaborated below, local contexts tend to support more speculative inferences, whereas in the Universal context only the best supported of inferences come out with nontrivial strength..

In some cases, the context for one probabilistic inference may be figured out by another, separate, probabilistic inference process. This can’t be a universal solution to the problem, however, because it would lead to an infinite regress. Ultimately one has to bottom out the regress, either by assuming a universal set U is given a priori via “hard wiring” (perhaps a hard-wired function that sets U adaptively based on experience) or by positing that U is determined by non-inferential processes.

If one wants to choose a single all-purpose U, one has to err on the side of inclusiveness. For instance, |U| can be set to the sum of the counts of all Atoms in the system. Or it can be set to a multiple of this, to account for the fact that the system cannot afford to explicitly represent all the entities it knows indirectly to exist.

It is not always optimal to choose a universal and maximal context size, however. Sometimes one wants to carry out inference that is specifically restricted to a certain context, and in that case choosing a smaller U is necessary in order to get useful results. For instance, if one is in the USA and is reasoning about the price of furniture, one may wish to reason only in the context of the USA, ignoring all information about the rest of the world.

Later on we will describe the best approach we have conceived for defining U in practice, which is based on an equation called the “optimal universe size formula.” This approach assumes that one has defined a set of terms that one wants to consider as a context (e.g., the set of terms pertaining to events or entities in the USA, or properties of the USA). One also must assume that for some terms A, B and C in this context-set, one has information about the triple-intersections P(A ${\displaystyle \cap }$ B ${\displaystyle \cap }$ C). Given these assumptions, a formula may be derived that yields the U-value that is optimal, in the sense of giving the minimum error for PLN deduction in that context. Note that some arbitrariness is still left here; one must somewhere obtain the context definition; e.g., decide that it’s intelligent to define U relative to the United States of America, or relative to the entire system’s entire experience, etc.

This formula for deriving the value of U is based on values called “count values”, representing numbers of observations underlying truth value estimates, and closely related to the confidence components of truth values. This means that the challenge of defining U ultimately bottoms out in the problem of count/confidence updating. In an integrative AI architecture, for example, two sorts of processes may be used for updating the confidence components of Atoms’ TruthValues. Inference can be used to modify count values as well as strength values, which covers the case where entities are inferred to exist rather than observed to exist. And in an architecture incorporating natural language processing, one can utilize “semantic mapping schemata,” which translate perceived linguistic utterances into sets of Atoms, and which may explicitly update the confidence components of truth values.

To take a crude example, if a sentence says “I am very sure cats are only slightly friendly”, this translates into a truth value with a low strength and high confidence attached to the Atoms representing the statement “cats are friendly.” An important question there is: What process learns these cognitive schemata carrying out semantic mapping? If they are learned by probabilistic inference, then they must be learned within some universal set U. The pragmatic answer we have settled on in our own work is that inference applied to schema learning basically occurs using a local context, in which the schema known to the system are assumed to be all there are. Some of these schemata learned with a local context are then used to manipulate the count variables of other Atoms, thus creating a larger context for other applications of inference within the system.

In our practical applications of PLN, we have found it is not that often that the most universal U known to the system is used. More often than not, inference involves some relatively localized context. For example, if the system is reasoning about objects on Earth it should use a U relativized to Earth’s surface, rather than using the U it has inferred for the entire physical universe. P(air) is very small in the context of the whole physical universe, but much larger in the context of the Earth. Every time the inference system is invoked it must assume a certain context size |U|, and there are no rigid mathematical rules for doing this. Rather, this parameter-setting task is a job for cognitive schema, which are learned by a host of processes in conjunction, including inference conducted with respect to the implicit local context generically associated with schema learning.

## Higher-Order Logical Relationships

The first-order logical relationships reviewed above are all relationships between basic terms. But the same sorts of probabilistic logical relationships may be seen to hold between more complex expressions involving variables (or variable-equivalent constructs like SatisfyingSets). ExtensionalImplication, for example, is a standard logical implication between predicates. In PLN-HOI we have the notion of a predicate similar to standard predicate logic as a function that maps arguments into truth-values. We have an Evaluation relationship so that, e.g., for the predicate isMale,

Evaluation isMale Ben_Goertzel <1>
Evaluation isMale Izabela_Goertzel <0>
Evaluation isMale Hermaphroditus <0.5>


So if we have the relationship

ExtensionalImplication isMale hasPenis <.99>


this means

isMale($X) implies hasPenis($X) <.99>


or in other words

ExtensionalImplication <.99>
Evaluation isMale $X Evaluation hasPenis$X


or

P( Evaluation hasPenis $X | Evaluation isMale$X) =.99


Note that we have introduced a new notational convention here: the names of variables that are arguments of Predicates are preceded by the $sign. This convention will be used throughout the book. Regarding the treatment of the non-crisp truth-values of the Evaluation relationships, the same considerations apply here as with Subset and Member relationships. Essentially we are doing fuzzy set theory here and may use the min(,) function between the Evaluation relationship strengths. As this example indicates, the semantics of higher-order PLN relationships thus basically boils down to the semantics of first-order PLN relationships. To make this observation formal we must introduce the SatisfyingSet operator. We define the SatisfyingSet of a Predicate as follows: the SatisfyingSet of a Predicate is the set whose members are the elements that satisfy the Predicate. Formally, that is: S = SatisfyingSet P  means (Member$X S).tv = (Evaluation P $X).tv  In PLN, generally speaking, one must consider not only Predicates that are explicitly embodied in Predicate objects, but also Predicates defined implicitly by relationship types; e.g., predicates like P($x) = Inheritance $x A  This means that relationships between relationships may be considered as a special case of relationships between predicates. In any case, given an individual Predicate h we can construct SatisfyingSet(h), and we can create an average over a whole set of Predicates h, B ${\displaystyle \in }$ SatisfyingSet(h) ${\displaystyle \rightarrow }$ A ${\displaystyle \in }$ SatisfyingSet(h) Thus, information about h(x) for various x ${\displaystyle \in }$ B and x ${\displaystyle \in }$ A and various Predicates h can be used to estimate the strengths of subset relationships between sets. Also note that in dealing with SatisfyingSets we will often have use for the Predicate NonEmpty, which returns 1 if its argument is nonempty and 0 if its argument is the empty set. For instance, Evaluation NonEmpty (SatisfyingSet (eats_bugs)) <1>  means that indeed, there is somebody or something out there who eats bugs. The main point of SatisfyingSets is that we can use them to map from higher-order into first-order. A SatisfyingSet maps Evaluation relationships into Member relationships, and hence has the side effect of mapping higher-order relations into ordinary first-order relations between sets. In other words, by introducing this one higher-order relationship (SatisfyingSet) as a primitive we can automatically get all other higher-order relationships as consequences. So using SatisfyingSets we don’t need to introduce special higher-order relationships into PLN at all. However, it turns out to be convenient to introduce them anyway, even though they are “just” shorthand for expressions using SatisfyingSets and first-order logical relationships. To understand the reduction of higher-order relations to first-order relations using SatisfyingSets, let R1 and R2 denote two (potentially different) relationship types and let X denote an Atom-valued variable, potentially restricted to some subclass of Atoms such as a particular term or relationship type. For example, we may construct the following higher-order relationship types:  ExtensionalImplication R1 A X R2 B X  equals Subset SatisfyingSet(R1 A X) SatisfyingSet(R2B X)  Implication R1 A X R2 B X  equals Inheritance SatisfyingSet(R1 A X) SatisfyingSet(R2 B X)  ExtensionalEquivalence R1 A X R2 B X  equals ExtensionalSimilarity SatisfyingSet(R1 A X) SatisfyingSet(R2 B X)  Equivalence R1 A X R2 B X  equals Similarity SatisfyingSet(R1 A X) SatisfyingSet(R2 B X)  Higher-order purely intensional symmetric and asymmetric logical relationships are omitted in the table, but may be defined analogously. To illustrate how these higher-order relations work, consider an example higher-order relationship, expressed in first-order logic notation as ${\displaystyle \forall }$ X (Member Ben X) ${\displaystyle \rightarrow }$(Inheritance scientists X) This comes out in PLN notation as: ExtensionalImplication Member Ben X Inheritance scientists X  (“if Ben is a member of the group X, then X must contain scientists”), or equivalently Subset SatisfyingSet (Member Ben) SatisfyingSet (Inheritance scientists)  (“the set of groups X that satisfies the constraint ‘MemberRelationship Ben X’ is a subset of the set of groups X that satisfies the constraint ‘Inheritance scientists X.’”) While the above examples have concerned single-variable relationships, the same concepts and formalism work for the multiple-variable case, via the mechanism of using a single list-valued variable to contain a list of component variables. ## N-ary Logical Relationships The next representational issue we will address here has to do with relationships that have more than one argument. We don’t just want to be able to say that ''cat inherits from animal, we want to be able to say that cats eat mice, that flies give diseases to people, and so forth. We want to express complex n-ary relations, and then reason on them. In PLN there are two ways to express an n-ary relation: using list relationships, and using higher-order functions. Each has its strengths and weaknesses, so the two are used in parallel. For instance, the list approach is often more natural for inferences using n-ary relationships between simple terms, whereas the higher-order function approach is often more natural for many aspects of inference involving complex Predicates. ### The List/Set Approach The List approach to representing n-ary relations is very simple. An n-ary relation f with the arguments x1, …, xn is represented as a Predicate, where Evaluation f (x1, …, xn)  So for instance, the relationship “Ben kicks Ken” becomes roughly Evaluation kicks (Ben, Ken)  This doesn’t take the temporal aspect of the statement into account, but we will ignore that for the moment (the issue will be taken up later on). In some cases one has a relationship that is symmetric with respect to its arguments. One way to represent this is to use a Set object for arguments. For instance, to say “A fuses with B” we may say Evaluation fuse {A, B}  where {A, B} is a Set. This kind of representation is particularly useful when one is dealing with a relationship with a large number of arguments, as often occurs with the processing of perceptual data. ### Curried Functions Another way to represent n-ary functions – in the spirit of Haskell rather than LISP – is using function currying. For example, a different representation of “Ben kicks Ken” is Evaluation (kick Ben) Ken  where in this case the interpretation is that (kick Ben) is a function that outputs a Predicate function that tells whether its argument is kicked by Ben or not. Strictly, of course, the kick in this example is not the same as the kick in the argument list example; a more correct notation would be, for instance, Evaluation kick_List (Ben, Ken) Evaluation (kick_curry Ben) Ken  In a practical PLN system these two functions will have to be represented by different predicates, with the equivalence relation. In a practical PLN system these two functions will have to be represented by different predicates, with the equivalence relation Equivalence Evaluation (kick_curry$x) $y Evaluation kick_List ($x, $y)  and/or one or more of the listification relations kick_List = listify kick_curry kick_curry = unlistify kick_List  stored in the system to allow conversion back and forth. Another representation is then Evaluation (kick_curry_2 Ken) Ben  which corresponds intuitively to the passive voice “Ken was kicked by Ben.” We then have the conceptual equivalences kick_curry = “kicks” kick_curry_2 = “is kicked by”  Note that the relation between kick_curry and kick_curry_2 is trivially representable using the C_combinator (note that in this book we use the notational convention that combinators are underlined) by kick_curry = C_kick_curry_2 Mathematical properties of Predicates are easily expressed in this notation. For instance, to say that the Predicate fuse is symmetric we need only use the higher-order relationship EquivalenceRelationship fuse C_fuse  or we could simply say Inheritance fuse symmetric  where the Predicate symmetric is defined by ExtensionalEquivalence Evaluation symmetric$X
Equivalence
$X C$X


# Chapter 3: Experiential Semantics

Abstract Chapter 3 is a brief chapter in which we discuss the conceptual interpretation of the terms used in PLN, according to the scheme we have deployed when utilizing PLN to carry out inferences regarding the experience of an embodied agent in a simulated world. This is what we call “experiential semantics.” The PLN mathematics may also be applied using different semantic assumptions, for instance in a logical theorem-proving context. But the development of PLN has been carried out primarily in the context of experiential semantics, and that will be our focus here.

## Introduction

Most of the material in this book is mathematical and formal rather than philosophical in nature. Ultimately, however, the mathematics of uncertain logic is only useful when incorporated into a practical context involving non-logical as well as logical aspects; and the integration of logic with non-logic necessarily requires the conceptual interpretation of logical entities.

The basic idea of experiential semantics is that the interpretation of PLN terms and relationships should almost always be made in terms of the observations made by a specific system in interacting with the world. (Some of these observations may, of course, be observations of the system itself.) The numerous examples given in later (and prior) sections regarding “individuals” such as people, cats and so forth, can’t really be properly interpreted without attention to this fact, and in particular to the experiential semantics of individuals to be presented here. What makes PLN’s experience-based semantics subtle, however, is that there are many PLN terms and relationships that don’t refer directly to anything in the world the PLN reasoning system is observing. But even for the most abstract relationships and concepts expressed in PLN, the semantics must ultimately be grounded in observations.

## Semantics of Observations

In experiential semantics, we are considering PLN as a reasoning system intended for usage by an embodied AI agent: one with perceptions and actions as well as cognitions. The first step toward concretizing this perspective is to define what we mean by observations.

While PLN is mostly about statements with probabilistic truth values, at the most basic semantic level it begins with Boolean truth-valued statements. We call these basic Boolean truth-valued statements “elementary observations.” Elementary observations may be positive or negative. A positive observation is one that occurred; a negative observation is one that did not occur. Elementary observations have the property of unity: each positive elementary observation occurs once. This occurrence may be effectively instantaneous or it may be spread over a long period of time. But each elementary observation occurs once and is associated with one particular set of time-points.

In the experiential-semantics approach, PLN’s probabilistic statements may all ultimately be interpreted as statements about sets of elementary observations. In set-theoretic language, these elementary observations are “atoms” and PLN Concept terms are sets built up from these “atoms.” However we won’t use that terminology much here, since in PLN the term Atom is used to refer to any PLN term or relationship. Instead, we will call these “elementary terms.” In an experiential-semantics approach to PLN, these are the basic terms out of which the other PLN terms and relationships are built.

For example, in the context of an AI system with a camera eye sensor, an elementary observation might be the observation A defined by

A = "the color of the pixel at location (100, 105) is blue at time 2:13:22PM on Tuesda January 6, 2004."

Each elementary observation may be said to contain a certain number of bits of information; for instance, an observation of a color pixel contains more bits of information than an observation of a black-and-white pixel.

### Inference on Elementary Observations

Having defined elementary observations, one may wish to draw implication relationships between them. For instance, if we define the elementary observations

''B = "blue was observed by me at time 2:13:22PM on Tuesday January 6, 2004"

C = "blue was observed by me on Tuesday January 6, 2004"

then we may observe that

A implies B

B implies C

A implies C

However, the semantics of this sort of implication is somewhat subtle. Since each elementary observation occurs only once, there is no statistical basis on which to create implications between them. To interpret implications between elementary observations, one has to look across multiple “possible universes,” and observe that for instance, in any possible universe in which A holds, C also holds. This is a valid form of implication, but it’s a subtle one and occurs as a later development in PLN semantics, rather than at a foundational level.

### Inference on Sets of Elementary Observations

Next, sets of elementary observations may be formed and their unions and intersections may be found, and on this basis probabilistic logical relationships between these sets may be constructed. For instance, if

X = the set of all observations of dark blue Y = the set of all observations of blue

then it’s easy to assign values to ${\displaystyle P(X|Y)}$ and ${\displaystyle P(Y|X)}$ based on experience.

P(X|Y) = the percentage of observations of blue that are also observations of dark blue that are also observations of dark blue

P(Y|X) = 1, because all observations of dark blue are observations of blue oberavtions of blue.

Probabilistic inference on sets of observations becomes interesting because, in real-world intelligence, each reasoning system collects far more observations than it can retain or efficiently access. Thus it may retain the fact that

P(dark blue | blue) = .3

without retaining the specific examples on which this observation is founded. The existence of “ungrounded” probabilistic relationships such as this leads to the need for probabilistic inference using methods like the PLN rules.

In this context we may introduce the notion of “intension,” considered broadly as “the set of attributes possessed by a term.” This is usually discussed in contrast to “extension,” which is considered as the elements in the set denoted by a term. In the absence of ungrounded probabilities, inference on sets of observations can be purely extensional. However, if a reasoning system has lost information about the elements of a set of observations, but still knows other sets the set belongs to, or various conditional probabilities relating the set to other sets, then it may reason about the set using this indirect information rather than the (forgotten) members of the set – and this reasoning may be considered “intensional.” This is a relatively simple case of intensionality, as compared to intensionality among individuals and sets thereof, which will be discussed below.

## Semantics of Individuals

While the basic semantics of PLN is founded on observations, most of the concrete PLN examples we will give in these pages involve individual entities (people, animals, countries, and so forth) rather than directly involving observations. The focus on individuals in this text reflects the level at which linguistic discourse generally operates, and shouldn’t be taken as a reflection of the level of applicability of PLN: PLN is as applicable to perception-and-action-level elementary observations as it is to abstract inferences about individuals and categories of individuals. However, reasoning about individuals is obviously a very important aspect of commonsense reasoning; and so, in this section, we give an explicit treatment of the semantics of individuals. Reasoning regarding individuals is a somewhat subtler issue than reasoning regarding observations, because the notion of an “individual” is not really a fundamental concept in mathematical logic.

Conventional approaches to formalizing commonsense inference tend to confuse things by taking individuals as logical atoms. In fact, in the human mind or the mind of any AI system with a genuine comprehension of the world, individuals are complex cognitive constructs. Observations are much more naturally taken as logical atoms, from a mathematical and philosophical and cognitive-science point of view. However, from a practical commonsense reasoning point of view, if one takes elementary observations as logical atoms, then inference regarding individuals can easily become horribly complex, because the representation of a pragmatically interesting individual in terms of elementary observations is generally extremely complex. PLN works around this problem by synthesizing individual-level and observation-level inference in a way that allows individual-level inference to occur based implicitly on observation-level semantics. This is a subtle point that is easy to miss when looking at practical PLN inference examples, in which individual- and observation-level semantics are freely intermixed in a consistent way. This free intermixture is only possible because the conceptual foundations of PLN have been set up in a proper way.

Let’s consider an example of an individual: the orange cat named Stripedog[1] who is sitting near me (Ben Goertzel) as I write these words. What is Stripedog? In PLN terms, Stripedog is first of all a complex predicate formed from elementary observations. Given a set of elementary observations, my mind can evaluate whether this set of observations is indicative of Stripedog’s presence or not. It does so by evaluating a certain predicate whose input argument is a set of elementary observations and whose output is a truth value indicating the “degree of Stripedogness” of the observation set.

At this point we may introduce the notion of the “usefulness” of an argument for a predicate, which will be important later on. If a predicate P is applied to an observation set S, and an observation O lies in S, then we say that O is important for (P, S) if removing O from S would alter the strength of the truth value of P as applied to S. Otherwise O is unimportant for (P, S). We will say that an observation set S is an identifier for P if one of two conditions holds:

Positive identifier: S contains no elements that are unimportant for (P, S), or

Negative identifier: P applied to S gives a value of 0

Of course, in addition to the observational model of Stripedog mentioned above, I also have an abstract model of Stripedog in my mind. According to this abstract model, Stripedog is a certain pattern of arrangement of molecules. The individual molecules arranged in the Stripedogish pattern are constantly disappearing and being replaced, but the overall pattern of arrangement is retained. This abstract model of Stripedog exists in my mind because I have an abstract model of the everyday physical world in my mind, and I have some (largely ungrounded) implications that tell me that when a Stripedoggish elementary observation set is presented to me, this implies that a Stripedoggish pattern of arrangement of molecules is existent in the physical world that I hypothesize to be around me.

The Stripedog-recognizing predicate, call it FStripedog, has a SatisfyingSet that we may denote simply as stripedog, defined by

ExtensionalEquivalence
Member $X stripedog AND Evaluation Fstrippeddog$X
Evalutaion isIdentifier ($Fstripdog)  The predicate isIdentifier(S,P) returns True if and only if the set S is an identifier for the predicate P, in the sense defined above. This set, Stripedog, is the set of all observation sets that are identifiers for the individual cat named Stripedog, with a fuzzy truth value function defined by the extent to which an observation set is identified as being an observation of Stripedog. ### Properties of Individuals Now, there may be many predicates that imply and/or are implied by FStripedog to various degrees. For instance there’s a predicate Fliving_being that says true whenever a living being is observed; clearly Implication FStripedog Fliving_being holds with a strength near 1 (a strength 1 so far, based on direct observation, since Stripedog has not yet been seen dead; but a strength <1 based on inference since it’s inferred that it’s not impossible, though unlikely, to see him dead). And, 'Implication FStripedog Forange<.8,.99> holds as well – the .8 being because when Stripedog is seen at night, he doesn’t look particularly orange. Those predicates that are probabilistically implied by the Stripedog-defining predicate are what we call properties of Stripedog. Note that if 'Implication F G holds, then Inheritance (SatisfyingSet F) (SatisfyingSet G) holds. So properties of Stripedog correspond to observation sets that include observations of Stripedog plus other, non-Stripedoggish observations. Notes 1. In case you’re curious, “Stripedog” is a colloquialism for “badger” -- a word that Ben Goertzel’s son Zebulon discovered in the Redwall books by Brian Jacques, and decided was an excellent name for a cute little orange kitten-cat. ## Experiential Semantics and Term Probabilities Another conceptual issue that arises in PLN related to experiential semantics is the use of term probabilities. It is reasonable to doubt whether a construct such as P(cat) or P(Stripedog) makes any common sense – as opposed to conditional probabilities denoting the probabilities of these entities in some particular contexts. In fact we believe there is a strong cognitive reason for a commonsense reasoning engine to use default term probabilities, and one that ties in with experiential semantics and merits explicit articulation. The key point is that the default probability of a term represents the probability relative to the context of the entire universe as intellectually understood by the reasoning system. This would be nebulous to define and of extremely limited utility. Rather, a term probability should represent the probability of the class denoted by the term relative to the organism’s (direct and indirect) experience. An analogy to everyday human experience may be worthwhile. Outside of formal contexts like science and mathematics, human organisms carry out commonsense reasoning within the default context of their everyday embodied life. So for instance in our everyday thinking we assume as a default that cats are more common than three-eared wombats. Though we can override this if the specific context calls for it. So, we conjecture, the term probability of “cat” in a typical human mind is shorthand for “the term probability of cat in the default context of my everyday life” – but it is not represented anything like this; rather, the “everyday life” context is left implicit. Formalistically, we could summarize the above discussion by saying that: The default term probability of X is the weighted average of the probability of X across all contexts C, where each context C is weighted by its importance to the organism. In this sense, default term probabilities become more heuristic than context-specific probabilities. And they also require concepts outside PLN for their definition, relying on the embedding of PLN in some broader embodied cognition framework such as the NCE. The conceptual reason why this kind of default node probability is useful is that doing all reasoning contextually is expensive, as there are so many contexts. So as an approximation, assuming a default experiential context is very useful. But the subtlety is that for an organism that can read, speak, listen, and so forth, the “everyday experiential context” needs to go beyond what is directly experienced with the senses. ## Conclusion In this brief chapter, beginning with elementary observations, we have built up to individuals and their properties. The semantic, conceptual notions presented here need not be invoked explicitly when reviewing the bulk of the material in this book, which concerns the mathematics of uncertain truth value estimation in PLN. However, when interpreting examples involving terms with names like “Ben” and “cat”, it is important to remember that in the context of the reasoning carried out by an embodied agent, such terms are not elementary indecomposables but rather complex constructs built up in a subtle way from a large body of elementary observations. # Chapter 4: Indefinite Truth Values Abstract In this chapter we develop a new approach to quantifying uncertainty via a hybridization of Walley’s theory of imprecise probabilities and Bayesian credible intervals. This “indefinite probability” approach provides a general method for calculating the “weight-of-evidence” underlying the conclusions of uncertain inferences. Moreover, both Walley’s imprecise beta-binomial model and standard Bayesian inference can be viewed mathematically as special cases of the more general indefinite probability model. ## Introduction One of the major issues with probability theory as standardly utilized involves the very quantification of the uncertainty associated with statements that serve as premises or conclusions of inference. Using a single number to quantify the uncertainty of a statement is often not sufficient, a point made very eloquently by Wang (Wang 2004), who argues in detail that the standard Bayesian approach does not offer any generally viable way to assess or reason about the “second-order uncertainty” involved in a given probability assignment. Probability theory provides richer mechanisms than this: one may assign a probability distribution to a statement, instead of a single probability value. But what if one doesn’t have the data to fill in a probability distribution in detail? What is the (probabilistically) best approach to take in the case where a single number is not enough but the available data doesn’t provide detailed distributional information? Current probability theory does not address this issue adequately. Yet this is a critical question if one wants to apply probability theory in a general intelligence context. In short, one needs methods of quantifying uncertainty at an intermediate level of detail between single probability numbers and fully known probability distributions. This is what we mean by the question: What should an uncertain truth-value be, so that a general intelligence may use it for pragmatic reasoning? ## From Imprecise Probabilities to Indefinite Probabilities Walley’s (Walley 1991) theory of imprecise probabilities seeks to address this issue, via defining interval probabilities, with interpretations in terms of families of probability distributions. The idea of interval probabilities was originally introduced by Keynes (Keynes 1921, 2004), but Walley’s version is more rigorous, grounded in the theory of envelopes of probability distributions. Walley’s intervals, so-called “imprecise probabilities,” are satisfyingly natural and consistent in the way they handle uncertain and incomplete information. However, in spite of a fair amount of attention over the years, this line of research has not yet been developed to the point of yielding robustly applicable mathematics. Using a parametrized envelope of (beta-distribution) priors rather than assuming a single prior as would be typical in the Bayesian approach, Walley (Walley 1991, 1996) concludes that it is plausible to represent probabilities as intervals of the form. ${\displaystyle \left[{\begin{array}{c}{\dfrac {m}{n+K}},{\dfrac {m=k}{n=k}}\end{array}}\right]}$ In this formula n represents the total number of observations, m represents the number of positive observations, and k is a parameter that Walley calls s and derives as a parameter of the beta distribution. Walley calls this parameter the learning parameter, while we will refer to it as the lookahead parameter. Note that the width of the interval of probabilities is inversely related to the number of observations n, so that the more evidence one has, the narrower the interval. The parameter k determines how rapidly this narrowing occurs. An interval of this sort is what Walley calls an “imprecise probability.” Walley’s approach comes along with a host of elegant mathematics including a Generalized Bayes’ Theorem. However it is not the only approach to interval probabilities. For instance, one alternative is Weichselberger’s (Weichselberger, 2003) axiomatic approach, which works with sets of probabilities of the form [L, U] and implies that Walley’s generalization of Bayes’ rule is not the correct one. One practical issue with using interval probabilities like Walley’s or Weichselberger’s in the context of probabilistic inference rules (such as those used in PLN) is the pessimism implicit in interval arithmetic. If one takes traditional probabilistic calculations and simplistically replaces the probabilities with intervals, then one finds that the intervals rapidly expand to [0, 1]. This fact simply reflects the fact that the intervals represent “worst case” bounds. This same problem also affects Walley’s and Weichselberger’s more sophisticated approaches, and other approaches in the imprecise probabilities literature. The indefinite probabilities approach presented here circumvents these practical problems by utilizing interval probabilities that have a different sort of semantics – closely related to, but not the same as, those of Walley’s interval probabilities. Indefinite probabilities, as we consider them here, are represented by quadruples of the form <[L, U], b, k>– thus, they contain two additional numbers beyond the [L, U] interval truth values proposed by Keynes, and one number beyond the <(L, U], k> formalism proposed by Walley. The semantics involved in assigning such a truth value to a statement S is, roughly, “I assign a probability of b to the hypothesis that, after I have observed k more pieces of evidence, the truth value I assign to S will lie in the interval [L, U].” In the practical examples presented here we will hold k constant and thus will deal with truth value triples <(L, U], b>. The inclusion of the value b, which defines the credibility level according to which [L, U] is a credible interval (for hypothesized future assignments of the probability of S, after observing k more pieces of evidence), is what allows our intervals to generally remain narrower than those produced by existing imprecise probability approaches. If b=1, then our approach essentially reduces to imprecise probabilities, and in pragmatic inference contexts tends to produce intervals [L, U] that approach [0, 1]. The use of b<1 allows the inferential production of narrower intervals, which are more useful in a real-world inference context. In practice, to execute inferences using indefinite probabilities we make heuristic distributional assumptions, assuming a “second-order” distribution which has [L, U] as a (100*b)% credible interval, and then “first-order” distributions whose means are drawn from the second-order distribution. These distributions are to be viewed as heuristic approximations intended to estimate unknown probability values existing in hypothetical future situations. The utility of the indefinite probability approach may be dependent on the appropriateness of the particular distributional assumptions to the given application situation. But in practice we have found that a handful of distributional forms seem to suffice to cover commonsense inferences (beta and bimodal forms seem good enough for nearly all cases; and here we will give only examples covering the beta distribution case). Because the semantics of indefinite probabilities is different from that of ordinary probabilities, or imprecise probabilities, or for example NARS truth values, it is not possible to say objectively that any one of these approaches is “better” than the other one, as a mathematical formalism. Each approach is better than the others at mathematically embodying its own conceptual assumptions. From an AGI perspective, the value of an approach to quantifying uncertainty lies in its usefulness when integrated with a pragmatic probabilistic reasoning engine. While complicated and dependent on many factors, this is nevertheless the sort of evaluation that we consider most meaningful. Section 4.3 deals with the conceptual foundations of indefinite probabilities, clarifying their semantics in the context of Bayesian and frequentist philosophies of probability. Section 4.4 outlines the pragmatic computational method we use for doing probabilistic and heuristic inference using indefinite probabilities. ## The Semantics of Uncertainty The main goal of this chapter is to present indefinite probabilities as a pragmatic tool for uncertain inference, oriented toward utilization in AGI systems. Before getting practical, however, we will pause in this section to discuss the conceptual, semantic foundations of the “indefinite probability” notion. In the course of developing the indefinite probabilities approach, we found that the thorniest aspects lay not in the mathematics or software implementation, but rather in the conceptual interpretation of the truth values and their roles in inference. In the philosophy of probability, there are two main approaches to interpreting the meaning of probability values, commonly labeled frequentist and Bayesian (Stanford Encyclopedia of Philosophy 2003). There are many shades of meaning to each interpretation, but the essential difference is easy to understand. The frequentist approach holds that a probability should be interpreted as the limit of the relative frequency of an event-category, calculated over a series of events as the length of the series tends to infinity. The subjectivist or Bayesian approach holds that a probability should be interpreted as the degree of belief in a statement, held by some observer; or in other words, as an estimate of how strongly an observer believes the evidence available to him supports the statement in question. Early proponents of the subjectivist view were Ramsey (Ramsey 1931) and de Finetti (de Finetti 1974-75), who argued that for an individual to display self-consistent betting behavior they would need to assess degrees of belief according to the laws of probability theory. More recently Cox’s theorem (Cox 1946) and related mathematics (Hardy 2002) have come into prominence as providing a rigorous foundation for subjectivist probability. Roughly speaking, this mathematical work shows that if the observer assessing subjective probabilities is to be logically consistent, then their plausibility estimates must obey the standard rules of probability. From a philosophy-of-AI point of view, neither the frequentist nor the subjectivist interpretations, as commonly presented, is fully satisfactory. However, for reasons to be briefly explained here, we find the subjectivist interpretation more acceptable, and will consider indefinite probabilities within a subjectivist context, utilizing relative frequency calculations for pragmatic purposes but giving them an explicitly subjectivist rather than frequentist interpretation. The frequentist interpretation is conceptually problematic in that it assigns probabilities only in terms of limits of sequences, not in terms of finite amounts of data. Furthermore, it has well-known difficulties with the assignment of probabilities to unique events that are not readily thought of as elements of ensembles. For instance, what was the probability, in 1999, of the statement S holding that “A great depression will be brought about by the Y2K problem”? Yes, this probability can be cast in terms of relative frequencies in various ways. For instance, one can define it as a relative frequency across a set of hypothetical “possible worlds”: across all possible worlds similar to our own, in how many of them did the Y2K problem bring about a great depression? But it’s not particularly natural to assume that this is what an intelligence must do in order to assign a probability to S. It would be absurd to claim that, in order to assign a probability to S, an intelligence must explicitly reason in terms of an ensemble of possible worlds. Rather, the claim must be that whatever reasoning a mind does to evaluate the probability of S may be implicitly interpreted in terms of possible worlds. This is not completely senseless, but is a bit of an irritating conceptual stretch. The subjectivist approach, on the other hand, is normally conceptually founded either on rational betting behaviors or on Cox’s theorem and its generalizations, both of which are somewhat idealistic. No intelligent agent operating within a plausible amount of resources can embody fully self-consistent betting behavior in complex situations. The irrationality of human betting behavior is well known; to an extent this is due to emotional reasons, but there are also practical limitations on the complexity of the situation in which any finite mind can figure out the correct betting strategy. And similarly, it is too much to expect any severely resource-constrained intelligence to be fully self-consistent in the sense that the assumptions of Cox’s theorem require. In order to use Cox’s theorem to justify the use of probability theory by practical intelligences, it seems to us, one would need to take another step beyond Cox, and argue that if an AI system is going to have a “mostly sensible” measure of plausibility (i.e., if its deviation from Cox’s axioms are not too great), then its intrinsic plausibility measure must be similar to probability. We consider this to be a viable line of argument, but will pursue this point in another paper – to enlarge on such matters here would take us too far afield. Walley’s approach to representing uncertainty is based explicitly on a Bayesian, subjectivist interpretation; though whether his mathematics has an alternate frequentist interpretation is something he has not explored, to our knowledge. Similarly, our approach here is to take a subjectivist perspective on the foundational semantics of indefinite probabilities (although we don’t consider this critical to our approach; quite likely it could be given a frequentist interpretation as well). Within our basic subjectivist interpretation, however, we will frequently utilize relative frequency calculations when convenient for pragmatic reasoning. This is conceptually consistent because within the subjectivist perspective there is still a role for relative frequency calculations, so long as they are properly interpreted. Specifically, when handling a conditional probability P(A|B), it may be the case that there is a decomposition B=B1+...+Bn so that the Bi are mutually exclusive and equiprobable, and each of P(A|Bi) is either 0 or 1. In this case the laws of probability tell us P(A|B) = P(A|B1) P(B1| B) + ... + P(A|Bn) P(Bn|B) = (P(A|B1) + ... +P(A|Bn))/n, which is exactly a relative frequency. So, in the case of statements that are decomposable in this sense, the Bayesian interpretation implies a relative frequency based interpretation (but not a “frequentist” interpretation in the classical sense). For decomposable statements, plausibility values may be regarded as the means of probability distributions, where the distributions may be derived via subsampling (sampling subsets C of {B1,...,Bn}, calculating P(A|C) for each subset, and taking the distribution of these values; as in the statistical technique known as bootstrapping). In the case of the “Y2K” statement and other similar statements regarding unique instances, one option is to think about decomposability across possible worlds, which is conceptually controversial. ## Indefinite Probability We concur with the subjectivist maxim that a probability can usefully be interpreted as an estimate of the plausibility of a statement, made by some observer. However, we suggest introducing into this notion a more careful consideration of the role of evidence in the assessment of plausibility. We introduce a distinction that we feel is critical, between • the ordinary (or “definite”) plausibility of a statement, interpreted as the degree to which the evidence already (directly or indirectly) collected by a particular observer supports the statement. • the “indefinite plausibility” of a statement, interpreted as the degree to which the observer believes that the overall body of evidence potentially available to him supports the statement. The indefinite plausibility is related to the ordinary plausibility, but also takes into account the potentially limited nature of the store of evidence collected by the observer at a given point in time. While the ordinary plausibility is effectively represented as a single number, the indefinite plausibility is more usefully represented in a more complex form. We suggest to represent an indefinite plausibility as a quadruple <(L, U], b, k>, which when attached to a statement S has the semantics “I assign an ordinary plausibility of b to the statement that ‘Once k more items of evidence are collected, the ordinary plausibility of the statement S will lie in the interval [L, U]’.” Note that indefinite plausibility is thus defined as “second-order plausibility” – a plausibility of a plausibility. As we shall see in later sections of the paper, for most computational purposes it seems acceptable to leave the parameter k in the background, assuming it is the same for both the premises and the conclusion of an inference. So in the following we will speak mainly of indefinite probabilities as <(L, U], b> triples, for sake of simplicity. The possibility does exist, however, that in future work inference algorithms will be designed that utilize k explicitly. Now, suppose we buy the Bayesian argument that ordinary plausibility is best represented in terms of probability. Then it follows that indefinite plausibility is best represented in terms of second-order probability; i.e., as “I assign probability b to the statement that ‘Once k more items of evidence have been collected, the probability of the truth of S based on this evidence will lie in the interval [L, U]’.” ### An Interpretation in Terms of Betting Behavior To justify the above definition of indefinite probability more formally, one approach is to revert to betting arguments of the type made by de Finetti in his work on the foundations of probability. As will be expounded below, for computational purposes we have taken a pragmatic frequentist approach based on underlying distributional assumptions. However, for purposes of conceptual clarity, a more subjectivist de Finetti style justification is nevertheless of interest. So, in this subsection we will describe a “betting scenario” that leads naturally to a definition of indefinite probabilities. Suppose we have a category C of discrete events; e.g., a set of tosses of a certain coin, which has heads on one side and tails on the other. Next, suppose we have a predicate S, which is either True or False (Boolean values) for each event within the above event-category C. For example, if C is a set of tosses of a certain coin, then S could be the event “Heads.” S is a function from events into Boolean values. If we have an agent A, and the agent A has observed the evaluation of S on n different events, then we will say that n is the amount of evidence that A has observed regarding S; or we will say that A has made n observations regarding S. Now consider a situation with three agents: the House, the Gambler, and the Meta-gambler. As the name indicates, House is going to run a gambling operation, involving generating repeated events in category C, and proposing bets regarding the outcome of future events in C. More interestingly, House is also going to propose bets to Meta-gambler regarding the behavior of Gambler. Specifically, suppose House behaves as follows. After Gambler makes n observations regarding S, House offers Gambler the opportunity to make what we‘ll call a “de Finetti” type bet regarding the outcome of the next observation of S. That is, House offers Gambler the opportunity: You must set the price of a promise to pay$1 if the next observation of S comes out ::True, and $0 if it does not. You must commit that I will be able to choose to either buy ::such a promise from you at the price you have set, or to require you to buy such a ::promise from me. In other words: you set the odds, but I decide which side of the bet ::will be yours. Assuming Gambler does not want to lose money, the price Gambler sets in such a bet is the “operational subjective probability” that Gambler assigns that the next observation of S will come out True. As an aside, House might also offer Gambler the opportunity to bet on sequences of observations; e.g., it might offer similar “de Finetti” price-setting opportunities regarding predicates like “The next 5 observations of S made will be in the ordered pattern (True, True, True, False, True).” In this case, things become interesting if we suppose Gambler thinks that: For each sequence Z of {True, False} values emerging from repeated observation of S, any permutation of Z has the same (operational subjective) probability as Z. Then, Gambler thinks that the series of observations of S is “exchangeable,” which means intuitively that S’s subjective probability estimates are really estimates of the “underlying probability of S being true on a random occasion.” Various mathematical conclusions follow from the assumption that Gambler does not want to lose money, combined with the assumption that Gambler believes in exchangeability. Next, let’s bring Meta-gambler into the picture. Suppose that House, Gambler and Meta-gambler have all together been watching n observations of S. Now, House is going to offer Meta-gambler a special opportunity. Namely, he is going to bring Meta-gambler into the back room for a period of time. During this period of time, House and Gambler will be partaking in a gambling process involving the predicate S. Specifically, while Meta-gambler is in the back room, House is going to show Gambler k new observations of S. Then, after the k’th observation, House is going to come drag Meta-gambler out of the back room, away from the pleasures of the flesh and back to the place where gambling on S occurs. House then offers Gambler the opportunity to set the price of yet another de Finetti style bet on yet another observation of S. Before Gambler gets to set his price, though, Meta-gambler is going to be given the opportunity of placing a bet regarding what price Gambler is going to set. Specifically, House is going to allow Meta-gambler to set the price of a de Finetti style bet on a proposition of Meta-gambler’s choice, of the form: Q = “Gambler is going to bet an amount p that lies in the interval [L, U]” For instance Meta-gambler might propose Let Q be the proposition that Gambler is going to bet an amount lying in [.4, .6] on ::this next observation of S. I’ll set at 30 cents the price of a promise defined as ::follows: To pay$1 if Q comes out True, and \$0 if it does not. I will commit that you ::will be able to choose either to buy such a promise from me at this price, or to ::require me to buy such a promise from you.

I.e., Meta-Gambler sets the price corresponding to Q, but House gets to determine which side of the bet to take. Let us denote the price set by Meta-gambler as b; and let us assume that Meta-gambler does not want to lose money. Then, b is Meta-gambler’s subjective probability assigned to the statement that:

“Gambler’s subjective probability for the next observation of S being True lies in ::[L, U].”

But, recall from earlier that the indefinite probability <[L, U], b, k> attached to S means that:

“The estimated odds are b that after k more observations of S, the estimated ::probability of S will lie in [L, U].”

or in other words

“[L, U] is a b-level credible interval for the estimated probability of S after k ::more observations.”

In the context of an AI system reasoning using indefinite probabilities, there is no explicit separation between the Gambler and the Meta-gambler; the same AI system makes both levels of estimate. But this is of course not problematic, so long as the two components (first-order probability estimation and b-estimation) are carried out separately.

One might argue that this formalization in terms of betting behavior doesn’t really add anything practical to the indefinite probabilities framework as already formulated. At minimum, however, it does make the relationship between indefinite probabilities and the classical subjective interpretation of probabilities quite clear.

### A Pragmatic Frequentist Interpretation

Next, it is not hard to see how the above-presented interpretation of an indefinite plausibility can be provided with an alternate justification in relative frequency terms, in the case where one has a statement S that is decomposable in the sense described above. Suppose that, based on a certain finite amount of evidence about the frequency of a statement S, one wants to guess what one’s frequency estimate will be once one has seen a lot more evidence. This guessing process will result in a probability distribution across frequency estimates – which may itself be interpreted as a frequency via a “possible worlds” interpretation. One may think about “the frequency, averaged across all possible worlds, that we live in a world in which the observed frequency of S after k more observations will lie in interval I.” So, then, one may interpret <[L, U], b, N> as meaning “b is the frequency of possible worlds in which the observed frequency of S, after I’ve gathered k more pieces of evidence, will lie in the interval [L, U].”

This interpretation is not as conceptually compelling as the betting-based interpretation given above – because bets are real things, whereas these fictitious possible worlds are a bit slipperier. However, we make use of this frequency-based interpretation of indefinite probabilities in the practical computational implementation of indefinite probability presented in the following sections – without, of course, sacrificing the general Bayesian interpretation of the indefinite probability approach. In the end, we consider the various interpretations of probability to be in the main complementary rather than contradictory, providing different perspectives on the same very useful mathematics.

Moving on, then: To adopt a pragmatic frequency-based interpretation of the second-order plausibility in the definition of indefinite plausibility, we interpret “I assign probability b to the statement that ‘Once k more items of evidence are collected, the probability of the truth of S based on this evidence will lie in the interval [L, U]’” to mean “b is the frequency, across all possible worlds in which I have gathered k more items of evidence about S, of worlds in which the statement ‘the estimated probability of S lies in the interval [L, U]’ is true.”This frequency-based interpretation allows us to talk about a probability distribution consisting of probabilities assigned to values of ‘the estimated probability of S,’ evaluated across various possible worlds. This probability distribution is what, in the later sections of the paper, we call the “second-order distribution.” For calculational purposes, we assume a particular distributional form for this second-order distribution.

Next, for the purpose of computational implementation, we make the heuristic assumption that the statement S under consideration is decomposable, so that in each possible world, “the estimated probability of S may be interpreted as the mean of a probability distribution. For calculational purposes, in our current implementation we assume a particular distributional form for these probability distributions, which we refer to as “the first-order distributions.”

The adoption of a frequency-based interpretation for the second-order plausibility seems hard to avoid if one wants to do practical calculations using the indefinite probabilities approach. On the other hand, the adoption of a frequency-based interpretation for the first-order plausibilities is an avoidable convenience, which is appropriate only in some situations. We will discuss below how the process of reasoning using indefinite probabilities can be simplified, at the cost of decreased robustness, in cases where decomposability of the first order probabilities is not a plausible assumption.

So, to summarize, in order to make the indefinite probabilities approach computationally tractable, we begin by restricting attention to some particular family D of probability distributions. Then, we interpret an interval probability attached to a statement as an assertion that: “There is probability b that the subjective probability of the statement, after I have made k more observations, will appear to be drawn from a distribution with a mean in this interval.”

Then, finally, given this semantics and a logical inference rule, one can ask questions such as: “If each of the premises of my inference corresponds to some interval, so that there is probability b that after k more observations the distribution governing the premise will appear to have a mean in that interval, then what is an interval so that b of the family of distributions of the conclusion have means lying in that interval?” We may then give this final interval the interpretation that, after k more observations, there is a probability b that the conclusion of the inference will appear to lie in this final interval. (Note that, as mentioned above, the parameter k essentially “cancels out” during inference, so that one doesn’t need to explicitly account for it during most inference operations, so long as one is willing to assume it is the same in the premises and the conclusion.)

In essence, this strategy merges the idea of imprecise probabilities with the Bayesian concept of credible intervals; thus the name “indefinite probabilities” (“definite” having the meaning of “precise,” but also the meaning of “contained within specific boundaries” – Walley’s probabilities are contained within specific boundaries, whereas ours are not).

As hinted above, however, the above descriptions mask the complexity of the actual truth-value objects. In the indefinite probabilities approach, in practice, each IndefiniteTruthValue object is also endowed with three additional parameters:

• An indicator of whether [L, U] should be considered as a symmetric or asymmetric credible interval.
• A family of “second-order” distributions, used to govern the second-order plausibilities described above.
• A family of “first-order” distributions, used to govern the first-order plausibilities described above.

Combined with these additional parameters, each truth-value object essentially provides a compact representation of a single second-order probability distribution with a particular, complex structure.

## Truth-Value Conversions

In our current implementation, we usually use <[L, U], b, k> IndefiniteTruthValues for inference. For other purposes however, it is necessary to convert these truth values into SimpleTruthValues, in either (s,w) or (s,n) form. We now derive a conversion formula for translating indefinite truth values into simple truth values. In order to carry out the derivation additional assumptions must be made, which is why the formula derived here must be considered “heuristic” from the point of view of applications. When the underlying distributional assumptions apply, the formula is exact, but these assumptions may not always be realistic.

### Calculation of Approximate Conversion Formulas

To derive conversion formulas we assume that the distributions underlying the means within the [L, U] intervals of the indefinite truth values are beta distributions. Due to the conjugacy of the beta and binomial distributions, this means we can model these means as corresponding to Bernoulli trials.

In order to derive approximate formulas, we first consider the problem “backwards”: Given b, n, and k, how can we derive [L, U] from an assumption of an underlying Bernoulli process with unknown probability p? We then reverse the process to obtain an approach for deriving n given b, k and [L, U].

Theorem: Suppose there were x successes in the first n trials of a binomial process with an unknown fixed probability p for success. Suppose further that the prior probability density f(p=a) is uniform. Then the probability that there will be ${\displaystyle (x+X)}$ successes in the first ${\displaystyle (n+k)}$ trials is given by

${\displaystyle {\dfrac {(n+1){\dbinom {k}{X}}{\dbinom {n}{x}}}{(k+n+1){\dbinom {k+n}{X+x}}}}}$

where ${\displaystyle {\dbinom {}{\quad }}{}}$ indicates the binomial coefficent.

Proof: The Probability

${\displaystyle p(y\quad {\text{successes in}}\quad n+k\quad {\text{trials}}|x\quad {\text{successes in n trials}})}$

is the same as

${\displaystyle P(y-x\quad {\text{successes in k trials}})}$.

Letting ${\displaystyle X=y-x}$ and assuming probability p for success on each trial then this probability would be given by

${\displaystyle {\dbinom {k}{X}}P^{X}(1-P)^{k-X}}$.

The probability densities for p can be found Baye;s theorem, which states Probabilistic Logic Networks

${\displaystyle f{\Bigg (}p=a|s={\dfrac {x}{n}}{\Bigg )}={\dfrac {f{\Bigg (}s={\dfrac {x}{n}}|p=a{\Bigg )}f(p=a)}{f{\Bigg (}s={\dfrac {x}{n}}{\Bigg )}}}}$ .............(1)

Here ${\displaystyle f()}$ denotes the appropraite probability density.

Now ${\displaystyle f(p=a)=1}$ since n is fixed, ${\displaystyle f{\Bigg (}s={\dfrac {x}{n}}{\Bigg )}=f(x\quad {\text{success in n trials}}),}$ so

${\displaystyle f{\Bigg (}s={\dfrac {x}{n}}{\Bigg )}=\int _{O}^{1}{\dbinom {n}{x}}p^{x}(1-p)^{n-x}dp={\dfrac {1}{n+1}}}$ .............(2)

Hence, ${\displaystyle p(X\quad {\text{successes in k trials}})}$

${\displaystyle =(n+1)\int 0^{1}{\dbinom {k}{X}}{\dbinom {n}{x}}p^{X+x}(1-P)^{k+1-(X+x)}dp}$

${\displaystyle ={\dfrac {(n+1){\dbinom {k}{X}}{\dbinom {n}{x}}}{(k+n+1){\dbinom {k+n}{X+x}}}}}$.................................................................(3)

The theorem gives a distribution based on n, k and x; and then, applying b, we can find a symmetric credible interval [L, U] about s=x/n based on this distribution.

Due to small deviations arising from integer approximations, given L, U, b and k, the reverse process is somewhat trickier. We now outline two approximate inverse procedures. We first exhibit a heuristic algorithmic approach. From the results of this heuristic approach we then develop an approximate inverse function.

#### Heuristic Algorithmic Approach

1. Let ${\displaystyle L_{1}={\dfrac {[(n+k)L+0.5]}{n+k}},{\text{and}}\quad U_{1}={\dfrac {[(n+K)U+0.5]}{n+k}}}$

2. Calculate ${\displaystyle b_{1}={\dfrac {n+1}{k+n+1}}\\\sum _{m=l}^{m=l+1}{\dfrac {{\binom {k}{m}}{\binom {n}{x}}}{\binom {k+n}{m+x}}}}$, where

${\displaystyle x={\dfrac {[n(L+U)+0.5]}{2}}}$, ${\displaystyle l=[(n+k)L+0.5]}$, and
${\displaystyle r=(n+k)(U_{1}-L_{1})}$

3. Form the function ${\displaystyle v(n)=d[<[L1,U1],b1>,<[L,U],b>]}$, where ${\displaystyle d[a,b]}$ is the standard Eculidean distance from a to b.

4. Find the value of n that minimizes the function ${\displaystyle v(n)}$.

Aside from small deviations arising from integer approximations, n depends inversely on the width U-L of the interval ${\displaystyle [L,U]}$. To find the n-value in step 4 we initially perform a search by setting n=2j for a sequence of j values, until we obtain a b value that indicates we have surpassed the correct n value. We then perform a binary search between this maximum value N and N/2. We thus guarantee that the actual algorithm is of order O(log n).

#### Approximate Inverse Function Approach

As an alternate and faster approach to finding n, we develop a function of L, U, k, and b that provides a reasonable approximation for n. We begin by plotting the cumulative probabilities given by equation (3) for various values of n, by following the first two steps of the heuristic approach above. Aside from small deviations caused by the discrete nature of the cumulative distribution functions, each graph can be approximately modeled by a function of the form

${\displaystyle {\text{prob}}=1-b=1-Ae^{Bn}}$. Inverting, we obtain${\displaystyle n=-{\dfrac {\ln {\left\lbrack {\frac {b}{A}}\right\rbrack }}{B}}}$.

For simplicity, we model the dependence of the coefficients A and B upon the values of L, U, and k, linearly. From the data we gathered this appears to be a reasonable assumption, though we have not yet derived an analysis of the error of these approximations. We will use the notation ${\displaystyle A=A_{1}U+A_{2}}$, ${\displaystyle B=B_{1}U+B_{2}}$, ${\displaystyle A_{i}=A_{i1}L+A_{i2}}$, ${\displaystyle B_{i}=B_{i1}L+B_{i2}}$, ${\displaystyle A_{ij}=A_{ij1}k+A_{ij2}}$, and ${\displaystyle B_{ij}=B_{ij1}k+B_{ij2}}$, where ${\displaystyle 1\leq i,j,k\leq 2}$. Putting everything together we end up with

${\displaystyle A=(A_{11}L+A_{12})U}$${\displaystyle +(A_{21}L+A_{22})}$

${\displaystyle =[(A_{111}k+A_{112})L+A_{121}k+A_{122}]U+[(A_{211}k+A_{21})L+A_{221}k+A_{222}]}$

and

${\displaystyle B=(B_{11}L+B_{12})U+(B_{21}L+B_{22})}$ ${\displaystyle =[(B_{111}k+B_{112})L+B_{121}k+B_{122}]U+[(B_{211}K+B_{212})L+B_{221}k+B_{222}]}$

Finding the values of the coefficients Aij and B yields the following values:

 A111= -0.00875486 A112= -2.35064019 A121= 0.002463011 A122= -0.220372781 A211= 0.010727656 A212= 2.803020516 A221= -0.003647227 A222= 0.437068392 B111= 0.003032946 B112= -0.399778839 B121= -0.004302594 B122= 0.930153781 B211= 0.002803518 B212= -0.593689012 B221= -0.000265616 B222= -0.071902027

Observing that the dependence upon k is relatively negligible compared to the dependence upon L, U, and b, we can alternatively eliminate the k-dependence and use instead fixed values for A11=-2.922447948, A12=-0.072074201, A21=3.422902859, A22=0.252879708, B11=-0.261903438, B12=0.716893418, B21=-0.39831749, and B22=-0.107482534.

### Further Development of Indefinite Probabilities

In this chapter we have presented the basic idea of indefinite truth values. The purpose of the indefinite truth value idea, of course, lies in its utilization in inference, which is left for later chapters. But our hope is that in this chapter we have conveyed the essential semantics of indefinite probabilities, which is utilized, along the mathematics, in the integration of indefinite probabilities with inference rules. Some of the inferential applications of indefinite probabilities we encounter in later chapters will be fairly straightforward, such as their propagation through deductive and Bayesian inference. Others will be subtler, such as their application in the context of intensional or quantifier inference. In all cases, however, we have found the indefinite probability notion useful as a summary measure of truth value in a pragmatic inference context. In some cases, of course, a summary approximation won’t do and one actually needs to retain one or more full probability distributions rather than just a few numbers giving a rough indication. But in reality one can’t always use a full representation of this nature due to restrictions on data, memory, and processing power; and thus we have placed indefinite probabilities in a central role within PLN. As compared with simpler summary truth values such as single probability numbers or (probability, weight of evidence) pairs, they seem to provide a better compromise between compactness and accuracy.

# Chapter 5: First-Order Extensional Inference-Rules and Strength Formulas

Abstract In this chapter we launch into the “meat” of PLN: the specific PLN inference rules, and the corresponding truth-value formulas used to determine the strength of the conclusion of an inference rule from the strengths of the premises. Inference rules and corresponding truth-value strength formulas comprise a large topic; in this chapter we deal with a particular sub-case of the problem: first-order extensional inference.

## Introduction

Recall that first-order inference refers to inference on relationships between terms (rather than on predicates, or relationships between relationships), and that extensional inference refers to inference that treats terms as denoting sets with members (as opposed to intensional inference, which treats terms as denoting entities with properties). These first-order extensional rules and truth-value formulas turn out to be the core PLN rules truth-value formulas, in the sense that most of the rules and formulas for handling higher-order and/or intensional inference and/or weight of evidence are derived as re-interpretations of the first-order extensional rules and associated strength formulas. Higher-order inference, intensional inference, and weight of evidence formulas are handled separately in later chapters.

## Independence-Assumption-Based PLN Deduction

In this section we present one version of the PLN strength formula for first-order extensional deduction (abbreviated to “deduction” in the remainder of the chapter). First we give some conceptual discussion related to this inference formula; then we provide the algebraic formula. The formula itself is quite simple; however it is important to fully understand the concepts underlying it, because it embodies simplifying assumptions that introduce errors in some cases. We will also be reviewing an alternate strength formula for first-order extensional deduction, in a later section of this chapter, which mitigates these errors in certain circumstances.

Conceptually, the situation handled by the first-order extensional deduction formula is depicted in the following Venn diagram:

First-order extensional deduction, as well as the related inference forms we call induction and abduction in PLN, may be cast in the form: Given information about the size of some regions in the Venn diagram, make guesses about the size of other regions.

### Conceptual Underpinnings of PLN Deduction

In this subsection, as a preliminary to presenting the PLN inference formulas in detail, we will discuss the concepts underlying PLN in a more abstract way, using simple inference examples to demonstrate what it is we really mean by “probabilistic deduction”. This conceptual view is not needed for the actual calculations involved in PLN, but it’s essential to understanding the semantics underlying these calculations, and is also used in the proof of the PLN deduction formula.

Let’s consider some simple examples regarding Subset relationships. Supposing, in the above diagram, we know

Subset A B <.5>
Subset B C <.5>


What conclusions can we draw from these two relationships? What we want is to derive a relation of the form

Subset A C <tv>


from the two given premises. When we do this however, we are necessarily doing probabilistic, estimative inference, not direct truth-value evaluation.

To see why, suppose Bis a set with two elements; e.g.,

B = {x1,x2}

so that

Member x1 B <1>
Member x2 B <1>


Let’s consider two of the many possible cases that might underlie the above Sub-set relationships:

The problem is that, just given the two premises

Subset A B <.5>
Subset B C <.5>


we don’t know which of the two above cases holds – or if in fact it’s a completely different situation underlying the Subset relationships. Different possible situations, underlying the same pair of Subset relationships, may result in completely different truth-values <tv> for the conclusion

Subset A C <tv>


So no exact computation of the truth-value of the conclusion is possible. Instead, all that’s possible is an estimate of the truth-value of the latter, obtained by averaging over possible situations consistent with the two given premises. The PLN deduction strength formula to be given below is merely an efficient way of carrying out this averaging, in an approximate way.

We will derive theorems giving compact formulas for the expected (average) truth-values to be obtained as the answer to the above deduction problem. We approach this averaging process in two ways. First, we derive inference formulas for an “independence-based PLN,” in which we average over all possible sets satisfying given premises. Later, we introduce a more general and more realistic “concept-geometry” approach. In the concept-geometry approach, we restrict attention to only those sets having particular shapes. The idea here is that concepts tend to be better approximated by these shapes than by random sets.

One important point regarding PLN inference is that the evidence sets used to compute the truth-values of two premises need not be overlapping. We can freely combine two relationships that were derived based on different sets of observations. This is important because many real-world cases fit this description. For instance, when a person reasons

Inheritance gulls birds
Inheritance birds fly
|-
Inheritance gulls fly


it may well be the case that the two premises were formed based on different sets of evidence. The gulls that were observed to be birds may well be different from the birds that were observed to fly. A simple example of non-overlapping evidence, related to our above Subset example with the sets A, B, and C, is:

 Case 3(non-overlapping evidence sets)  A= {x1, x4, x7, x8} B= {x1, x2, x3, x5, x6} C= {x2, x3, x4}  In this case, direct truth-value evaluation yields  Subset A C <.25>  Now, suppose that truth-value estimates for the two relations Subset A B <.5> Subset B C <.5>  were obtained as follows: Subset A B <.5>  was derived by observing only {x1, x7}, whereas Subset B C <.5>  was derived by observing only {x2, x6}

In Case 3, the two premise Subsets were derived based on completely different sets of evidence. But the inference rules don’t care; they can make their estimates anyway. We will present this “not caring” in a formal way a little later when we discuss the weight of evidence rules for first-order inference.

Now let’s step through the basic logic by which PLN deals with these examples of inference involving Subsets, with all their concomitant subcases. Basically, as noted above, what PLN does is to estimate the outcome of an average over all possible cases. The PLN formulas carry out this estimation in a generally plausible way, based on the information available.

More formally, what does this mean? First we’ll present a slight simplification, then un-simplify it in two stages. In addition to these simplifications, we will also assume a strength-only representation of truth-values, deferring consideration of more complex truth-value types until later.

In the simplified version what PLN says in this example is, basically: Let’s assume a universe U that is a finite size Usize. Let V denote the set of all triples of sets {A, B, C} in this universe, such that

Subset A B <.5>
Subset B C <.5>


holds. For each triple in V, we can compute the value S so that

Subset A C <S>


The average of all these strength-estimates is the estimated truth-value strength of the Subset. Note that this average includes cases where ${\displaystyle A\cap B}$ and ${\displaystyle B\cap C}$ have no overlap at all, cases where ${\displaystyle A\cap B}$ and ${\displaystyle B\cap C}$ are identical, cases where A and C are identical – all possible cases given the assumed finite universe U and the constraints posed by the premise Subsets. If the two premise Subsets are drawn from different evidence sets, this doesn’t matter – the degree of evidence-set overlap is just one among many unknown properties of the evidence sets underlying the premise truth-values.

Now, how is this picture a simplification? The first way is that we haven’t introduced all the potentially available information. We may have knowledge about the truth-values of A, B, and C, as well as about the truth-values of the Subset relationships. In fact, this is the usual case. Suppose that SA, SB, and SC are the strengths of the truth-values of the Atoms A, B, and C. In that case, we can redefine the set V specified above; we can define it as the set of all triples of sets {A, B, C} so that

|A|= SA USize
|B|= SB USize
|c|= SC USize
Subset A B <.5>
Subset A C <.5>


hold. Here Usize is the “universe size,” the size of the total space of entities under consideration. We can then compute the truth-value of

Subset A C <tv>


as the average of the estimates obtained for all triples in V.

Next, the other way in which the above picture is a simplification is that it assumes the strength values of the premises (and the strength values of the Atoms A, B, and C) are exactly known. In fact, these will usually be estimated values; and if we’re using indefinite truth-values, distributional truth-values, or confidence-bearing truth-values, knowledge about the “estimation error” may be available. In these cases, we are not simply forming a set V as above. Instead, we are looking at a probability distribution V over all triples {A, B, C} where A, B, and C are subsets of U, and the quantities

|A| (determined from SA= A. TruthValue. strength)
|B| (determined from SB= B. TruthValue. strength)
|C| (determined from SC= C. TruthValue. strength)
(Subset A B). TruthValue.strength
(Subset B C). TruthValue.strength


are drawn from the probability distributions specified by the given truth-values. We will deal with this uncertainty below by doing a sensitivity analysis of the PLN inference formulas, estimating for each formula the error that may ensue from uncertain inputs, and discounting the weight of evidence associated with the conclusion based on this estimated error.

Finally, one technical point that comes up in PLN is that the quantitative results of truth-value estimates depend on the finite universe size Usize that is assumed. This parameter is also called the “context size.” Basically, the smaller Usize is, the more speculative the inferences are. In the example given above, the minimum usable Usize is ${\displaystyle Usize=|U|=|A|+|B|+|C|}$. This parameter setting is good if one wants to do speculative inductive or abductive inference. If one wants to minimize error, at cost of also minimizing creativity, one should set Usize as large as possible. Using the PLN formulas, there is no additional computational cost in assuming a large Usize; the choice of a Usize value can be based purely on the desired inferential adventurousness. The semantics of the universal set will be discussed further in a later subsection.

### The Independence-Assumption-Based Deduction Rule

Now we proceed to give a heuristic derivation of one of the two truth-value strength formulas commonly associated with the PLN deduction rule. In this inference rule, we want to compute the “strength” SAC of the relationship

Subset A C <SAC


which we interpret as

SAC ${\displaystyle =P(C|A)={\dfrac {P(A\cap C)}{P(A)}}}$

given the data:

SAB ${\displaystyle =P(B|A)}$

SBC ${\displaystyle =P(C|B)}$

SA ${\displaystyle =P(A)}$

SB ${\displaystyle =P(B)}$

SC ${\displaystyle =P(C)}$

Essentially, that means we want to guess the size of the Venn diagram region ${\displaystyle A\cap C}$ given information about the sizes of the other regions A,B,C,${\displaystyle A\cap B}$,${\displaystyle B\cap C}$.

As illustrated in the above example, in the following discussion we will sometimes use the notation:

• sij’s are strengths of Subset relationships (e.g., Subset ij<sij>)
• si’s are strengths of terms (i.e., i<si>)

This notation is handy for the presentation of algebraic relationships involving Atom strengths, a situation that often arises with PLN. We will also use Nij and Ni to denote relationship and term “counts” respectively, and dij and dj to denote relationship and term “weights of evidence.”

Whenever the set of values $\displaystyle \{S_{A},S_{B},S_{C},S_{A}_{B},S_{B}_{C}\}$ is consistent (i.e., when it corresponds to some possible sets A, B and C) then the PLN deduction strength formula becomes

Equation 2.1

${\displaystyle S_{AC}=S_{AB}+{\frac {(1-S_{AB})(S_{C}S_{B}S_{BC})}{1-S_{B}}}}$

Here we will give a relatively simple heuristic proof of this formula, and then (more tediously) a fully rigorous demonstration.

As we shall see a little later, formulas for inductive and abductive reasoning involve similar problems, where one is given some conditional and absolute probabilities and needs to derive other, indirectly related ones. For induction, one starts with SBA, SBC, SA, SB, SC and wishes to derive SAC. For abduction, one starts with SAB, SCB, SA, SB, SC and wants to derive SAC. The inference formulas involving similarity relations may be similarly formulated.

#### Heuristic Derivation of the Independence-Assumption-Based Deduction Rule

The heuristic derivation that we’ll give here relies on a heuristic independence assumption. The rigorous derivation given afterward replaces the appeal to an independence assumption with an averaging over all possible worlds consistent with the constraints given in the premises. But of course, even the rigorous proof embodies some a prior assumptions. It assumes that the only constraints are the ones implicitly posed by the truth-values of the premise terms and relationships, and that every possible world consistent with these truth-values is equally likely. If there is knowledge of probabilistic dependency, this constitutes a bias on the space of possible worlds, which renders the assumption of unbiased independence invalid. Knowledge of dependency can be taken into account by modifying the inference formula, in a way that will be discussed below. The danger is where there is a dependency that is unknown; in this case the results of inference will not be accurate, an unavoidable problem.

Assume we’re given ${\displaystyle P(A),P(B),P(C),P(B|A),P(C|B),}$ defined relative to some universal set U. We want to derive ${\displaystyle P(C|A)}$; i.e., the formula for ${\displaystyle P(C|A)=S_{AC}}$, which was cited above. We begin the derivation by observing that

${\displaystyle P(C|A)={\dfrac {P(C\cap A)}{P(A)}}}$

Because ${\displaystyle P(A)}$ is assumed given, we may focus on finding ${\displaystyle P(C\cap A)}$ in terms of the other given quantities.

We Know

${\displaystyle P(C\cap A)=P(B)P((C\cap A)|B)+P(U-B)p((C\cap A)|U-B)}$

This follows because in general

${\displaystyle P(B)P(X|B)+P(U-B)P(X|U-B)={\dfrac {P(B)P(X\cap B)}{P(B)}}+{\dfrac {P(U-B)P(X\cap (U-B)}{P(U-B)}}}$

${\displaystyle =P(X\cap B)+P(X\cap (U-B))}$

${\displaystyle =P(X\cap B)+[P(X)-P(X\cap B)]=P(x)}$

So we can say, for instance,

${\displaystyle P{\big (}(C\cap A)|B){\big )}={\dfrac {P(C\cap A\cap B)}{P(B)}}}$

${\displaystyle ={\dfrac {P{\big (}(C\cap B)\cap (A\cap B){\big )}}{P(B)}}}$

${\displaystyle ={\dfrac {{\big |}(C\cap B)\cap (A\cap B){\big |}}{|B|}}}$

Now, we can’t go further than this without making an independence assumption. But if we assume C and A are independent (in both B and ${\displaystyle U-B}$), we can simplify these terms.

To introduce the independence assumption heuristically, we will introduce a “bag full of balls” problem. Consider a bag with N balls in it, b black ones and w white ones. We are going to pick n balls out of it, one after another, and we want to know the chance ${\displaystyle p(k)}$ that k of them are black. The solution to this problem is known to be the hypergeometric distribution. Specifically,

${\displaystyle p(k)={\binom {b}{k}}{\binom {N-B}{n-k}}\diagup {\binom {N}{n}}}$

The mean of this distribution is:

${\displaystyle mean={\dfrac {bn}{N}}}$

How does this bag full of balls relate to our situation? We may say:

• Let our bag be the set B, so ${\displaystyle N=|B|}$
• Let the black balls in the bag correspond to the elements of the set ${\displaystyle A\cap B}$, so that ${\displaystyle b=|A\cap B|}$
• The white balls then correspond to elements of ${\displaystyle B-A}$
• The n balls we are picking are the elements of the set ${\displaystyle C\cap B}$, so ${\displaystyle n=|C\cap B|}$

This probabilistic “bag full of balls” model embodies the assumption that A and C are totally independent and uncorrelated, so that once B and A are fixed, the chance of a particular subset of size ${\displaystyle |C\cap B|}$ lying in ${\displaystyle A\cap B}$ is the same as the chance of that element of C lying in a randomly chosen set of size ${\displaystyle |A\cap B|}$. This yields the formula:

${\displaystyle {\dfrac {bn}{N}}={\dfrac {|A\cap B||C\cap B|}{|B|}}}$

which is an estimate of the size

${\displaystyle |C\cap B)\cap (A\cap B)|}$

So if we assume that A and C are independent inside B, we can say

${\displaystyle P{\big (}(C\cap A)|B{\big )}={\dfrac {{\big |}(C\cap B)\cap (A\cap B)}{|B|}}}$ ${\displaystyle ={\dfrac {|A\cap B||C\cap B|}{|B|^{2}}}=P(A|B)P(C|B)}$

Similarly, for the second term, by simply replacing B with U-B and then doing some algebra, we find

${\displaystyle P{\big (}(C\cap A)|(U-B){\big )}=P{\big (}A|(U-B)P(C|(U-B){\big )}}$

${\displaystyle ={\dfrac {|A\cap (U-B)||C\cap (U-B)|}{|U-B|^{2}}}}$

${\displaystyle ={\dfrac {[P(A)-P(A\cap B)][P(C)-P(A\cap B)]}{[1-P(B)]^{2}}}}$

So altogether, we find

${\displaystyle P(C\cap A)=P(B)P{\big (}(C\cap A)|B{\big )}+P(U-B)P{\big (}(C\cap A)|(U-B){\big )}}$ ${\displaystyle =P(B)P(A|B)P(C|B)+{\dfrac {[1-P(B)][P(A)-P(A\cap B)][P(C)-P(C\cap B)]}{[1-P(B)]^{2}}}}$

and hence

${\displaystyle P(C|A)={\dfrac {P(C\cap A)}{P(A)}}}$

${\displaystyle ={\dfrac {P(A|B)P(C|B)P(B)}{P(A)}}+{\dfrac {[1-P(A\cap B)][P(C)-P(C\cap B)]}{1-P(B)}}}$ ${\displaystyle =P(B|A)P(C|B)+{\dfrac {[1-P(B|A)][P(C)-P(C\cap B)]}{1-P(B)}}}$ ${\displaystyle =P(B|A)P(C|B)+{\dfrac {[1-P(B|A)][P(C)-P(B)P(C|B)]}{1-P(B)}}}$

Note that in the above we have used Bayes' rule to convert

${\displaystyle P(A|B)={\dfrac {P(B|A)P(A)}{P(B)}}}$

We now have the PLN deduction formula expressed in terms of conditional and absolute probabilities. In terms of our above-introduced notation for term and relationship strengths, we may translate this into:

Equation (2.1)

${\displaystyle S_{AC}=S_{AB}S_{BC}+{\dfrac {(1-S_{AB})(S_{C}-S_{C}S_{BC})}{1-S_{B}}}}$

which is the formula mentioned above.

#### Deduction and Second-Order Probability

We now give a formal proof of the independence-assumption-based PLN deduction formula. While the proof involves a lot of technical details that aren’t conceptually critical, there is one aspect of the proof that sheds some light on the more philosophical aspects of PLN theory: this is its use of second-order probabilities.

We first define what it means for a set of probabilities to be consistent with each other. Note that, given specific values for ${\displaystyle S_{A}=P(A)}$ and ${\displaystyle S_{B}=P(B)}$, not all values in the interval [0,1] for ${\displaystyle S_{(}AB)=P(B|A)}$ necessarily make sense. For example,

if ${\displaystyle P(A)=0.6=P(B)}$, then the minimum value for ${\displaystyle P(A\cap B)=0.2}$ so that the minimum value for ${\displaystyle (P(B|A)}$ is ${\displaystyle 0.2/0.6=1/3}$.

Definition:We say that the ordered triple ${\displaystyle (S_{A},S_{B},S_{AB})}$ of probability values ${\displaystyle S_{A}=P(A),S_{B}=P(B)}$, and ${\displaystyle S_{AB}=P(B|A)}$ is consistent if the probabilities satisfy the following condition:

${\displaystyle max{\Bigg (}0,{\dfrac {S_{A}+S_{B}-1}{S_{A}}}{\Bigg )}\leq S_{AB}\leq min{\Bigg (}1,{\dfrac {S_{B}}{S_{A}}}{\Bigg )}}$.

Definition: The ordered triple of subsets (A, B, C) for which the ordered triples(SA, SB, SAB)and(SB, SC, SBC) are both consistent we shall call fully consistent subset-triples.

We will prove:

Theorem 1 (PLN Deduction Formula)

Let U denote a set with |U| elements. Let Sub(m) denote the set of subsets of U containing m elements. Let (A, B, C) be a fully consistent subset-triple. Further suppose that each of the values SA, SB, SC, SAB, and SBC divides evenly into |U|. Next, define

${\displaystyle f(x)=P{\bigg [}P(C|A)=x|A\in Sub(|U|S_{A}),B\in Sub(|U|S_{B}),C\in Sub(|U|S_{C}),P(B|A)=S_{AB},P(C|B)=S_{BC}{\bigg ]}}$

Then, where E[ ] denotes the expected value (mean), we have

${\displaystyle E[f(x)]=S_{AC}=S_{AB}S_{BC}+{\dfrac {(1-S_{AB})(S_{C}-S_{B}S_{BC})}{1-S_{B}}}}$

This theorem looks at the space of all finite universal sets U (all “sample spaces”), and with each U it looks at all possible ways of selecting subsets A, B, and C out of U. It assumes that this size of U is given, and that certain absolute and conditional probabilities regarding A, B, and C are given. Namely, it assumes that P(A), P(B), P(C), P(B|A) and P(C|B) are given, but not P(C|A). For each U, it then looks at the average over all A, B, and C satisfying the given probabilities, and asks: If we average across all the pertinent (A, B, C) triples, what will P(C|A) come out to, on average? Clearly, P(C|A) may come out differently for different sets A, B, and C satisfying the assumed probabilities. But some values are more likely than others, and we’re looking for the mean of the distribution of P(C|A) values over the space of acceptable (A, B, C) triples. This is a bit different from the usual elementary-probability theorems in that it’s a second-order probability: we’re not looking at the probability of an event, but rather the mean of a probability over a certain set of sample spaces (sample spaces satisfying the initially given probabilistic relationships).

In spite of the abstractness induced by the use of second-order probabilities, the proof is not particularly difficult. Essentially, after one sets up the average over pertinent sample spaces and does some algebra, one arrives at the same sort of hypergeometric distribution problem that was used in the heuristic derivation in the main text. The difference, however, is that in this proof there is no ad-hoc independence assumption; rather, the independence comes out of the averaging process automatically because on average, approximate probabilistic independence between terms is the rule, not the exception.

Proof of Theorem 1 (PLN Deduction Formula)

The way the theorem is stated, we start with a set U of |U| elements, and we look at the set of all subset-triples {A, B, C} fulfilling the given constraints. That is, we are looking at subset-triples (A,B,C) for which the Predicate constr defined by

${\displaystyle constr(A,B,C)=A\in Sub(|U|S_{A})AND\quad B\in Sub(U|S_{B})AND\quad C\in Sub(|U|S_{C})AND\quad P(B|A)=S_{AB}AND\quad P(C|B)=S_{BC}}$

evaluates to True.

Over this set of subset-triples, we’re computing the average of P(C|A). That is, we’re computing

${\displaystyle E[f(X)]={\dfrac {1}{M}}\sum \nolimits _{(A,B,C):constr(A,B,C)}P(C|A)}$

Where M denotes the number of triples (A,B,C) so that constr(A,B,C). Following the lines of our heuristic derivation of the formula, we may split this into two sums as follows:

${\displaystyle E{\big [}f(x){\big ]}={\dfrac {1}{M}}\sum \nolimits _{(A,B,C):Constr(A,B,C)}{\dfrac {P(C\cap A)}{P(A)}}}$

${\displaystyle ={\dfrac {1}{M}}{\Bigg [}\sum \nolimits _{(A,B,C):Constr(A,B,C)}{\dfrac {P{\big (}(C\cap A)|B{\big )}P(B)}{P(A)}}+\sum \nolimits _{(A,B,C):Constr(A,B,C)}{\dfrac {P{\big (}(C\cap A)|(U-B){\big )}p(U-B)}{P(A)}}{\Bigg ]}}$

After going this far, the heuristic derivation then used probabilistic independence to split up ${\displaystyle P((C\cap A)|B)}$ and ${\displaystyle P(C\cap A)|(U-B)}$ into two simpler terms apiece. Following that, the rest of the heuristic derivation was a series of straightforward algebraic substitutions. Our task here will be to more rigorously justify the use of the independence assumption. Here we will not make an independence assumption; rather, the independence will be implicit in the algebra of the summations that are “summations over all possible sets consistent with the given constraints.” We will use formal methods analogous to the heuristic independence assumption, to reduce these sums into simple formulas consistent with the heuristic derivation.

We will discuss only the first sum here; the other one follows similarly by substituting ${\displaystyle U-B}$ for B. For the first sum we need to justify the following series of steps:

${\displaystyle ={\dfrac {1}{M}}\sum \nolimits _{(A,B,C):Constr(A,B,C)}{\dfrac {P{\big (}(C\cap A)|B{\big )}P(B)}{P(A)}}}$

${\displaystyle ={\dfrac {1}{M}}\sum \nolimits _{(A,B,C):Constr(A,B,C)}{\dfrac {P(C\cap A\cap B)}{P(A)}}}$

${\displaystyle ={\dfrac {1}{M}}\sum \nolimits _{(A,B,C):Constr(A,B,C)}{\dfrac {P{\big (}(C\cap B)\cap (A\cap B){\big )}}{P(A)}}}$

${\displaystyle ={\dfrac {1}{M}}\sum \nolimits _{(A,B,C):Constr(A,B,C)}{\dfrac {P(C\cap B)P(A\cap B)}{P(B)P(A)}}}$

${\displaystyle ={\dfrac {1}{M}}\sum \nolimits _{(A,B,C):Constr(A,B,C)}{\dfrac {P(C|B)P(A|B)P(B)}{P(A)}}}$

The final step is the elegant one. It follows because, over the space of all triples (A,B,C) so that constr(A,B,C) holds, the quantities ${\displaystyle P(C|B)}$ and ${\displaystyle P(B|A)}$ are constant by assumption. So they may be taken out of the summation, which has exactly M terms.

The difficult step to justify is the third one, where we transform ${\displaystyle P{\big (}(C\cap B)\cap (A\cap B){\big )}}$ into ${\displaystyle {\dfrac {P(C\cap B)P(A\cap B)}{P(B)}}}$.

This is where the algebra of the summations is used to give the effect of an independence assumption. To justify this third transformation, it suffices to show that

${\displaystyle \sum \nolimits _{(A,B,C):constr(A,B,C)}{\dfrac {P{\big (}(C\cap B)\cap (A\cap B){\big )}-{\dfrac {P(C\cap B)P(A\cap B)}{P(B)}}}{P(A)}}=0}$

We will do this by rewriting the sum as

${\displaystyle \sum \nolimits _{B\in Sub(|U|sB)}{\Bigg [}\sum \nolimits _{(A,C):constr(A,B,C)}{\bigg [}{\dfrac {P{\big (}(C\cap B)\cap (A\cap B){\big )}-{\dfrac {P(C\cap B)P(A\cap B)}{P(B)}}}{P(A)}}{\bigg ]}{\Bigg ]}=0}$

Note that the term ${\displaystyle P(A)}$ is constant for all (A,B,C) satisfying constr(A,B,C), so it may be taken out of the summation and effectively removed from consideration, yielding

${\displaystyle \sum \nolimits _{B\in Sub(|U|sB)}{\Bigg [}\sum \nolimits _{(A,C):constr(A,B,C)}{\bigg [}P{\big (}(C\cap B)\cap (A\cap B){\big )}-{\dfrac {P(C\cap B)P(A\cap B)}{P(B)}}{\bigg ]}{\Bigg ]}=0}$

We will show that this is true by showing that the inner summation itself is always zero; i.e.,that for a fixed B,

(Equation 2.2)

${\displaystyle \sum \nolimits _{(A,C):constr(A,B,C)}{\bigg [}P{\big (}(C\cap B)\cap CA\cap B){\big )}-{\dfrac {P(C\cap B)P(A\cap B)}{P(B)}}{\bigg ]}=0}$

In order to demonstrate Equation 2.2, we will now recast the indices of summation in a different-looking but equivalent form, changing the constraint to one that makes more sense in the context of a fixed B.

Given a fixed B', let’s say that a pair of sets (A1, C1) is B-relevant if it satisfies the relationships

${\displaystyle A1=A\cap B}$

${\displaystyle C1=C\cap B}$

for some triple (A,B,C) satisfying constr(A,B,C).

We now observe that the pair (A1, C1) is B-relevant if it satisfies the constraint predicate

constr1${\displaystyle (A1,C1;B)=A1\subseteq B\ {\text{AND}}\ C1\subseteq B\ {\text{AND}}\ |A1|=|A|{\text{s}}_{AB}\ {\text{AND}}\ |C1|=|B|{\text{s}}_{BC}}$

The constraint for ${\displaystyle |C1|}$ comes from the term in the former constraint constr stating

${\displaystyle P(C|B)=S_{BC}}$

For, we may reason

${\displaystyle P(C|B)={\dfrac {P(C\cap B)}{P(B)}}={\dfrac {P(C1)}{P(B)}}=S_{BC}}$

${\displaystyle P(C1)=P(B)S_{BC}}$

${\displaystyle |C1|=|B|S_{BC}}$

Similarly, to get the constraint for |A1|, we observe that

${\displaystyle P(B|A)=dfrac{P(A\cap B)}{P(A)}=dfrac{P(A1)}{P(A)}=S_{AB}}$

so that

${\displaystyle |A|=|A|S_{AB}}$

Given a fixed B, and a specific B-relevant pair (A1,C1), let EQ(A1, C1;B) denote the set of pairs (A,C) for which constr(A,B,C)and

${\displaystyle A1=A\cap B}$

${\displaystyle C1=C\cap B}$

Now we will recast Equation 2.2 above in terms of A1 and C1. Equation 2.2 is equivalent to

${\displaystyle \sum \nolimits _{(A1,C1):constr1(A1,C1;B)}\sum \nolimits _{(A,C)\in SubEQ(A1,C1;B)}{\bigg [}P(A1\cap C1)-{\dfrac {P(A1)P(C1)}{P(B)}}{\bigg ]}=0}$

Because the inner sum is the same for each pair ${\displaystyle (A,C)\in SubEQ(A1,C1;B)}$, and because ${\displaystyle K\equiv |EQ(A1,C1)|}$ is the same for each B-relevant pair (A1,C1), we can rewrite this as

${\displaystyle \sum \nolimits _{(A1,C1):constr1(A1,C1;B)}{\Bigg [}K{\bigg [}P(A1\cap C1)-{\dfrac {P(A1)P(C1)}{P(B)}}{\bigg ]}{\Bigg ]}=0}$

or just

${\displaystyle \sum \nolimits _{(A1,C1):constr1(A1,C1;B)}{\bigg [}P(A1\cap C1)-{\dfrac {P(A1)P(C1)}{P(B)}}{\bigg ]}=0}$

We now have a somewhat simpler mathematics problem. We have a finite set B, with two subsets A1 and C1 of known sizes. Other than their sizes, nothing about A1 and C1 is known. We need to sum a certain quantity

${\displaystyle Q\equiv P(A1\cap C1)-{\dfrac {P(A1)P(C1)}{P(B)}}}$

over all possibilities for A1 and C1 with the given fixed sizes. We want to show this sum comes out to zero. This is equivalent to showing that the average of Q is zero, over all A1 and C1 with the given fixed sizes.

Now, the second term of Q is constant with respect to averaging over pairs (A1,C1), because

${\displaystyle {\dfrac {P(A1)P(C1)}{P(B)}}={\dfrac {|A1||C1|}{|B||U|}}}$

which is independent of what the sets A1 and C1 are, assuming they have fixed sizes. So the average of the second term is simply${\displaystyle {\dfrac {|A1||C1|}{|B||U|}}}$. We will rewrite Equation 2.3 as

${\displaystyle L\equiv {\dfrac {1}{M1}}\sum \nolimits _{A1\subseteq B:|A1|=|A|S_{AB}}{\bigg [}{\dfrac {1}{M2}}\sum \nolimits _{C1:constr1(A1,C1;B)}P(A1\cap C1){\bigg ]}={\dfrac {|A||C1|}{|B||U|}}}$..(2.4)

where

• M1 is the number of A1’s that serve as part of a B-relevant pair (A1,C1) with any C1; i.e., the number of terms in the outer sum;
• M2 is the number of C1’s that serve as part of a B-relevant pair (A1,C1) for some specific A1; i.e., the number of terms in the inner sum.

Note that M2 is independent of the particular set A1 under consideration; and that ${\displaystyle M=M1M2}$.

To show (2.4), it suffices to show that for a fixed A1,

${\displaystyle {\dfrac {1}{M2}}\sum \nolimits _{C1:constr1(A1,C1;B)}P(A1\cap C1)={\dfrac {|A1||C1|}{|B||U|}}}$... (2.5)

To see why this suffices observe that, by the definition in (2.4), if (2.5) held, we’d have

${\displaystyle L\equiv {\dfrac {1}{M1}}\sum \nolimits _{A1\subseteq B:|A1|=|A|S_{AB}}{\dfrac {|A1||C1|}{|B||U|}}}$

But the expression inside the sum is constant for all A1 being summed over (because they all have ${\displaystyle |A1|=|A|S_{AB}}$ and the number of terms in the sum is M1, so that on the assumption of (2.4) we obtain the result

${\displaystyle L\equiv {\dfrac {|A1||C1|}{|B||U|}}}$

which is what (2.3) states.

So, our task now is to show (2.5). Toward this end we will use an equivalent form of (2.4); namely

${\displaystyle {\dfrac {1}{M2}}\sum \nolimits _{C1:constr1(A1,C1;B)}|A1\cap C1|={\dfrac {|A1||C1|}{|B|}}}$....... (2.6)

the equivalence follows from ${\displaystyle P(A1\cap C1)=|A1\cap C1|/|U|}$. To show (2.6) we can use some standard probability theory, similar to the independence-assumption-based step in the heuristic derivation. We will model the left-hand side of (2.6) as a “bag full of balls” problem.Consider a bag with I balls in it, I black ones and w white ones. We are going to pick n balls out of it, one after another, and we want to know the chance ${\displaystyle p(k)}$ that k of them are black. The solution to this problem is known to be the hypergeometric distribution, as given above, with mean ${\displaystyle bn/N}$.

How does this bag full of balls relate to (2.6)? Simply:

• Let our bag be the set B, so ${\displaystyle N=|B|}$.
• Let the black balls in the bag correspond to the elements of the set A1, so that ${\displaystyle b=|A1|}$
• The white balls then correspond to B minus the elements in A1.
• The n balls we are picking are the elements of the set C1, so ${\displaystyle n=|C1|}$.

This yields the formula:

${\displaystyle {\dfrac {bn}{N}}={\dfrac {|A1||C|}{|B|}}}$

What the mean of the hypergeometric distribution gives us is the average of ${\displaystyle |A1\cap C1|}$ over all I1 with the given size constraint, for a fixed A1 with the given size constraint.

But what Equation 5 states is precisely that this mean is equal to ${\displaystyle {\dfrac {|A1||C1|}{|B|}}}$. So, going back to the start of the proof, we have successfully shown that

${\displaystyle \sum \nolimits _{(A,B,C):constr(A,B,C)}{\dfrac {P{\big (}(C\cap A)|B{\big )}}{P(B)}}=P(C|B)P(B|A)}$

It follows similarly that

${\displaystyle \sum \nolimits _{(A,B,C)constr(A,B,C)}{\dfrac {P{\big (}(C\cap A)|(U-B){\big )}}{P(B)}}=P{\big (}C|(U-B){\big )}P{\big (}(U-B)|A)}$

The algebraic transformations made in the heuristic derivation then show that

${\displaystyle E[f(X)]={\dfrac {1}{M}}\sum \nolimits _{\{A,B,C\}:constr(A,B,C)}P(C|A)}$ ${\displaystyle =P(C|B)P(B|A)+{\dfrac {P{\big (}C|(U-B){\big )}P{\big (}(U-B)|A{\big )}}{P(U-B)}}}$ ${\displaystyle =P(B|A)P(C|B)+{\dfrac {{\big (}1-P(B|A){\big )}{\big (}P(C)-P(B)P(C|B){\big )}}{1-P(B)}}}$

Which, after a change from P to s notation, is precisely the formula given in the theorem. Thus the proof is complete. QED

Next, to shed some light on the behavior of this formula, we now supply graphical plots for several different input values. These plots were produced in the Maple software package. Each plot will be preceded by the Maple code used to generate it. To make the Maple code clearer we will set Maple inputs in bold text; e.g.,