Sentence algorithms

From OpenCog
Jump to: navigation, search

RelEx was created as an intermediate step in mapping English surface forms to semantic representations.

RelEx outputs simplified representations of the grammatical structure of input sentences. Some surface details of grammatical structure are normalized. For example, many equivalent verb-frame argument assignments are mapped to identical or graphically homomorphic representations. This has the effect of providing some degree of semantic normalization.

Semantic normalization

Here's an example of semantic normalization.

First Sentence
Equivalent Sentence
RelEx Output
Mary ate the cake. The cake was eaten by Mary. eat(subj:Mary, obj:cake)

eat(subj:Mary, obj:cake)

Mary gave John the book. Mary gave the book to John. give(subj:Mary, obj:John, obj2:book)

give(subj:Mary, to: John, obj: book)

Operation Overview

  1. Execute the Link Parser and convert the Link Parser output to a feature structure representation
  2. Execute a series of Sentence Algorithms which modify the feature structure.
  3. Extract the final output representation by traversing the feature structure.

The real meat of RelEx lies in its Sentence Algorithms, which are instances of subclasses of the abstract class SentenceAlgorithm. Some of these algorithm instances , require input in order to instantiate themselves. Others do not need any input. Input for the algorithms, as well as their order of operation is defined in a text Algorithm Definition File, identified by the runtime variable RelEx.algpath. To understand the operation of the algorithms, one must first understand RelEx's feature structure representation and its relationship to the Link Parser's link representation.

Link Parser Representation

This document assumes that the reader understands the output of the CMU link parser, which is documented on the link parser's website.

Feature Structure Representation

RelEx converts the output of the link parser into a feature structure. A feature structure is a directed graph in which each node contains either a value, or an unordered list of features. In RelEx, the value of a feature is always a string, though it may be interpreted as another type, such as an int. A feature is just a labeled link to another node.

Step 1: Converting Link Parser Output to Feature Structure Representation

Below is a representation of the link parse structure for the sentence “This is a test.”

    +------------Xp------------+
    |              +--Ost--+   |
    +---Wd---+-Ss*b+  +-Ds-+   |
    |        |     |  |    |   |
LEFT-WALL this.p is.v a test.n . 

The sentence is converted to a feature structure which is used as the input of RelEx. Each word in the sentence is represented by a node with several features:

  • NEXT [node]: points to the node representing the following word (if there is one)
  • PREV[node]: points to the node representing the previous word (if there is one)
  • this [node]: points to itself, a convenience for certain sentence algorithms
  • wall [node]: points to the node representing the LEFT-WALL (see Link Parser documentation)
  • index_in_sentence [int]: the index of the word in the sentence. The LEFT-WALL has index 0, and the first word of the sentence has index 1.
  • start_char [int]: the character index of the first character in the word (from the original sentence, starting with 0 for the first character of the first word)
  • end_char [int]: the character index following the last character in the word (from the original sentence, thus end_char - start_char = word length)
  • str [string]: the string value of the word. This is the same as would be returned by taking the original sentence as a string and calling sentence.substring(start_char, end_char)
  • POS [string]: a string value representing the part of speech of the word.
  • num_left_links [int]: the number of links going left from this word
  • num_right_links [int]: the number of links going right from this word
  • linkL0 [node]: points to a node representing a link going left from this word
  • linkL(1,2,3,...) [node]: points to a node representing another link going left from this word
  • linkR0 [node]: points to a node representing a link going right from this word
  • linkR(1,2,3,...) [node]: points to a node representing another link going right from this word

Similarly, each link in the sentence parse is represented by a node with several features.

  • LAB [string]: the link label
  • F_L[node]: the word on the left side of the link
  • F_R[node]: the word on the right side of the link

Relex-f1.png

$6[NEXT $4[str <<is>>
           linkL0 $8[LAB <<Ss*b>>
                     F_L $6
                     F_R $4]
           PREV $6]
  str <<This>>
  LinkR0 $8]

The figure shows a subset of nodes and features from RelEx's representation of the sample sentence. Here, the first two words of the sentence, connected by a Ss*b link are show. One may think of the node labeled $6 as representing the first word of the sentence. The word's “str” feature has the value “This”, and its “NEXT” feature points to a node representing the following word.

Next is RelEx's awkward but accurate notation for its feature structures (triggered by verbose output command line option -v). Every node is represented by a set of square brackets, containing its list of feature value pairs, optionally preceded by a dollar-sign-prefixed ID number. The square brackets and feature value pairs only appear the first time a node is referenced. After that, the ID is used alone. Indentation is used to assist in readability, so that all of a node's features are vertically aligned. String values are enclosed in double-angle brackets.

Below, we see the entire feature structure notation for the sample input sentence, “This is a test.”

$1[NEXT $6[NEXT $4[NEXT $3[NEXT $0[NEXT $2[POS <<punctuation>>
                                           PREV $0
                                           end_char <<15>>
                                           index_in_sentence <<5>>
                                           linkL0 $9[LAB <<Xp>>
                                                     F_L $1
                                                     F_R $2]
                                           num_left_links <<1>>
                                           start_char <<14>>
                                           str <<.>>
                                           this $2
                                           wall $1]
                                   POS <<noun>>
                                   PREV $3
                                   end_char <<14>>
                                   index_in_sentence <<4>>
                                   linkL0 $7[LAB <<Ost>>
                                             F_L $4
                                             F_R $0]
                                   linkL1 $5[LAB <<Ds>>
                                             F_L $3
                                             F_R $0]
                                   num_left_links <<2>>
                                   start_char <<10>>
                                   str <<test>>
                                   this $0
                                   wall $1]
                           POS <<det>>
                           PREV $4
                           end_char <<9>>
                           index_in_sentence <<3>>
                           linkR0 $5
                           num_right_links <<1>>
                           start_char <<8>>
                           str <<a>>
                           this $3
                           wall $1]
                   POS <<verb>>
                   PREV $6
                   end_char <<7>>
                   index_in_sentence <<2>>
                   linkL0 $8[LAB <<Ss*b>>
                             F_L $6
                             F_R $4]
                   linkR0 $7
                   num_left_links <<1>>
                   num_right_links <<1>>
                   start_char <<5>>
                   str <<is>>
                   this $4
                   wall $1]
           POS <<noun>>
           PREV $1
           end_char <<4>>
           index_in_sentence <<1>>
           linkL0 $10[LAB <<Wd>>
                      F_L $1
                      F_R $6]
           linkR0 $8
           num_left_links <<1>>
           num_right_links <<1>>
           start_char <<0>>
           str <<this>>
           this $6
           wall $1]
   POS <<WORD>>
   end_char <<-1>>
   index_in_sentence <<0>>
   linkR0 $9
   linkR1 $10
   num_right_links <<2>>
   start_char <<-1>>
   str <<LEFT-WALL>>
   this $1
   wall $1]


This notation can be a bit difficult to read. But the wise algorithm writer will invest some time in learning to navigate its treacherous terrain: an absolutely necessity for debugging problematic sentence algorithms. The notation can be made more readable for a particular task by limiting which features are actually printed. This can be done by instantiating a FeatureNameFilter, which specifies a subset of features to be printed, and their order. Instances of this class are passed to the toString method of FeatureNode.

Step 2: SentenceAlgorithm Execution

The SentenceAlgorithmApplier loads a list of SentenceAlgorithms from the Algorithm Definition File. The SentenceAlgorithms are executed in the order they are listed in the file. For each SentenceAlgorithm, RelEx iterates through every single feature node in the feature structure, and attempts to apply the algorithm to each node. SentenceAlgorithms will only execute if their canApplyTo(FeatureNode) method returns true. SentenceAlgorithms modify the feature structure, so their order of operation is important. For example, an algorithm which only operates on nodes with a POS feature wont work if a previous algorithm has deleted that feature from all nodes in the feature structure.

All sentence algorithms append a signature name to the value of the SIG feature for any node on which they are successfully applied. This is useful for debugging, because one can tell which algorithms have been applied to which nodes. Often a problem with a new algorithm is that it applies itself in an unexpected place having consequences that interfere with other algorithms.

The most important feature added by the sentence algorithms is the ref feature, which points to an entity represented (or referred to) by a word. In the final stage of operation, RelEx uses the ref features to generate its final output representation.

Step 3: Extracting Final Representation from the Feature Structure Representation

The node representing the left-wall (see Link Parser documentation) is conveniently also used to collect pointers to the nodes representing RelEx's final representation. The head feature points to the target(s) of the ref feature from the main clause(s) of the sentence and all subclauses. The background feature points to other relationships, such as those derived from adjective and adverb modification.

Thus, the features used to generate the final representation may be thought of as a subset of the entire feature structure. This can be printed using the ParsedSentence.printZHeads() method, which creates a FeatureNameFilter to isolate the correct subset of the feature structure. Another technique is used by RelExInfoAnnotator.computeParse(). This method traverses the FeatureNodes representing the words in the sentence. For each one, it follows the ref link, then follows subsequent links to build up a dependency relation representation.

The features printed by ParsedSentence.printZHeads, or followed by RelExInfoAnnotator.computeParse include:

From FeatureNodes representing words:

  • ref: points to feature node representing the word's referent.

From FeatureNodes representing referents (The full set of these are documented on the page word properties)

  • name: points to a string value for the name of the refent.
  • tense: if the referent is a verb/event, points to a string value representing a tense
  • HYP: if the referent is a verb/event, points to a string with value “T” iff the event is hypothetical.
  • TRUTH-QUERY-FLAG: if the referent is a verb/event, points to a string with value “T” iff the event a question (i.e., 'eat' in “Did John eat the cake?”).
  • COPULA-QUERY-FLAG: Points to the string “T” for particular entities involved in particular forms of copula questions (i.e., 'John' in “Who is john?”).
  • noun_number: if the referent is a noun/thing, points to a string value representing a noun number
  • links: points to a a feature node, with features representing the dependency relations in which this referent is the first argument.
  • memberN: if the referent represents a group of things, it will contain only memberN features where N is an integer, and the memberN feature points to the Nth member in the group.

From FeatureNodes representing a collection of links (the full set of these are documented on the page binary relations:

  • _subj: points to the second argument of a _subj relation. This is a specific example of the _X case below.
  • _X: points to the second argument of a _X relation, where X is a relation whose name is not directly derived from a word in the sentence (i.e., a syntactic relation).
  • X: points to the second argument of a X relation, where X is a relation whose name is derived directly from a word in the sentence (i.e., a preposition).

Example: “Mike knows that John thinks daily.”

Feature Node notation of Final Representation (from printZHeads method):

[head [name <<know>>
       tense <<present>>
       links [_subj [name <<Mike>>
                     DEFINITE-FLAG <<T>>
                     noun_number <<singular>>
                     pos <<noun>>]
              that $0[name <<think>>
                      tense <<present>>
                      links [_subj [name <<John>>
                                    DEFINITE-FLAG <<T>>
                                    noun_number <<singular>>
                                    pos <<noun>>]]
                      HYP <<T>>
                      pos <<verb>>]]
       pos <<verb>>]
 background [name <<daily>>
             links [_advmod $0]
             pos <<adv>>]]

Dependency Relation notation of Final Representation (parts of speech not shown):

DEFINITE-FLAG(John, T)
noun_number(John, singular)
_subj(think, John)
tense(think, present)
HYP(think, T)
_advmod(daily, think)
DEFINITE-FLAG(Mike, T)
noun_number(Mike, singular)
_subj(know, Mike)
that(know, think)
tense(know, present)

Frame notation of Final Representation:

know(subj: Mike, that: think)
think(subj: John)
daily(subj-r: think)

Algorithm Definition File Format

The Algorithm Definition File is a text file which describes the order in which the sentence algorithms should run. For each sentence algorithm, it lists the algorithm's java class name, preceeded by a pound sign (#). The class is looked up using Java refelection. All subsequent text until the next pound sign is passed to the class instance for instantiation. Lines beginning with semi-colans are ignored. For example:

;a comment line
#Algorithm1WhichRequiresNoInputText
;another comment line which is ignored
#Algorithm2WhichRequiresInputText
Input Text for Algorithm2WhichRequiresInputText 
;another comment line which is ignored
More Input Text for Algorithm2WhichRequiresInputText 
...

A Detailed Example: TemplateActionAlg

The most commonly used and versatile sentence algorithm in RelEx is the TemplateActionAlg, which searches for a substructure in the feature node network, and alters it. A TemplateActionAlg instance accepts text input. The first line is a custom signature name for the instance (see discussion in section Operation 2). Following this are a set of template paths and action (output) paths, separated by an equals sign (=).

SIGNATURE_NAME
template_path1
template_path2...
template_pathN
=
action_path1
action_path2 ...
action_pathN

Paths

A path is a list of feature names, enclosed in angle brackets, which denotes a path through the feature structure. Template and action paths used in TemplateActionAlg are followed by an operator and a target value. If a point in the feature structure matches the template path, the action path is used to alter the feature structure. The example below shows a simple TemplateActionAlg which matches the feature structure at node $8. The action path adds two nodes to the feature structure.

Template Action
<F_R str> = is

(Matches pink nodes, starting at $8)

<F_R ref name> = be

(Adds pink nodes)

Relex-f2.png Relex-f3.png

Instances of TemplateActionAlg work the same way as all SentenceAlgorithms: by exhaustively searching the entire sentence feature structure for nodes where canApplyTo() returns true. In their case, canApplyTo() returns true iff a feature node can match all of the template paths for the algorithm instance. In other words, the set of template paths are conjoined. Note above that the node, $8, is the only node which matches the template path, <F_R str>=is. The target value for the template is the string “is.”

Special Path Operators and Target Values

As described so far, a TemplateActionAlg, must have very specific templates, and modify the feature structure purely by adding to it. However, real TemplateActionAlgs have a richer functionality, defined by various special operator and target value symbols.

Template Operators

=
Matches iff the feature node value is equal to the path target
!=
Matches iff the feature node value is NOT equal to the path target

Action Operators

=
Sets the feature node's value to the path target
+=
Appends the path target to the feature node's value

Template Path Targets

Default: the target value is interpreted as a normal string, x
%
The target value is interpreted as nil. Thus the path cannot exist in the feature structure.
<x y z>
(path)
The target value is interpreted as the value found by following the path, <x y z>
$var
The target value must exist, but it can be anything. It is stored in a variable named var in the java code.
|
Boolean OR: separates multiple possible matches, any of which is valid.
\
Introduces a regular-expression syntax (see next two items below). Normal regular expression syntax can be used with the exception of the two characters: | (OR) and . (period). Periods must be preceded by a slash, and have a slightly different meaning from normal regular expression periods (see next entry). All regular expressions are read between | OR characters, so these cannot be part of the regular expression itself.
\.
Matches any lowercase letter or asterix (in regular-expression-like syntax). This is accomplished in the java code by substituting the substring [a-z*]
\\
Cancels the regular expression syntax, and is substituted in the java code as a single \

Action Path Targets

Default: the target value is interpretted as a normal string, x
%
The target value is set to nil.
<x y z>
(path)
The target value is set to the value found by following the path, <x y z>

TemplateActionAlg Examples

The reader is advised to draw sample feature structures for the examples below as an exercise

Example 1

#TemplateActionAlg
POS_NOUN
<LAB> = \S\.*|\SX\.*|\SF\.*|\AN\.*|\GN\.*|\YS\.* |\YP\.*
<F_L POS> = WORD
=
<F_L POS> = noun

Template:

<LAB> = \S\.*|\SX\.*|\SF\.*|\AN\.*|\GN\.*|\YS\.* |\YP\.*:

Since “LAB” is a feature of feature nodes representing link parser links, this algorithm is designed to match feature nodes representing links of type S, SX, SF, AN, GN, YS, or YP. Notice that each link type is followed by “\.*” indicating any number of lowercase or non-letter characters. Thus, the algorithm ignores the link subtype (the subtype is indicated by lowercase and non-letter characters in a link's name).

<F_L POS> = WORD

The algorithm only matches if the word to the link's left has not yet been assigned a POS.

Action:

<F_L POS> = noun

This assigns a POS of “noun” to the link's left word. The motivation for this algorithm is revealed by a a quick check of the link parser documentation for link types S, SX, SF, AN, GN, YS, and YP. All of these links take nouns on their left.

Example 2

#TemplateActionAlg
SPECIAL-ADJ
; Used for "easy to read." 
; The B links back to the adjective. But it should really be
; interpretted as linking back to the subject of the adjective
<LAB> = \B\.*|\BW\.*
<F_L POS> = adj
=
<F_R obj> = %
<F_R obj> = <F_L subj>
<F_L ADJ-OBJ-FLAG> = T

Template:

<LAB> = \B\.*|\BW\.*

This algorithm is designed to match feature nodes representing link parser B or BW links. Any type of B or BW link is allowed, since they can be followed by an arbitrary number of lower-case or non-letter characters (“\.*”). Note that “\.” matches only lower case characters, so we must boolean options for both B and BW.

<F_L POS> = adj

The link must also have a word on its left which is an adjective.

Action:

<F_R obj> = %

The algorithm will clear the value of the obj feature of the link's right target.

<F_R obj> = <F_L subj>

It then sets this value to be the same as the subj feature's value in the link's right target.

<F_L ADJ-OBJ-FLAG> = T

Finally, it sets a feature named ADJ-OBJ-FLAG to value T in the link's left target.

In the phrase “The book is easy to read” this has the effect of declaring that the object of “read” is the same as the subject of “easy” (which was previously unified with the subject of “is”). Thus the thing being read is a book. The ADJ-OBJ-FLAG feature is used as a marker-flag on the feature structure to control the execution of subsequent algorithms (a common technique in the Algorithm Definition File).

A Few Selected Other Algorithms

MorphyAlg

MorphyAlg uses WordNet (via external calls), plus some additional java code to find the root form of words. It is entirely defined by its Java code, and takes no additional input. Hence its format in the Algorithm Definition File is simply a single line:

#MorphyAlg

WordSequenceCombineAlg

WordSequenceCombineAlg is a sister class of TemplateActionAlg (they are both subclasses of the abstract TemplateMatchingAlg). It uses the exact same template syntax, but uses hard-coded Java code to execute its action, which would have been difficult to describe with normal action path syntax.

#WordSequenceCombineAlg
PROPER_NAME_AND_ENTITIES_AND_IDIOMS
<LAB> = \G\.*|\ID[A-Z]*[a-z]*
<F_R> = $right
=

The algorithm applies to G links and links whose main type begins with ID. These links connect words in a proper name or idiom. It sets the variable “$right” to point to the feature node on the link's right. A look at the Java code shows that the algorithm sets a number of features in the right-most word for a sequence of words belonging to the same proper name or idiom. Most importantly, the referent's name feature for the final word in the sequence is set to a long string made up of the entire sequence (with underscores substituted for spaces). Thus, a setentence containing “Bill Murray” will have the following hold for the word “Murray”: <ref name> = Bill_Murray.

Increasing Coverage: Guidelines

When attempting to increase RelEx's coverage for a particular sentence structure, there are a number of questions one should ask.

  • What are the important links that define this sentence structure?
  • Are there already sentence algorithms designed to handle these links?
  • If so, why are they failing?
    • Is there a bug in the algorithms?
    • Is this a bug in the link parser (producing the wrong link)?
    • Is this an irregular use of the relevant links? That is, are there other sentences which use the same link type in an incompatible way?
    • Is there a way to handle the situation using existing SentenceAlgorithm classes? Often a set of TemplateActionAlgs which set “FLAG” features to communicate can be a very effective way of executing a complicated action using multiple SentenceAlgorithms operation in sequence (see “SPECIAL-ADJ” example above).

When adding new algorithms, one must be careful not to accidentally reduce RelEx's coverage for other sentences. Ideally, the programmer should maintain a large file of test cases, which can be checked automatically. In addition, the writer should also come up with several new sentences which are similar to the target sentence in various ways, to make sure they are handled correctly.