RelEx Dependency Relationship Extractor

From OpenCog

RelEx, a narrow-AI component of OpenCog, is an English-language semantic dependency relationship extractor, built on the Carnegie Mellon Link Grammar parser. It uses a series of graph rewriting rules to identify subject, object, indirect object and many other syntactic dependency relationships between words in a sentence. That is, it generates the dependency trees of a dependency grammar. The set of dependency relations it employs resembles those of Dekang Lin's MiniPar and the Stanford parser (and it has an explicit compatibility mode). It is inspired in part by the ideas of Hudson's Word Grammar.

Unlike other dependency parsers, RelEx attempts a greater degree of semantic normalization for questions, comparatives, entities, and prepositional relationships, whereas other parsers (such as the Stanford parser) stick to a literal presentation of the syntactic structure of text. For example, RelEx pays special attention to determining when a sentence is hypothetical or speculative, and to isolating the query variables from a question. Both of these aspects are intended to make RelEx well-suited for question-answering and semantic comprehension/reasoning systems. In addition, RelEx makes use of feature tagging, to tag words with part-of-speech, noun-number, verb-tense, gender, etc. As of this writing, RelEx parses text nearly four times faster than the Stanford parser, and it provides a "compatibility mode" in which it can generate the same relations as the Stanford parser.

  • RelEx also includes a basic implementation of the Hobbs anaphora (pronoun) resolution algorithm.
  • Optionally, it can use GATE for entity detection. Note: obsolete; removed from version 1.5.0, last available in version 1.4.2. GATE was less accurate than Link Grammar and made results worse, not better, so it was removed.
  • RelEx also provides semantic relationship framing, similar to that of FrameNet. Note: obsolete; removed from version 1.5.0, last available in version 1.4.2. It "worked", but the results were of low accuracy and not good enough for practical use, so it was removed.
  • The SegSim and NLGen2 projects aim to reverse the flow: to generate natural-language output, based on dependency-grammar-like input.
  • The RelEx2Logic output format is suitable for logical reasoning. It is new and under development.

RelEx is a part of OpenCog, an open-source artificial general intelligence project. See also OpenCog technical information.

Overview

Perhaps the easiest way to get a flavor of RelEx is to show its output. Below is a parse of two sentences: "Alice looked at the cover of Shonen Jump. She decided to buy it." Parts of this output will be familiar to users of the Link Grammar parser. The second part is generated by RelEx, and provides specific feature markup for single words and the dependency relationships between the words in the sentence. Notice, for example, that it is the cover that is being looked at, and that the subject doing the looking is Alice. This example has part-of-speech tagging suppressed, so as not to clutter the output; verb-tense and noun-number tagging, however, are enabled. Finally, at the end, notice the listing of antecedent candidates: "She" refers to "Alice", and "it" refers either to "Shonen Jump" or to "cover". This output is generated by an implementation of the Hobbs algorithm for pronoun (anaphora) resolution.

Alice looked at the cover of Shonen Jump.

====

Parse 1 of 1

====


(S (NP Alice) (VP looked (PP at (NP (NP the cover) (PP of (NP Shonen Jump))))) .)


Parse 1 of 1


    +--------------------------Xp--------------------------+
    |                         +---Js---+     +----Js----+  |
    +---Wd---+----Ss---+--MVp-+  +--Ds-+--Mp-+    +--G--+  |
    |        |         |      |  |     |     |    |     |  |
LEFT-WALL Alice.f looked.v-d at the cover.n of Shonen Jump . 


======

at(look, cover)
_subj(look, Alice)
tense(look, past)
definite-FLAG(Shonen, T)
noun_number(Shonen, singular)
entity-FLAG(Shonen_Jump, T)
definite-FLAG(Shonen_Jump, T)
noun_number(Shonen_Jump, singular)
of(cover, Shonen_Jump)
definite-FLAG(cover, T)
noun_number(cover, singular)
gender(Alice, feminine)
definite-FLAG(Alice, T)
person-FLAG(Alice, T)
noun_number(Alice, singular)

======
She decided to buy it.

====

Parse 1 of 1

====

(S (NP She) (VP decided (S (VP to (VP buy (NP it))))) .)

    +-----------------Xp----------------+
    +--Wd--+---Ss--+---TO---+--I-+-Ox-+ |
    |      |       |        |    |    | |
LEFT-WALL she decided.v-d to.r buy.v it .

======

_to-do(decide, buy)
_subj(decide, she)
tense(decide, past)
pronoun-FLAG(it, T)
gender(it, neuter)
definite-FLAG(it, T)
_obj(buy, it)
tense(buy, infinitive)
HYP(buy, T)
pronoun-FLAG(she, T)
gender(she, feminine)
definite-FLAG(she, T)
noun_number(she, singular)

======


Antecedent candidates:
_ante_candidate(it, cover) {0}
_ante_candidate(it, Shonen_Jump) {1}
_ante_candidate(she, Alice ) {0}
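
The relation and feature lines in the output above all share the shape name(first, second). As a rough illustration of how a downstream program might read them, here is a minimal Java sketch. The class RelationLineParser and its methods are hypothetical helpers invented for this example; they are not part of the RelEx code base, and the regular expression only covers the line shapes shown in the sample output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the RelEx code base.
final class RelationLineParser {

    /** One parsed line of the form name(first, second). */
    static final class Relation {
        final String name;
        final String first;
        final String second;

        Relation(String name, String first, String second) {
            this.name = name;
            this.first = first;
            this.second = second;
        }

        @Override
        public String toString() {
            return name + "(" + first + ", " + second + ")";
        }
    }

    // name ( first , second ) with optional spaces. Trailing text, such as
    // the "{0}" after an antecedent candidate, is ignored.
    private static final Pattern LINE = Pattern.compile(
            "\\s*([\\w-]+)\\(\\s*([^,()]+?)\\s*,\\s*([^,()]+?)\\s*\\)");

    /** Parses one line; returns empty for lines that are not relations. */
    static Optional<Relation> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.lookingAt()) {
            return Optional.empty();
        }
        return Optional.of(new Relation(m.group(1), m.group(2), m.group(3)));
    }

    /** Parses every relation line in a block of text, skipping the rest. */
    static List<Relation> parseAll(String text) {
        List<Relation> relations = new ArrayList<>();
        for (String line : text.split("\\R")) {
            parse(line).ifPresent(relations::add);
        }
        return relations;
    }

    public static void main(String[] args) {
        String sample = "_subj(look, Alice)\n"
                + "tense(look, past)\n"
                + "======\n"
                + "_ante_candidate(she, Alice ) {0}\n";
        for (Relation r : parseAll(sample)) {
            System.out.println(r);
        }
    }
}
```

Separator lines, the constituent tree and the link diagram do not match the pattern, so parseAll simply skips them.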

RelEx Based Language Generation

There are two systems for language generation based on RelEx. The overall idea is named SegSim, and is implemented in two systems: NLGen and NLGen2. The latter is described in greater detail in the paper by Blake Lemoine, NLGen2: A Linguistically Plausible, General Purpose Natural Language Generation System (2009).

Source Code and Development

Source code, development coordination, and bug reporting for RelEx are available at the RelEx Launchpad Site. Developers should join discussions at #opencog on irc.freenode.net.

All RelEx discussion should take place on the Link Grammar mailing list.

Source code is written in the Java programming language. The code is released under the Apache v2 license.

The RelExIdeas page discusses possible future projects.

A parsed version of Wikipedia is available here. The parsed text is in the RelEx compact format.

RelEx Installation Procedure

See RelEx Install for details.

Documentation, Algorithm Overview

RelEx borrows some algorithmic ideas from constraint grammars, but applies them in a more abstract setting. Each incoming sentence is represented as a graph, with the words of the sentence representing vertices in the graph. Edges, carrying labels, are used to represent the features of the words and the structure of the sentence. Initially, the graph is merely a list of words, with edges (labeled "left" and "right") used to indicate the sequence of the words. Parsing is then performed using the Link Grammar parser; the output of the parse is a set of (labeled) edges indicating the syntactic relationships between words.
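
The initial word-sequence graph described above can be sketched in a few lines of Java. This is only an illustration of the idea: the classes WordGraphSketch and Vertex are hypothetical and do not mirror RelEx's actual feature-graph classes. For brevity the sketch also assumes at most one outgoing edge per label on each vertex.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical classes, not RelEx's own data structures.
final class WordGraphSketch {

    /** A vertex with labeled outgoing edges (one target per label). */
    static final class Vertex {
        final String name;
        final Map<String, Vertex> edges = new LinkedHashMap<>();

        Vertex(String name) {
            this.name = name;
        }

        void link(String label, Vertex target) {
            edges.put(label, target);
        }

        /** Returns the vertex reached by the given label, or null if none. */
        Vertex follow(String label) {
            return edges.get(label);
        }
    }

    /** Builds the initial graph: one vertex per word, chained by "left" and "right" edges. */
    static List<Vertex> fromWords(String... words) {
        List<Vertex> vertices = new ArrayList<>();
        Vertex previous = null;
        for (String word : words) {
            Vertex current = new Vertex(word);
            if (previous != null) {
                previous.link("right", current);
                current.link("left", previous);
            }
            vertices.add(current);
            previous = current;
        }
        return vertices;
    }

    public static void main(String[] args) {
        List<Vertex> sentence = fromWords("Alice", "looked", "at", "the", "cover");

        // A parse adds further labeled edges; "Ss" is the link shown between
        // "Alice" and "looked" in the sample diagram earlier on this page.
        sentence.get(0).link("Ss", sentence.get(1));

        // Walk the word sequence by following "right" edges.
        for (Vertex v = sentence.get(0); v != null; v = v.follow("right")) {
            System.out.println(v.name + " " + v.edges.keySet());
        }
    }
}
```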

From this point on, the main rule engine in RelEx takes over. The engine applies a set of pattern-matching rules to the graph; if the predicate part of the rule matches, then the graph is transformed according to the implicand of the rule. Thus, for example, one rule states that if there is an edge labeled "SFI" (obtained from link grammar), then the word on the left is to be tagged as a verb. The tagging is done by adding an edge called "POS", to a vertex called "verb". After each step, the graph (usually) becomes richer, and more adorned with feature labels and relationship structures, although some rules can also prune the graph. This process, of applying a sequence of rules, resembles the process used in constraint grammars; yet it differs from constraint grammars in that it operates on a graphical representation, rather than simple sets of tags. This last difference allows RelEx to apply progressively more abstract transformations in analyzing text. The general idea of performing pattern recognition and using it to transform (hyper-)graphs is one of the central concepts within OpenCog; this is why RelEx is a part of OpenCog. The page on sentence algorithms provides a more detailed description of the operation of the rule engine.
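
To make the "SFI" example concrete, the following Java sketch applies a single rule of that kind to a tiny graph. It is a simplified illustration, not the RelEx rule engine: the classes Graph, Edge and Rule are hypothetical, the graph is stored as a flat list of labeled edges, and the sketch assumes that a link edge is stored with the left-hand word as its source. The sample edge between "is" and "there" is made up for the demonstration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: a toy rule engine, not the one used by RelEx.
final class RuleEngineSketch {

    /** A labeled, directed edge: source --label--> target. */
    static final class Edge {
        final String source;
        final String label;
        final String target;

        Edge(String source, String label, String target) {
            this.source = source;
            this.label = label;
            this.target = target;
        }
    }

    /** A graph kept as a flat edge list; vertices are identified by name. */
    static final class Graph {
        final List<Edge> edges = new ArrayList<>();

        boolean has(String source, String label, String target) {
            for (Edge e : edges) {
                if (e.source.equals(source) && e.label.equals(label)
                        && e.target.equals(target)) {
                    return true;
                }
            }
            return false;
        }

        /** Adds an edge unless an identical one is already present. */
        void add(String source, String label, String target) {
            if (!has(source, label, target)) {
                edges.add(new Edge(source, label, target));
            }
        }
    }

    /** A rule inspects the graph and transforms it where its pattern matches. */
    interface Rule {
        void apply(Graph graph);
    }

    /** The rule from the text: the left word of an "SFI" link is tagged as a verb. */
    static final Rule SFI_MARKS_VERB = graph -> {
        // Iterate over a copy, because the rule adds edges while scanning.
        for (Edge e : new ArrayList<Edge>(graph.edges)) {
            if (e.label.equals("SFI")) {
                graph.add(e.source, "POS", "verb");
            }
        }
    };

    /** Applies the rules once each, in order. */
    static void run(Graph graph, List<Rule> rules) {
        for (Rule rule : rules) {
            rule.apply(graph);
        }
    }

    public static void main(String[] args) {
        Graph graph = new Graph();
        graph.add("is", "SFI", "there"); // made-up sample link edge
        run(graph, Collections.singletonList(SFI_MARKS_VERB));
        for (Edge e : graph.edges) {
            System.out.println(e.source + " --" + e.label + "--> " + e.target);
        }
    }
}
```

Because each rule only adds (or, in other rules, removes) edges on a shared graph, later rules can match on the output of earlier ones, which is what lets the transformations become progressively more abstract.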

The current graph-transformation rules are hand-generated (i.e. designed by linguists). A focus of research interest is to learn and refine these rules automatically by means of corpus statistics. The LexAt project takes baby steps in this direction. An alternate direction is being taken via OpenCog, using feedback from deductive reasoning to refine parsing. Some early, positive results indicate that very high-speed word-sense disambiguation (WSD) is possible; see blog entry for details.

Relations and Features

The relation and feature markup generated by RelEx resembles that of other projects that generate dependency trees, including Dekang Lin's MiniPar, and the Stanford parser.

The above relations are encoded in the main RelEx 'feature graph', as a set of labeled edges and vertices. There are a variety of different output modules which traverse this graph, and generate different styles of output. It is presumed that users of RelEx would create their own custom output module tailored to suit their needs. This can easily be done by copying an existing output module and adapting it.

Other documentation

Documentation for other output formats, etc.

Papers:

History, Authors

The original version of RelEx was created by Mike Ross in 2005, under the direction of Ben Goertzel. As of 2008, it is maintained by Linas Vepstas. The following folks have made significant contributions (in historical order):

  • Murilo Queiroz
  • Borislav Iordanov
  • Evgenii Philippov
  • Fabricio Silva
  • Blake Lemoine
  • Rui P. Costa
  • Jim White
  • Rodas Solomon
  • Lian Ruiting
  • Jand Hashemi
  • Alex van der Peet

Performance profiling

YourKit is kindly supporting open source projects with its full-featured Java Profiler. YourKit, LLC is creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.

YJP was used to identify performance bottlenecks in RelEx. One example is morphological analysis performance, which was improved using a cache.