RelEx OpenCog format

The RelEx OpenCog output format is used to represent RelEx parse information as OpenCog hypergraphs. It consists of the following basic sections:

  1. Sentences grouped into documents
  2. Parses grouped into sentences
  3. Word Instances grouped into parses
  4. Word lemmas associated with word instances
  5. WordNodes associated with word instances
  6. Link-grammar linkage information between word instances
  7. RelEx dependency relations between word instances
  8. Anchor representing new input

Usage

This output format has been in use by the OpenCog NLP subsystem since March 2008, and a large variety of code, written in Java, Perl, C++ and Scheme, assumes this particular format. Please do not invent a new format, and please avoid modifying this one if at all possible. Note that not all of the code that uses this format is in the OpenCog source; some is in the RelEx package, and some is in the LextAt package.

There are two primary ways of generating this output:

  • The RelEx src/perl/cff-to-opencog.pl Perl script. This is a batch-processing script: it takes previously parsed text, stored in the CFF format, and turns it into the format documented here. This is ideal for bulk processing of things like Wikipedia pages, since the high cost of parsing is paid just once, and converting from CFF to OpenCog is very fast.

There are two examples of how this format is pulled into OpenCog:

Note that while new MindAgents could use the OpenCog hypergraphs described here directly, it probably makes more sense to run input text through the triples, seme and relex-to-frame processing code first, in order to get slightly more abstract and manageable hypergraphs. On the other hand, as of May 2009, the triples, seme and relex-to-frame code is under construction and only partially functioning. Caveat emptor.

Components

The representation of parsed text in OpenCog introduces a number of new Node and Link types. In principle, new node and link types are not really needed; however, by introducing these, it becomes a lot easier to traverse the hypergraph of a parsed sentence, and find the needed/desired information. In addition, the hypergraph representing a parse becomes much smaller.

Most links are given a SimpleTruthValue with a strength of 1.0 and a confidence of 1.0. The exception is the parse ranking: the ParseNode carries a SimpleTruthValue with a strength of 1.0 but a smaller confidence, as assigned by a simple parse-ranking algorithm.

A fully parsed sentence, "Humans have two feet", is given at the bottom, with examples taken from that parse.

SentenceNode, ParseNode, ParseLink

A SentenceNode serves as an anchor for parses associated with a particular sentence. There is only one SentenceNode per sentence. For example:

(SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")

The name of the sentence node is a unique string that identifies this sentence; it has no other meaning. Currently, it has the form sentence@UUID, where the UUID is a 128-bit MD5 hash. The UUID is this large so that collisions between tags stay improbable even when very many items are tagged (the birthday paradox), and it is printed in ASCII to make it human-readable (and grep-able, etc.).

A ParseNode serves as an anchor for word instances associated with a particular parse. The ParseNode has a SimpleTruthValue associated with it that provides the parse ranking for that parse. It is expressed as a numerical value for the confidence of that parse. For example:

(ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0" (stv 1.0 0.9417))

Note that there may be multiple parses for any given sentence. The parse-node name is numbered, from most-likely to least-likely parse; this is purely for debugging convenience, and code should not rely on it.

A ParseLink connects parses to sentences. For example:

(ParseLink
   (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0" (stv 1.0 0.9417))
   (SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")
)
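Since the numbering in the parse-node name should not be relied upon, a consumer can rank parses by the confidence stored in the truth value instead. The following is a minimal Python sketch of that idea, assuming the output is available as plain text in the layout shown above. The helper name best_parse is hypothetical, and the parse_1 entry with its confidence is invented purely for illustration.

```python
import re

# Matches a ParseNode that carries a truth value: (ParseNode "name" (stv s c))
PARSE_NODE = re.compile(
    r'\(ParseNode\s+"([^"]+)"\s+\(stv\s+([0-9.]+)\s+([0-9.]+)\)\s*\)'
)

def best_parse(scheme_text: str) -> str:
    """Return the name of the ParseNode with the highest confidence."""
    ranked = [
        (float(confidence), name)
        for name, _strength, confidence in PARSE_NODE.findall(scheme_text)
    ]
    if not ranked:
        raise ValueError("no ParseNode with a truth value found")
    return max(ranked)[1]

SENTENCE = "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d"
# The parse_1 entry and its confidence are made up for this illustration.
SAMPLE = f'''
(ParseLink
   (ParseNode "{SENTENCE}_parse_1" (stv 1.0 0.6200))
   (SentenceNode "{SENTENCE}")
)
(ParseLink
   (ParseNode "{SENTENCE}_parse_0" (stv 1.0 0.9417))
   (SentenceNode "{SENTENCE}")
)
'''

print(best_parse(SAMPLE))
```

The sketch compares confidences numerically and never inspects the _parse_N suffix, so it keeps working if the naming convention changes.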

DocumentNode, SentenceLink, SentenceSequenceLink

Documents consisting of multiple sentences will be indicated by means of a DocumentNode, together with a SentenceLink. For example:

(SentenceLink
   (SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")
   (DocumentNode "document@308a2446-046c-4f00-bf8d-7e4d2f256875")
)

The above indicates that the given sentence is a part of the document. Sentence order is indicated by SentenceSequenceLinks:

(SentenceSequenceLink (stv 1 1)
   (SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")
   (NumberNode "42")
)

It might someday be useful to group sentences together into paragraphs, or documents into collections, or to indicate embedded media (tables, graphics, footnotes, margin notes, sounds, movies, etc.). At this time, however, no such markup is defined. See the file seme/README for notes on how input sentences are tagged with the name of a speaker during IRC chatbot sessions, and/or other text sources.

WordInstanceNode, WordInstanceLink, WordSequenceLink

Word instances are unique, individual instances of words occurring in a given parse. The WordInstanceNode is used to represent these. For example:

(WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")

These are created with unique names because the same word may occur multiple times within one sentence, or one document, and one must be able to tell them apart. Word instances are tagged with feature data, such as tense, number and part-of-speech. Because different parses may assign different feature data to the same word, each word instance is associated with one particular parse.

The format of the name is word@UUID. The word is there to make it human-readable, and thus easier for manual debugging. The UUID is there to make sure that every word instance is uniquely identified. The UUID is large in order to avoid the collisions predicted by the birthday paradox when tagging many items with unique labels. Although the UUID can be anything, in practice, a 128-bit MD5 hash is used. It is spelled out in ASCII to make it human-readable, and thus easier to debug.
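The birthday-paradox argument can be made concrete with the usual approximation p ≈ n(n-1) / (2 · 2^bits) for the chance that at least two of n random tags collide. The Python sketch below is only a back-of-the-envelope check; the function name collision_probability and the item counts are assumptions chosen for illustration, and the approximation is only meaningful while the result is much smaller than 1.

```python
from fractions import Fraction

def collision_probability(n_items: int, bits: int = 128) -> float:
    """Birthday-bound approximation: n*(n-1) / (2 * 2**bits).

    Only meaningful while the result is much smaller than 1.
    """
    return float(Fraction(n_items * (n_items - 1), 2 * 2**bits))

# A billion tagged items with 128-bit tags ...
p128 = collision_probability(10**9, bits=128)
# ... versus a mere ten thousand items with 32-bit tags.
p32 = collision_probability(10**4, bits=32)

print(f"128-bit tags, 1e9 items: {p128:.3e}")
print(f" 32-bit tags, 1e4 items: {p32:.3e}")
```

With 128 bits the estimate stays negligible even for a billion items, whereas a 32-bit tag already gives a collision chance of about one percent at ten thousand items.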

Word order is indicated with WordSequenceLinks:

(WordSequenceLink (stv 1.0 1.0) 
   (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
   (NumberNode "42")
)
(WordSequenceLink (stv 1.0 1.0)
   (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
   (NumberNode "43")
)
(WordSequenceLink (stv 1.0 1.0)
   (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
   (NumberNode "44")
)
(WordSequenceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (NumberNode "45")
)

To verify word-order in the pattern matcher for word-order-dependent tasks, use the GreaterThanLink.

There is no guarantee that the issued numberings are sequential over all time; instead, they come from a counter that is restarted whenever the RelEx server is restarted. If long-term, large-scale sequential ordering is needed, a different mechanism should be invented.
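Because only the relative order of the numbers is meaningful, a consumer working outside the pattern matcher can recover word order by sorting on the NumberNode values. The following Python sketch assumes the links are available as text in the layout shown above; the helper name sequence_order is hypothetical.

```python
import re

def sequence_order(scheme_text: str, link_type: str, node_type: str) -> list[str]:
    """Return node names sorted by the NumberNode each one is paired with."""
    pattern = re.compile(
        r"\(" + re.escape(link_type) + r"\s*(?:\(stv[^()]*\))?\s*"
        r"\(" + re.escape(node_type) + r' "([^"]+)"\)\s*'
        r'\(NumberNode "(\d+)"\)\s*\)'
    )
    # Convert to int so that "100" sorts after "99".
    pairs = [(int(number), name) for name, number in pattern.findall(scheme_text)]
    return [name for _number, name in sorted(pairs)]

# Two of the links from above, deliberately listed out of order.
SAMPLE = '''
(WordSequenceLink (stv 1.0 1.0)
   (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
   (NumberNode "43")
)
(WordSequenceLink (stv 1.0 1.0)
   (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
   (NumberNode "42")
)
'''

ordered = sequence_order(SAMPLE, "WordSequenceLink", "WordInstanceNode")
for name in ordered:
    print(name)
```

The same helper applies to SentenceSequenceLinks by passing "SentenceSequenceLink" and "SentenceNode" instead.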

Word instances are anchored to parses by means of the WordInstanceLink. Given a ParseNode, it becomes very easy to find all word instances associated with that parse. For example:

(WordInstanceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")
)
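To illustrate how the WordInstanceLink makes it easy to go from a parse to its words, here is a small Python sketch that reads the textual output into nested lists and groups word instances by ParseNode. It is a toy reader for this page's layout, not the AtomSpace API; the names read_atoms and words_by_parse are hypothetical.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()"]+')

def read_atoms(scheme_text):
    """Parse s-expressions into nested lists; quoted names lose their quotes."""
    code = "\n".join(
        line for line in scheme_text.splitlines()
        if not line.lstrip().startswith(";")   # drop comment lines
    )
    stack = [[]]
    for token in TOKEN.findall(code):
        if token == "(":
            stack.append([])
        elif token == ")":
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(token.strip('"'))
    return stack[0]

def words_by_parse(atoms):
    """Map each ParseNode name to the word instances anchored to it."""
    index = defaultdict(list)
    for atom in atoms:
        if atom[0] != "WordInstanceLink":
            continue
        # Skip the optional (stv ...) child; the rest is word, then parse.
        word, parse = [child for child in atom[1:] if child[0] != "stv"]
        index[parse[1]].append(word[1])
    return dict(index)

PARSE = "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0"
SAMPLE = f'''
(WordInstanceLink (stv 1.0 1.0)
   (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
   (ParseNode "{PARSE}")
)
(WordInstanceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (ParseNode "{PARSE}")
)
'''

index = words_by_parse(read_atoms(SAMPLE))
print(index[PARSE])
```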

WordNode, LemmaLink

The WordNode is used to indicate a word. The name of this node is the word itself. For example:

(WordNode "feet")

Note that the name of this atom is literally the word.

Given a word instance, it is useful to know what the underlying word is. This is done by using a reference link. For example:

(ReferenceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (WordNode "feet")
)

Although, in principle, it might have been possible to get the name of the WordInstanceNode, look for the @ sign in it, and take the word to be everything before the @ sign, in practice, this is not reasonable. In particular, it is convenient to refer to WordNodes directly in ImplicationLinks, thus WordNodes are needed.

In the above, a SimpleTruthValue was added to indicate that it is true that this is the word associated with this word-instance, and that we are completely sure of it.

In addition to knowing the word, it can often be important to know the word lemma. This is accomplished with the LemmaLink. The lemma of feet is foot, and so, for example:

(LemmaLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (WordNode "foot")
)

Note that there may be thousands or millions of ReferenceLinks and LemmaLinks to a given WordNode. These may be followed to find every sentence a word appears in.
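The lookups described in this section can be sketched in Python as two forward maps (instance to word, instance to lemma) and one reverse map (word to instances), built from the links themselves rather than by splitting the instance name at the @ sign. This is an illustrative sketch over the textual output only; index_words and the variable names are hypothetical.

```python
import re
from collections import defaultdict

LINK = re.compile(
    r'\((ReferenceLink|LemmaLink)\s*\(stv[^()]*\)\s*'
    r'\(WordInstanceNode "([^"]+)"\)\s*'
    r'\(WordNode "([^"]+)"\)\s*\)'
)

def index_words(scheme_text):
    """Build instance->word, instance->lemma and word->instances maps."""
    surface, lemma = {}, {}
    instances_of = defaultdict(set)
    for kind, instance, word in LINK.findall(scheme_text):
        target = surface if kind == "ReferenceLink" else lemma
        target[instance] = word
        instances_of[word].add(instance)
    return surface, lemma, dict(instances_of)

FEET = "feet@bf71826c-487e-42df-a941-0ecd3c942a76"
SAMPLE = f'''
(ReferenceLink (stv 1.0 1.0)
   (WordInstanceNode "{FEET}")
   (WordNode "feet")
)
(LemmaLink (stv 1.0 1.0)
   (WordInstanceNode "{FEET}")
   (WordNode "foot")
)
'''

surface, lemma, instances_of = index_words(SAMPLE)
print(surface[FEET], lemma[FEET], sorted(instances_of["foot"]))
```

The reverse map plays the role of the incoming set of a WordNode: following it from "foot" leads back to every word instance, and from there to every parse and sentence, in which the word occurs.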

LinkGrammarRelationshipNode

Link grammar links are represented with an EvaluationLink predicate. The name of the link is given in the name of the LinkGrammarRelationshipNode. For example:

(EvaluationLink (stv 1.0 1.0)
   (LinkGrammarRelationshipNode "Op")
   (ListLink
      (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
      (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   )
)

Note that this links together word-instances; this is required, as different parses will have different link-grammar linkages. The new, unique node type LinkGrammarRelationshipNode is used, so as to make it easy to distinguish these links from other EvaluationLinks.

PartOfSpeechLink

Word-instance features are attached with an InheritanceLink to a DefinedLinguisticConceptNode, with the exception of part-of-speech, which uses the PartOfSpeechLink. For example:

; pos (foot, noun)
(PartOfSpeechLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (DefinedLinguisticConceptNode "noun")
)

and

; noun_number (foot, plural)
(InheritanceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (DefinedLinguisticConceptNode "plural")
)

These are given SimpleTruthValues of 1,1 to indicate that, for this given parse, the feature assignment is true, and completely confident.
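A consumer that wants the features of a word instance has to treat the two link types differently. The Python sketch below collects part-of-speech from PartOfSpeechLinks and all other tags from InheritanceLinks; it works on the textual output only, and the name collect_features is hypothetical.

```python
import re
from collections import defaultdict

FEATURE = re.compile(
    r'\((InheritanceLink|PartOfSpeechLink)\s*\(stv[^()]*\)\s*'
    r'\(WordInstanceNode "([^"]+)"\)\s*'
    r'\(DefinedLinguisticConceptNode "([^"]+)"\)\s*\)'
)

def collect_features(scheme_text):
    """Return (instance -> part of speech, instance -> list of other tags)."""
    pos = {}
    tags = defaultdict(list)
    for link, instance, concept in FEATURE.findall(scheme_text):
        if link == "PartOfSpeechLink":
            pos[instance] = concept
        else:
            tags[instance].append(concept)
    return pos, dict(tags)

FEET = "feet@bf71826c-487e-42df-a941-0ecd3c942a76"
SAMPLE = f'''
; pos (foot, noun)
(PartOfSpeechLink (stv 1.0 1.0)
   (WordInstanceNode "{FEET}")
   (DefinedLinguisticConceptNode "noun")
)
; noun_number (foot, plural)
(InheritanceLink (stv 1.0 1.0)
   (WordInstanceNode "{FEET}")
   (DefinedLinguisticConceptNode "plural")
)
'''

pos, tags = collect_features(SAMPLE)
print(pos[FEET], tags[FEET])
```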

An alternative to using DefinedLinguisticConceptNodes in the above is to create NounNumberLinks, VerbTenseLinks, etc. At this time, there does not appear to be any pressing need to use such an alternate format.

DefinedLinguisticConceptNode

Represents linguistic concepts. At the moment, these are generated by RelEx and R2L.

; For speech act
(InheritanceLink
   (InterpretationNode "sentence@0e644cf6-cb75-4a98-b15d-e937424ffaa1_parse_0_interpretation_$X")
   (DefinedLinguisticConceptNode "DeclarativeSpeechAct")
)

; For tense
(InheritanceLink
   (PredicateNode "is@af4555f6-905c-49e6-8388-ff66048e7c10")
   (DefinedLinguisticConceptNode "present")
)

DefinedLinguisticRelationshipNode

RelEx dependency relations use EvaluationLink predicates, with the predicate type of DefinedLinguisticRelationshipNode. For example:

; _%quantity (foot, two) 
(EvaluationLink (stv 1.0 1.0)
   (DefinedLinguisticRelationshipNode "_%quantity")
   (ListLink
      (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
      (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
   )
)

Again, a simple truth value of 1,1 is used to indicate that, for this given parse, we are completely certain that this relationship occurs.
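Link-grammar links and RelEx relations share the same EvaluationLink shape and differ only in the type of the predicate node, which is what makes them easy to tell apart. The following Python sketch extracts both kinds from the textual output and labels each by its predicate type; the name relations and the labels "link-grammar" and "relex" are hypothetical.

```python
import re

RELATION = re.compile(
    r'\(EvaluationLink\s*\(stv[^()]*\)\s*'
    r'\((LinkGrammarRelationshipNode|DefinedLinguisticRelationshipNode)'
    r' "([^"]+)"\)\s*'
    r'\(ListLink\s*'
    r'\(WordInstanceNode "([^"]+)"\)\s*'
    r'\(WordInstanceNode "([^"]+)"\)\s*\)\s*\)'
)

def relations(scheme_text):
    """Return (kind, relation, first, second) tuples in order of appearance."""
    return [
        ("link-grammar" if node.startswith("LinkGrammar") else "relex",
         name, first, second)
        for node, name, first, second in RELATION.findall(scheme_text)
    ]

HAVE = "have@0f223d17-31a6-49fe-9d37-350d50c53926"
FEET = "feet@bf71826c-487e-42df-a941-0ecd3c942a76"
TWO = "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3"
SAMPLE = f'''
(EvaluationLink (stv 1.0 1.0)
   (LinkGrammarRelationshipNode "Op")
   (ListLink
      (WordInstanceNode "{HAVE}")
      (WordInstanceNode "{FEET}")
   )
)
; _%quantity (foot, two)
(EvaluationLink (stv 1.0 1.0)
   (DefinedLinguisticRelationshipNode "_%quantity")
   (ListLink
      (WordInstanceNode "{FEET}")
      (WordInstanceNode "{TWO}")
   )
)
'''

found = relations(SAMPLE)
for kind, name, first, second in found:
    print(f"{kind}: {name}({first}, {second})")
```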

AnchorNode

The AnchorNode is used to tell OpenCog MindAgents that this given sentence has just been input, for the very first time, into the AtomSpace. The AnchorNode is a very simple mechanism to put things in the AtomSpace where other processes can find them. So, if you have "SomeAtoms" and don't want to lose track of them, you create the following:

(ListLink
   (AnchorNode "my-stuff")
   SomeAtoms
)

Then, as long as you can remember that your stuff is called "my-stuff", you just have to look at the incoming set of the AnchorNode to find the ListLink, which in turn links to your stuff. In the case of RelEx, the link is:

(ListLink
   (AnchorNode "# New Parsed Sentence")
   (SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")
)

Thus, mind agents in charge of dealing with recently input text can scan for links to this particular AnchorNode (it will always have the name "# New Parsed Sentence"). These mind agents can do as they wish with this link; typically, once processing is complete, the ListLink is deleted, freeing the sentence from this anchor. The output of a mind agent is usually attached to some other AnchorNode, so as to pass processing on to other mind agents.

Note that code in the opencog/nlp/chatbot and opencog/nlp/triples directories expects this link to the anchor to be present; without it, that code will fail to run.
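The hand-off pattern described above (find what hangs off an anchor, process it, release it, attach it to the next anchor) can be modelled in a few lines of Python. This is a toy in-memory stand-in, not the AtomSpace API: the class AnchorBoard, the function process_new_sentences and the anchor name "# Processed Sentence" are all hypothetical.

```python
from collections import defaultdict

NEW_SENTENCE = "# New Parsed Sentence"

class AnchorBoard:
    """Toy stand-in for AnchorNode + ListLink bookkeeping."""

    def __init__(self):
        self._links = defaultdict(list)   # anchor name -> attached atom names

    def attach(self, anchor, atom):
        """Model creating (ListLink (AnchorNode anchor) atom)."""
        self._links[anchor].append(atom)

    def incoming(self, anchor):
        """Model walking the incoming set of the anchor; returns a copy."""
        return list(self._links[anchor])

    def release(self, anchor, atom):
        """Model deleting the ListLink that ties atom to anchor."""
        self._links[anchor].remove(atom)

def process_new_sentences(board, handler, next_anchor):
    """Handle each newly parsed sentence once, then pass it to next_anchor."""
    for sentence in board.incoming(NEW_SENTENCE):
        handler(sentence)
        board.release(NEW_SENTENCE, sentence)
        board.attach(next_anchor, sentence)

board = AnchorBoard()
board.attach(NEW_SENTENCE, "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")

seen = []
process_new_sentences(board, seen.append, "# Processed Sentence")
print(seen, board.incoming(NEW_SENTENCE))
```

Releasing the sentence from the first anchor is what keeps a second scan from processing it again.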

Example output

Below follows the output of the parse of "Humans have two feet", in full gory detail, as generated by RelEx version 0.99.0, as of May 2009.

(ReferenceLink (stv 1.0 1.0)
   (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
   (WordNode "humans")
)
(WordInstanceLink (stv 1.0 1.0)
   (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
   (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")
)
(ReferenceLink (stv 1.0 1.0)
   (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
   (WordNode "have")
)
(WordInstanceLink (stv 1.0 1.0)
   (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
   (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")
)
(ReferenceLink (stv 1.0 1.0)
   (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
   (WordNode "two")
)
(WordInstanceLink (stv 1.0 1.0)
   (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
   (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")
)
(ReferenceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (WordNode "feet")
)
(WordInstanceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")
)
(EvaluationLink (stv 1.0 1.0)
   (LinkGrammarRelationshipNode "Op")
   (ListLink
      (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
      (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   )
)
(EvaluationLink (stv 1.0 1.0)
   (LinkGrammarRelationshipNode "Dmc")
   (ListLink
      (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
      (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   )
)
(EvaluationLink (stv 1.0 1.0)
   (LinkGrammarRelationshipNode "Sp")
   (ListLink
      (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
      (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
   )
)
(EvaluationLink (stv 1.0 1.0)
   (LinkGrammarRelationshipNode "Wd")
   (ListLink
      (WordInstanceNode "LEFT-WALL")
      (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
   )
)
(ParseLink
   (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0" (stv 1.0 0.9417))
   (SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")
)
(LemmaLink (stv 1.0 1.0)
   (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
   (WordNode "humans")
)
(LemmaLink (stv 1.0 1.0)
   (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
   (WordNode "have")
)
(LemmaLink (stv 1.0 1.0)
   (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
   (WordNode "two")
)
(LemmaLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (WordNode "foot")
)
; noun_number (humans, plural)
(InheritanceLink (stv 1.0 1.0)
   (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
   (DefinedLinguisticConceptNode "plural")
)
; inflection-TAG (humans, .n)
(InheritanceLink (stv 1.0 1.0)
   (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
   (DefinedLinguisticConceptNode ".n")
)
; pos (humans, noun)
(PartOfSpeechLink (stv 1.0 1.0)
   (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
   (DefinedLinguisticConceptNode "noun")
)
; pos (two, adj)
(PartOfSpeechLink (stv 1.0 1.0)
   (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
   (DefinedLinguisticConceptNode "adj")
)
; _%quantity (<<foot>>, <<two>>) 
(EvaluationLink (stv 1.0 1.0)
   (DefinedLinguisticRelationshipNode "_%quantity")
   (ListLink
      (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
      (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
   )
)
; noun_number (foot, plural)
(InheritanceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (DefinedLinguisticConceptNode "plural")
)
; inflection-TAG (foot, .n)
(InheritanceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (DefinedLinguisticConceptNode ".n")
)
; pos (foot, noun)
(PartOfSpeechLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (DefinedLinguisticConceptNode "noun")
)
; _subj (<<have>>, <<humans>>) 
(EvaluationLink (stv 1.0 1.0)
   (DefinedLinguisticRelationshipNode "_subj")
   (ListLink
      (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
      (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
   )
)
; _obj (<<have>>, <<foot>>) 
(EvaluationLink (stv 1.0 1.0)
   (DefinedLinguisticRelationshipNode "_obj")
   (ListLink
      (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
      (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   )
)
; tense (have, present)
(InheritanceLink (stv 1.0 1.0)
   (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
   (DefinedLinguisticConceptNode "present")
)
; inflection-TAG (have, .v)
(InheritanceLink (stv 1.0 1.0)
   (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
   (DefinedLinguisticConceptNode ".v")
)
; pos (have, verb)
(PartOfSpeechLink (stv 1.0 1.0)
   (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
   (DefinedLinguisticConceptNode "verb")
)

(ListLink
   (AnchorNode "# New Parsed Sentence")
   (SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")
)
; END OF SENTENCE