Sentence representation
The OpenCog natural language subsystem represents natural language sentences using Atomese. This page documents the format currently generated by LgParseLink and LgParseMinimal. An earlier version of this page is at Sentence representation (2018 Archive); it documents the now-obsolete RelEx specifics.
This page documents an Atomese representation of the parse information generated by the Link Grammar parser. The Link Grammar parser is a generic parser, capable of parsing both natural language and abstract data streams, to obtain structural information. It has hand-written dictionaries for English, Russian, Farsi, Thai, and a dozen other languages. It is also being used as a part of an automated structure (syntax, grammar, language) learning system. This latter use is the primary driver of the contents of this wiki page. For older docs, see Sentence representation (2018 Archive).
The format described here is intentionally backwards-compatible with the earlier RelEx system. As a result, it is a bit clunky, awkward and hard to use. The LgParseDisjuncts, LgParseSections and LgParseBonds parsers provide a newer and simpler interface to certain subsets of the parse data (namely, only the disjuncts, Sections or Bonds, respectively).
Overview
The representation consists of the following basic sections:
- Sentences consisting of multiple parses.
- Word Instances appearing in a specific parse.
- WordNodes associated with word instances.
- Link-grammar linkage information between word instances.
The output described below is generated by the LgParseLink Atom, which is a wrapper around the Link Grammar parser. Specific subsets of the total parse graph are provided by LgParseMinimal, LgParseDisjuncts and LgParseSections.
SentenceNode, ParseNode, ParseLink
A SentenceNode serves as an anchor for parses associated with a particular sentence. There is only one SentenceNode per sentence. For example:
(SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")
The name of the sentence node is a string that uniquely identifies this sentence. It has no other meaning. Currently, it has the form sentence@UUID, where the UUID is a 128-bit MD5 hash. Such a large UUID is used to make accidental collisions (the birthday paradox) vanishingly unlikely when tagging items, and it is printed in ASCII to make it human-readable (and grep-able, etc.).
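The birthday-paradox argument can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the tags behave like uniformly random 128-bit values; the function name and the workload size are invented for illustration and are not part of OpenCog.

```python
from fractions import Fraction


def collision_probability_bound(n_items: int, bits: int = 128) -> Fraction:
    """Union bound on the chance that any two of n_items tags coincide.

    Assumes each tag is an independent, uniformly random value of the
    given width; the bound is (number of pairs) / 2**bits.
    """
    pairs = n_items * (n_items - 1) // 2  # n*(n-1) is always even
    return Fraction(pairs, 2 ** bits)


# Hypothetical workload: one trillion tagged sentences and word instances.
bound_128 = collision_probability_bound(10 ** 12)
# The same workload with 32-bit tags, for comparison.
bound_32 = collision_probability_bound(10 ** 12, bits=32)

print(float(bound_128))   # tiny: far below one in a trillion
print(bound_32 > 1)       # the bound is vacuous, so collisions are expected
```

With 128 bits the bound stays negligible even for very large corpora, whereas short counters or 32-bit tags would collide almost immediately at this scale.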
A ParseNode serves as an anchor for word instances associated with a particular parse. The ParseNode has a SimpleTruthValue associated with it that provides the parse ranking for that parse. It is expressed as a numerical value for the confidence of that parse. For example:
(ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0" (stv 1.0 0.9417))
Note that there may be multiple parses for any given sentence. The parse-node names are numbered from the most likely to the least likely parse; this is purely a debugging convenience, and code should not rely on it.
A ParseLink connects parses to sentences. For example:
(ParseLink
  (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0" (stv 1.0 0.9417))
  (SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")
)
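Because the numbering in the parse name should not be relied on, code that wants the best parse has to go through the ParseLinks and compare truth values. Here is a minimal sketch in plain Python, with tuples and dictionaries standing in for atoms rather than real AtomSpace calls. The second parse and its confidence are hypothetical, added only to make the example non-trivial.

```python
SENTENCE = "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d"
PARSE_0 = SENTENCE + "_parse_0"
PARSE_1 = SENTENCE + "_parse_1"   # hypothetical second parse

# One (parse_name, sentence_name) pair per ParseLink. The order is
# deliberately not "best first", to show that it does not matter.
parse_links = [
    (PARSE_1, SENTENCE),
    (PARSE_0, SENTENCE),
]

# (strength, confidence) of each ParseNode's SimpleTruthValue.
truth_values = {
    PARSE_0: (1.0, 0.9417),
    PARSE_1: (1.0, 0.5),          # hypothetical value
}


def best_parse(sentence, links, tvs):
    """Return the parse of `sentence` with the highest confidence, or None."""
    candidates = [parse for parse, sent in links if sent == sentence]
    if not candidates:
        return None
    return max(candidates, key=lambda parse: tvs[parse][1])


print(best_parse(SENTENCE, parse_links, truth_values))
```

The lookup never inspects the `_parse_N` suffix; it depends only on the link structure and the confidence value.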
WordInstanceNode, WordInstanceLink, WordSequenceLink
Word instances are unique, individual instances of words occurring in a given parse. The WordInstanceNode is used to represent these. For example:
(WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
These are created with unique names because the same word may occur multiple times within one sentence, or one document, and one must be able to tell them apart. Word instances are tagged with feature data, such as tense, number, part-of-speech. As a result, each word instance is associated with a particular parse, since different parses may assign different feature data to different word instances.
The format of the name is word@UUID. The word is there to make the name human-readable, and thus easier to debug manually. The UUID is there to make sure that every word instance is uniquely identified. The UUID is large, in order to avoid the birthday paradox that can occur when tagging items with unique labels. Although the UUID can be anything, in practice, a 128-bit MD5 hash is used. It is spelled out in ASCII to make it human-readable, and thus easier to debug.
Word order is indicated with WordSequenceLinks:
(WordSequenceLink (stv 1.0 1.0)
  (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
  (NumberNode "42")
)
(WordSequenceLink (stv 1.0 1.0)
  (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
  (NumberNode "43")
)
(WordSequenceLink (stv 1.0 1.0)
  (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
  (NumberNode "44")
)
(WordSequenceLink (stv 1.0 1.0)
  (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
  (NumberNode "45")
)
To verify word-order in the pattern matcher for word-order-dependent tasks, use the GreaterThanLink.
There is no guarantee that the issued numberings are sequential over all time; instead, they come from a counter that is restarted whenever the RelEx server is restarted. If long-term, large-scale sequential ordering is needed, a different mechanism should be invented.
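Since the counter can start anywhere, only the relative order of the numbers is meaningful. The sketch below illustrates this in plain Python, using a dictionary as a stand-in for the WordSequenceLinks above; the helper names are invented for illustration. The `precedes` comparison plays the role that GreaterThanLink plays inside the pattern matcher.

```python
# word-instance name -> sequence number, one entry per WordSequenceLink.
word_sequence = {
    "feet@bf71826c-487e-42df-a941-0ecd3c942a76": 45,
    "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9": 42,
    "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3": 44,
    "have@0f223d17-31a6-49fe-9d37-350d50c53926": 43,
}


def in_sentence_order(sequence):
    """Return the word-instance names sorted by their sequence number."""
    return sorted(sequence, key=sequence.get)


def precedes(first, second, sequence):
    """True if `first` occurs before `second`; uses relative order only."""
    return sequence[first] < sequence[second]


ordered = in_sentence_order(word_sequence)
for name in ordered:
    print(word_sequence[name], name)
```

Nothing here assumes that numbering starts at zero or one, which is what makes it robust against counter restarts within a single sentence's worth of links.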
Word instances are anchored to parses by means of the WordInstanceLink. Given a ParseNode, it becomes very easy to find all word instances associated with that parse. For example:
(WordInstanceLink (stv 1.0 1.0)
  (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
  (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")
)
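The lookup from a ParseNode to its word instances amounts to following the WordInstanceLinks backwards. As a rough illustration in plain Python (pairs standing in for links; the function name is an assumption, not an OpenCog API), the links can be grouped by parse:

```python
from collections import defaultdict

PARSE_0 = "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0"

# One (word_instance_name, parse_name) pair per WordInstanceLink.
word_instance_links = [
    ("humans@52c30b4d-5717-47cb-822d-b2caa44f94b9", PARSE_0),
    ("have@0f223d17-31a6-49fe-9d37-350d50c53926", PARSE_0),
    ("two@f6c2a2a2-232b-4e33-9c93-72f211b475d3", PARSE_0),
    ("feet@bf71826c-487e-42df-a941-0ecd3c942a76", PARSE_0),
]


def instances_by_parse(links):
    """Group word-instance names under the parse they belong to."""
    index = defaultdict(list)
    for instance, parse in links:
        index[parse].append(instance)
    return dict(index)


by_parse = instances_by_parse(word_instance_links)
print(len(by_parse[PARSE_0]))
```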
WordNode
The WordNode is used to indicate a word. The name of this node is the word itself. For example:
(WordNode "feet")
Note that the name of this atom is literally the word.
Given a word instance, it is useful to know what the underlying word is. This is recorded with a ReferenceLink. For example:
(ReferenceLink
  (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
  (WordNode "feet")
)
Please avoid the hacky solution of searching for the @ sign and taking the word to be everything before it. This is a bad idea. Don't.
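One way to see why string-splitting is fragile: a token may itself contain an @ sign. The sketch below contrasts the two approaches in plain Python, with a dictionary standing in for the ReferenceLinks. The e-mail-like word instance and its all-zero UUID are hypothetical, and whether the tokenizer would actually emit such a token as a single word is an assumption made only for illustration.

```python
# word-instance name -> word, one entry per ReferenceLink.
reference_links = {
    "feet@bf71826c-487e-42df-a941-0ecd3c942a76": "feet",
    # Hypothetical instance whose word contains an '@' sign.
    "user@example.com@00000000-0000-0000-0000-000000000000": "user@example.com",
}


def word_of(instance):
    """Recommended: follow the ReferenceLink to the WordNode."""
    return reference_links[instance]


def naive_word_of(instance):
    """Discouraged: take everything before the first '@' sign."""
    return instance.split("@")[0]


tricky = "user@example.com@00000000-0000-0000-0000-000000000000"
print(word_of(tricky))         # the full word
print(naive_word_of(tricky))   # truncated at the first '@'
```

The name format is a debugging aid rather than a contract, so only the link-based lookup stays correct if that format ever changes.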
LgLinkNode
Link grammar links are represented with an EvaluationLink predicate. The name of the link is given in the name of the LgLinkNode. For example:
(EvaluationLink
  (LgLinkNode "Op")
  (ListLink
    (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
    (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
  )
)
Note that this links together word instances, not words; this is required, as different parses will have different Link Grammar linkages. The LgLinkNode is used to make it easy to distinguish these links from other EvaluationLinks.
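For downstream processing, each such EvaluationLink can be thought of as a labelled edge between two word instances. A minimal sketch in plain Python follows, with (label, left, right) triples standing in for the atoms; the helper name is invented for illustration.

```python
HUMANS = "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9"
HAVE = "have@0f223d17-31a6-49fe-9d37-350d50c53926"
TWO = "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3"
FEET = "feet@bf71826c-487e-42df-a941-0ecd3c942a76"

# One (link_label, left_instance, right_instance) triple per EvaluationLink
# whose predicate is an LgLinkNode.
lg_links = [
    ("Op", HAVE, FEET),
    ("Dmc", TWO, FEET),
    ("Sp", HUMANS, HAVE),
    ("Wd", "LEFT-WALL", HUMANS),
]


def link_labels_of(instance, links):
    """Return the sorted labels of all links touching a word instance."""
    return sorted(
        label for label, left, right in links if instance in (left, right)
    )


print(link_labels_of(FEET, lg_links))
print(link_labels_of(HAVE, lg_links))
```

Keeping the left and right positions distinct matters, because Link Grammar links are ordered: the first word instance occurs earlier in the sentence than the second.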
Example output
Below is the complete output for the parse of "Humans have two feet", in full gory detail, as generated by LgParseLink:
(ReferenceLink
  (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
  (WordNode "humans")
)
(WordInstanceLink
  (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
  (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")
)
(ReferenceLink
  (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
  (WordNode "have")
)
(WordInstanceLink
  (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
  (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")
)
(ReferenceLink
  (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
  (WordNode "two")
)
(WordInstanceLink
  (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
  (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")
)
(ReferenceLink
  (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
  (WordNode "feet")
)
(WordInstanceLink
  (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
  (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")
)
(EvaluationLink
  (LgLinkNode "Op")
  (ListLink
    (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
    (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
  )
)
(EvaluationLink
  (LgLinkNode "Dmc")
  (ListLink
    (WordInstanceNode "two@f6c2a2a2-232b-4e33-9c93-72f211b475d3")
    (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
  )
)
(EvaluationLink
  (LgLinkNode "Sp")
  (ListLink
    (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
    (WordInstanceNode "have@0f223d17-31a6-49fe-9d37-350d50c53926")
  )
)
(EvaluationLink
  (LgLinkNode "Wd")
  (ListLink
    (WordInstanceNode "LEFT-WALL")
    (WordInstanceNode "humans@52c30b4d-5717-47cb-822d-b2caa44f94b9")
  )
)
(ParseLink
  (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0" (stv 1.0 0.9417))
  (SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")
)
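When a dump like the one above has to be inspected outside the AtomSpace (for example, in a quick debugging script), a small s-expression reader is enough. The following is a rough sketch, not an OpenCog tool: `read_atoms` is a name invented here, numbers are left as strings, and escape sequences inside quoted names are not decoded.

```python
import re

# A token is a double-quoted string, a single parenthesis, or a bare symbol.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[()]|[^\s()"]+')


def read_atoms(text):
    """Parse a dump of Atomese s-expressions into nested Python lists.

    Each atom becomes [type_name, item, ...]; quoted names lose their
    surrounding quotes. Raises ValueError on unbalanced parentheses.
    """
    stack = [[]]                       # stack[0] collects top-level atoms
    for token in TOKEN.findall(text):
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        elif token.startswith('"'):
            stack[-1].append(token[1:-1])
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("missing ')'")
    return stack[0]


dump = '''
(ReferenceLink
  (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
  (WordNode "feet")
)
(ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0" (stv 1.0 0.9417))
'''
atoms = read_atoms(dump)
for atom in atoms:
    print(atom[0], len(atom) - 1)
```

For anything beyond inspection, working directly in the AtomSpace is preferable, since the textual dump loses atom identity: the same node printed twice becomes two separate Python lists.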