RelEx compact output
The RelEx Web Crawler project requires a high-quality, comprehensive format for presenting the results of natural-language parsing on a large scale. The intent of the web crawler project is to parse large parts of the world-wide web, and to make the results publicly available in a "technology neutral" format to all interested researchers and hackers. It is hoped that access to large quantities of pre-parsed text will enable new kinds of large-scale statistical corpus analysis, by sparing users the immense amount of compute time needed to parse a large amount of text.
To make the format useful to a wide array of users, the goal is to present the data in four complementary formats: a list of tags and features (such as part-of-speech, count, tense, etc.) for each word, a Penn tree-bank style constituent tree, a dependency-grammar style list of relations (such as subject, object, etc.), and a Link Grammar style list of links. Each of these formats has its strengths and weaknesses; different algorithms often require a different one of these formats as input, and synergy can be obtained in algorithms that consider several of these at once.
The current effort focuses entirely on the RelEx relation extractor, which provides a dependency-grammar-like markup based on the output of the CMU Link Grammar parser. It is hoped that the format might be general enough to be useful for other parser systems as well. The RelEx/Link-Grammar combo may not be the speediest parser out there, but it should offer possibly the most accurate results (for the English language only) of any free, publicly available parser.
A collection of pre-parsed texts, including the English Wikipedia, is available at http://gnucash.org/linas/nlp/data/
The output format needs to satisfy the following goals:
- Be relatively compact, storage-wise. The result of English-language parsing is typically much larger than the English-language input. The excess size should be minimized.
- Be easily human readable, to allow quick review for correctness.
- Be machine readable with little effort; avoid fancy markup; be close to existing, entrenched formats.
- Provide parse meta-information, such as date of parse, parser version numbers, scoring and ranking info.
The file format should be "easy to rip", i.e. easy to create ad-hoc utility scripts to process the data. The use of a formal, formally-correct language is not required.
The format must provide the following output:
- Provide a version number identifying the output format.
- Provide a version number identifying the RelEx and link-grammar versions. (Newer RelEx/link-grammar versions will in general provide higher quality/more accurate output, and so it is useful to know whether the output came from an older or buggier version of RelEx/link-grammar.)
- Processing date (the date when this output was generated).
- The location (URL or URI) of the original text.
For each sentence in the text, the following output needs to be provided:
- Identifier indicating which sentence this is, in the text. (The meaning of sentences depends on the order in which they appear.)
- The total number of parses generated by link-grammar.
- The output of the first N parses, where N should probably be 4.
For each parse, the following needs to be returned:
- Parse ranking scores, as returned by link-grammar.
- All Link Grammar link pairs. These have many uses, such as providing input to mutual information (entropy) corpus statistical analysis.
- All RelEx word properties (features), such as part-of-speech, number, tense, gender, etc. Words must be appropriately tagged, so that each word instance is uniquely identified.
- All RelEx binary relations (dependency relations), such as subject, object, prepositional relations, etc. Words must be appropriately tagged, so that each word instance is uniquely identified.
- The constituent tree, i.e. the Penn Tree-Bank style parse tree. Words do not need to be tagged, as word order is implicit in the parse tree.
- (Optional) Identify head-words of phrases.
The output format does not need to supply the following:
- The original sentence. This can be reconstructed from the constituent tree.
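The claim that the sentence can be recovered from the constituent tree can be illustrated with a minimal sketch. It assumes the tree is a plain S-expression in which the first atom after each opening parenthesis is a phrase label and every other atom is a word; the function name words_from_constituents is hypothetical. Note that only the word sequence is recovered: the original spacing around punctuation is not.

```python
import re


def words_from_constituents(tree):
    """Return the leaf words of a Penn-treebank-style S-expression, in order."""
    words = []
    expect_label = False
    for token in re.findall(r"\(|\)|[^\s()]+", tree):
        if token == "(":
            expect_label = True       # the next atom names the phrase (S, NP, ...)
        elif token == ")":
            expect_label = False
        elif expect_label:
            expect_label = False      # skip the phrase label itself
        else:
            words.append(token)       # any other atom is a word of the sentence
    return words


# Constituent tree copied from the annotated example on this page.
tree = ("(S (NP (NP Most) (PP of (NP (NP the adventures) "
        "(VP recorded (PP in (NP this book)))))) "
        "(VP (ADVP really) occurred) .)")

sentence_words = words_from_constituents(tree)
print(" ".join(sentence_words))
```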
Note that the current RelXML output format fails to deal with many of the above requirements; it is verbose and occasionally obscure. Possible candidate formats are:
- The TigerXML treebank encoding format. Pros and cons:
  - Self-describing, with head for including meta-information and body for the actual parsed text.
  - Has schema.
  - Requires relation types to be declared in advance, as part of the header: this leads to major difficulties for RelEx, which generates prepositional relations on the fly. Since these are not known in advance, they cannot be collected until parsing is over, which in turn would require two parse passes, or some other means of buffering the data.
  - Difficult to read.
- The CoNLL format, possibly with extensions/wrappers to meet above requirements. See also the main site, and an example, in Dutch. Some pros and cons of the CoNLL format:
  - Nice and compact for representing lemmas and features.
  - Dependency relations are (very) hard to read, and are thus not immediately debuggable.
  - No support for query variables, comparison variables or constituent tree.
  - No support for link relations.
  - No support for multiple parse candidates.
  - No support for corpus and parser metadata (parser version, ranking, etc.)
- The eXGL format, a generic format for representing graph structures in terms of nodes and links. Pros and cons:
  - Based on GXL, the Graph eXchange Language.
  - Extremely verbose.
  - Difficult to read.
- The WordHoard Part of Speech XML File, an interesting way of representing many/most features in an abbreviated way. Similarly the WordHoard word class file is also suggestive.
- The TEI Text Encoding Initiative. While TEI does not provide a format for the specific task at hand, it does provide general text markup formats and guidelines that should be observed.
- A new XML-based format, designed to meet the above requirements.
- A new JSON-based format, designed to meet the above requirements.
- A new Turtle (Terse RDF Triple Language)-based format.
- A new YAML-based format.
The best solution currently seems to be a new XML/JSON format, embedding within it the CoNLL-like format for lemmas and features. Both the dependency relations, and the link-grammar links would be listed as collections of triples. Also worth considering/reconsidering: Turtle, Notation 3, N-triples, or vanilla RDF. See also comments/critique at end of this page.
Annotated example of proposed format
Below follows an example of the new proposed format (under construction, incomplete).
Note that the format is not "pure" XML; there is also structured data within the document itself. The goal of using such a mix of XML and non-XML structured data is to give a nod to traditional NLP formats, and to avoid the excessive bloat and overhead from requiring every last bit of data to be XML-encoded.
<?xml version="1.0" encoding="UTF-8"?>
<!-- File format version embedded in namespace name. -->
<nlparse xmlns="http://opencog.org/RelEx/0.1.1">

  <!-- Parser versions, tab-separated. -->
  <parser>link-grammar-4.3.5\trelex-0.9.0</parser>

  <!-- Parse date, iso8601 format. -->
  <date>2008-06-27 23:47Z</date>

  <!-- Location of the original source document. -->
  <source url="http://www.gutenberg.org/extext/74"/>

  <!-- Sentence number 1 in the document had 4 parses. -->
  <sentence index="1" parses="4">
    Most of the adventures recorded in this book really occurred.

    <!-- The parse id is a simple count, starting at 1 -->
    <parse id="1">

      <!-- Link-grammar specific ranking information -->
      <lg-rank num_skipped_words="0" disjunct_cost="0" and_cost="0" link_cost="20" />

      <!-- The constituent tree is represented in the
           standard notation commonly used in computational
           linguistics, namely, an S-expression -->
      <constituents>
(S (NP (NP Most) (PP of (NP (NP the adventures) (VP recorded (PP in (NP this book)))))) (VP (ADVP really) occurred) .)
      </constituents>

      <!-- CoNLL-style lemma, part-of-speech and feature
           tags, tab-separated, newline-terminated. -->
      <features>
1	most	most	noun
2	of	of	prep
3	the	the	det
4	adventures	adventure	noun	plural|definite
5	recorded	record	verb	past
6	in	in	prep
7	this	this	det
8	book	book	noun	singular|definite
9	really	really	adv
10	occurred	occur	verb	past
11	.	.	punctuation
      </features>

      <!-- List of relation triples, newline-separated.
           The index of the word in the sentence is in
           square brackets. The word itself is optionally
           included for readability. -->
      <relations>
_advmod(really, occur)
in(record, book)
_obj(record, adventure)
of(most, adventure)
_subj(occur, most)
      </relations>

      <!-- Link-grammar list of links. Numerical values
           are the index of the word in the sentence; an
           index of 0 indicates the "left wall". -->
      <links>
S(1, 10)
Mp(1, 2)
Jp(2, 4)
Mv(4, 5)
MVp(5, 6)
Js(6, 8)
Dsu(7, 8)
Em(9, 10)
Dmc(3, 4)
Wd(0, 1)
Xp(0, 11)
      </links>
    </parse>
  </sentence>
</nlparse>
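As a rough illustration of the "easy to rip" goal, the following sketch walks a document in the proposed format with Python's standard xml.etree.ElementTree module. The embedded sample is a trimmed copy of the example, without comments and with most sections omitted; the function name summarize and the "rx" namespace alias are hypothetical choices made for this sketch, not part of the format.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example; the "rx" prefix is only a local alias.
NS = {"rx": "http://opencog.org/RelEx/0.1.1"}

# A trimmed copy of the example document (comments and most sections omitted).
SAMPLE = """<nlparse xmlns="http://opencog.org/RelEx/0.1.1">
  <source url="http://www.gutenberg.org/extext/74"/>
  <sentence index="1" parses="4">
    Most of the adventures recorded in this book really occurred.
    <parse id="1">
      <lg-rank num_skipped_words="0" disjunct_cost="0" and_cost="0" link_cost="20" />
      <relations>
_advmod(really, occur)
in(record, book)
_obj(record, adventure)
of(most, adventure)
_subj(occur, most)
      </relations>
    </parse>
  </sentence>
</nlparse>"""


def summarize(xml_text):
    """Return one dict per <parse>, carrying its sentence-level context."""
    root = ET.fromstring(xml_text)
    summaries = []
    for sentence in root.findall("rx:sentence", NS):
        # Text before the first child element is the original sentence.
        original = (sentence.text or "").strip()
        for parse in sentence.findall("rx:parse", NS):
            rank = parse.find("rx:lg-rank", NS)
            relations = parse.findtext("rx:relations", default="", namespaces=NS)
            summaries.append({
                "sentence": int(sentence.get("index")),
                "total_parses": int(sentence.get("parses")),
                "parse": int(parse.get("id")),
                "link_cost": int(rank.get("link_cost")) if rank is not None else None,
                "text": original,
                "relations": [ln.strip() for ln in relations.splitlines() if ln.strip()],
            })
    return summaries


summaries = summarize(SAMPLE)
for item in summaries:
    print(item["sentence"], item["parse"], item["link_cost"], item["relations"])
```

Because the per-parse sections are plain newline-separated text, the XML layer only has to locate the section; the contents can then be handled with ordinary string operations.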
Below follows the detailed specification (incomplete, under construction). All elements are optional, and can be omitted. It is, however, strongly recommended that all elements be present, as the absence of one or another can severely limit the utility of generated output.
<nlparse>: Top-level stanza. Attributes:
- xmlns: Indicates the version of the specification. Currently has the value http://opencog.org/RelEx/0.1.1.
<parser>: Identification string for the parser version. The explicit format for the string is not specified; however, there is a strongly recommended format. The parser should be identified by means of a string of the form parsername-X.YY.ZZ, where X, YY and ZZ are the major, minor and incremental version numbers of the parser parsername. It can be useful to include version numbers for the various parts of the parser (such as the sentence detector, or the entity detector); these should be tab-separated.
<date>: Date on which the parse was made. Date is in ISO8601 format.
Date on which the document was retrieved. Date is in ISO8601 format. Because web pages and other web texts can change over time, the document date is meant to capture the version of the original text.
<source>: Location of the original source-text.
Copyright applicable to the original source.
Copyright applicable to the parsed data.
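The recommended parsername-X.YY.ZZ identification string, with tab-separated components, could be split apart along the following lines. This is a minimal sketch: the regular expression and the function name parse_parser_field are assumptions for illustration, and since the specification does not mandate the string format, unrecognised components are passed through with no version attached.

```python
import re

# Assumed shape: a name, a hyphen, then three dot-separated integers.
VERSION_RE = re.compile(
    r"^(?P<name>.+)-(?P<major>\d+)\.(?P<minor>\d+)\.(?P<incr>\d+)$"
)


def parse_parser_field(field):
    """Split a tab-separated parser string into (name, version) pairs.

    The version is a (major, minor, incremental) tuple of ints, or None
    when a component does not follow the recommended form.
    """
    components = []
    for part in field.split("\t"):
        part = part.strip()
        if not part:
            continue
        match = VERSION_RE.match(part)
        if match is None:
            components.append((part, None))
        else:
            version = (int(match.group("major")),
                       int(match.group("minor")),
                       int(match.group("incr")))
            components.append((match.group("name"), version))
    return components


# In a Python string literal, "\t" is an actual tab character.
components = parse_parser_field("link-grammar-4.3.5\trelex-0.9.0")
print(components)
```

Keeping versions as integer tuples makes it straightforward to compare them, for example to skip output produced by parsers older than some known-good release.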
<sentence>: Indicates the scope of a sentence. A given <nlparse> section may contain zero or more of these sections. A <sentence> section may enclose PCDATA (plain text), in which case this text is understood to be the original sentence. Inclusion of the original sentence is for readability and convenience only; individual parses will repeat this information, although in a less readable form. Inclusion of the original text is optional but recommended. Attributes:
- index: Serially-issued id number, indicating the index of the sentence in the text. This number is meant to indicate the order in which the sentences occur in the text, so that sentences are ordered by index. Across the text as a whole, the index is assumed to run contiguously, with no holes or gaps, and the first sentence in the text should be labelled with an index of 1. However, it is valid to split a parse of a large text across multiple XML files, and so within any given XML file, the index may not actually start at 1, and may contain holes (so, for example, one file may contain only even sentences, and another only odd sentences).
- The absolute meaning of this number is somewhat ambiguous, as the identification of sentences will vary amongst different types of sentence detectors.
- Offset, in bytes, to the start of the sentence in the original text. This field is optional; it is intended to mitigate the problem of different sentence detectors identifying different sentences, and thus rendering the index ambiguous.
- parses: Indicates the number of parses that the parser found. This can be, in general, larger than the number of parses that follow.
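Since one text may be split across several files, a consumer may want to recombine the sentence indexes and check that nothing is missing. The sketch below works on bare lists of index values (how they are read from the files is left out), and the function names merge_indexes and missing_indexes are hypothetical.

```python
def merge_indexes(*per_file_indexes):
    """Combine sentence indexes from several files into one sorted list."""
    merged = set()
    for indexes in per_file_indexes:
        merged.update(indexes)
    return sorted(merged)


def missing_indexes(indexes):
    """Return the indexes absent from 1..max(indexes); empty means no gaps."""
    present = set(indexes)
    if not present:
        return []
    return [i for i in range(1, max(present) + 1) if i not in present]


# One file holds the odd sentences, another the even ones.
odd_file = [1, 3, 5]
even_file = [2, 4, 6]

merged = merge_indexes(odd_file, even_file)
print(merged)
print(missing_indexes(odd_file))
print(missing_indexes(merged))
```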
<parse>: This section contains parsed data.
<lg-rank>: This section contains link-parser specific parse-rank scoring information.
<constituents>: The tree-bank-style constituent tree. To be documented. For RelEx, see constituent tree.
<features>: List of features.
<relations>: List of dependency-grammar-like relations. To be documented. For RelEx, see dependency relations.
<links>: List of link-grammar links. To be documented. See Index to Link Grammar Documentation.
Known bugs: Idioms. The link-grammar parser autogenerates special links for idioms, and gives these links a unique ID, of the form IDxxxx where xxxx is an alphanumeric string. These IDs are "usually" the same from one version to the next, except when idioms are added to the dictionary, in which case the numbering changes completely. Thus, idiom link types will differ from one link-parser version to another.
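The text sections of a parse are simple enough to process with a few lines of script. The sketch below is written under stated assumptions: the feature columns are taken to be index, word, lemma, part-of-speech and an optional "|"-separated feature list (inferred from the example, since the columns are not labelled), and each relation or link is taken to have the form name(left, right) on its own line. The function names and the regular expression are hypothetical.

```python
import re

# name(left, right); breaks on arguments that contain commas or parentheses.
TRIPLE_RE = re.compile(r"^\s*([^\s(]+)\(\s*([^,]+?)\s*,\s*([^)]+?)\s*\)\s*$")


def parse_features(text):
    """Parse the tab-separated feature table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        cols = line.strip().split("\t")
        if len(cols) < 4:
            raise ValueError("expected at least 4 columns: %r" % line)
        rows.append({
            "index": int(cols[0]),
            "word": cols[1],
            "lemma": cols[2],
            "pos": cols[3],
            # Fifth column is optional; features are separated by "|".
            "features": cols[4].split("|") if len(cols) > 4 else [],
        })
    return rows


def parse_triples(text):
    """Parse newline-separated name(left, right) entries into tuples."""
    triples = []
    for line in text.splitlines():
        match = TRIPLE_RE.match(line)
        if match:
            triples.append(match.groups())
    return triples


FEATURES = "4\tadventures\tadventure\tnoun\tplural|definite\n9\treally\treally\tadv\n"
RELATIONS = "_advmod(really, occur)\n_subj(occur, most)\n"
LINKS = "S(1, 10)\nWd(0, 1)\n"

features = parse_features(FEATURES)
relations = parse_triples(RELATIONS)
# Link arguments are word indexes, so convert them to integers.
links = [(name, int(left), int(right)) for name, left, right in parse_triples(LINKS)]

print(features)
print(relations)
print(links)
```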
Experiments with the current system show the following results for a corpus of exactly 10K sentences (the corpus is plain text, and does not contain HTML markup):
- The original corpus is about 750K bytes uncompressed.
- The original corpus is about 250K bytes gzipped.
- The parsed XML format is about 36MB uncompressed.
- The parsed XML format is about 3.6MB gzipped.
The result of parsing appears to blow up the size of the text by a factor of roughly 15, even after compression.
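The size ratios implied by these figures can be checked with a few lines of arithmetic. The sketch assumes that "K" means 1000 bytes and "MB" means 1,000,000 bytes, which the text does not state, and the variable names are illustrative.

```python
# Reported sizes, in bytes (assuming K = 1000 and MB = 1,000,000).
corpus_raw = 750_000
corpus_gz = 250_000
parsed_raw = 36_000_000
parsed_gz = 3_600_000
sentences = 10_000

raw_ratio = parsed_raw / corpus_raw          # growth before compression
gz_ratio = parsed_gz / corpus_gz             # growth after compression
bytes_per_sentence = parsed_raw / sentences  # average uncompressed XML per sentence

print(f"uncompressed growth: {raw_ratio:.1f}x")   # 48.0x
print(f"gzipped growth:      {gz_ratio:.1f}x")    # 14.4x
print(f"XML bytes/sentence:  {bytes_per_sentence:.0f}")  # 3600
```

On these figures the gzipped ratio of about 14.4 matches the rough factor of 15 quoted above, while the uncompressed output is nearly 50 times the size of the input.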
Comments, shortcomings, critique, suggestions
The combined use of a CoNLL-style feature table, together with a list of relations, solves a variety of problems:
- Putting the features into a table makes them much more human-readable, as opposed to a cluttered list of tags such as pos(throw, verb) tense(threw, past).
- Removing the relations from the CoNLL table, and putting them into a list of relations also makes the result much more readable: subj(throw, Alice) obj(throw, ball) is much easier to comprehend than when expressed in a tabular format.
The above format has the following shortcomings that should probably be dealt with:
- The columns of the CoNLL-style feature list are not labelled. This makes the format fragile with respect to future changes (such as the addition of fine-grained part-of-speech tags).
- The attributes of the lg-rank tag are highly link-grammar specific. Other parsers will need to have their own parser-specific ranking tag.
- Add per-line xml markers, rather than depending on newline.
- Put the parser, date, etc. info into a "<meta>" section, similar to the TigerXML header section, or, better yet, as suggested by the TEI standard.
- Provide a meta-descriptive section of the kinds of tags that can appear in the treebank and other sections.
- Provide an xmlns example of html, with parse data embedded.