Linas NLP tasks
This page lists a set of half-baked ideas and tasks regarding NLP (natural language processing) within OpenCog. The goal of this page is to help User:Linas more clearly formulate a strategic direction, and hew to it.
The ideas page contains various interesting NLP ideas.
Intermediate-term goals are to synthesize the best practices and best ideas from construction grammar and dependency grammar with OpenCog's current parser, which is based on link grammar, into a single system. The synthesis is to make use of the general principles of corpus linguistics, that is, mining statistical information to obtain concrete parse rules and construction rules.
Current focus is on using mutual information as a basic measure of statistical relatedness, and then using either Markov chains, Markov random fields or Markov logic networks to obtain the colligations, head words, phrases, parse rules, co-reference resolution, and ultimately, hopefully, the semes/lexis of the text. The three Markovian systems are listed in order of increasing complexity; the intent is to apply only the simplest model needed to resolve a particular situation.
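As a rough illustration of the mutual information measure mentioned above, here is a minimal sketch that computes pointwise MI for ordered word pairs from raw pair counts. The function name `pair_mi`, the toy counts, and the choice of log base 2 are assumptions made for the example; this is not the code used to build the actual databases.

```python
import math
from collections import Counter

def pair_mi(pair_counts):
    """Return {(left, right): MI in bits} from a Counter of ordered word pairs.

    MI(l, r) = log2( P(l, r) / (P(l, *) * P(*, r)) ), with probabilities
    estimated from the pair counts themselves.
    """
    total = sum(pair_counts.values())
    left = Counter()   # marginal counts for the left word
    right = Counter()  # marginal counts for the right word
    for (l, r), n in pair_counts.items():
        left[l] += n
        right[r] += n
    return {
        (l, r): math.log2(n * total / (left[l] * right[r]))
        for (l, r), n in pair_counts.items()
    }

# Toy counts, invented for illustration only.
counts = Counter({
    ("motorcycle", "crash"): 4,
    ("motorcycle", "club"): 1,
    ("car", "crash"): 1,
    ("car", "club"): 4,
})
mi = pair_mi(counts)
```

With these toy counts, pairs that co-occur more often than chance get a positive score and pairs that co-occur less often than chance get a negative one.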
- Collect statistics for how often each disjunct parsing rule in link-grammar is used. (Dec 2008, done, have DB of millions of disjuncts)
- Collect statistics for how often pairs of such rules are used together.
- Search for clusters of words that use such rules and such pairs of rules. (Preliminary experiments run; lacking good, high-performance open-source clustering software.)
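The first two counting steps in the list above could be sketched roughly as follows. The input format (each parse as a list of `(word, disjunct)` tuples), the disjunct strings, and the reading of "used together" as "appearing in the same parse" are all assumptions made for illustration; the real statistics are stored in a database and were gathered differently.

```python
from collections import Counter
from itertools import combinations

def count_disjuncts(parses):
    """Count single-disjunct usage and within-parse disjunct co-occurrence.

    `parses` is a hypothetical format: a list of parses, each parse being
    a list of (word, disjunct_string) tuples.
    """
    single = Counter()
    pairs = Counter()
    for parse in parses:
        disjuncts = [d for _word, d in parse]
        single.update(disjuncts)
        # Sort each pair so that (a, b) and (b, a) share one key.
        pairs.update(tuple(sorted(p)) for p in combinations(disjuncts, 2))
    return single, pairs

# Illustrative parses; the disjunct strings are simplified placeholders.
parses = [
    [("the", "D+"), ("dog", "D- S+"), ("ran", "S-")],
    [("a", "D+"), ("cat", "D- S+"), ("sat", "S-")],
]
single, pairs = count_disjuncts(parses)
```

Counts like these are the raw material for the clustering step: words can be compared by the vector of disjuncts (or disjunct pairs) they are observed with.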
The goal of the above is both to expand and also to refine the coverage of the parser. The dictionaries have large collections of words that have been pre-assigned to disjuncts of parse rules. Some of these collections seem to be over-broad. It is quite likely that some of these larger dictionary entries will split into smaller groups. However, these can only be found by means of corpus linguistics.
Some preliminary experiments with clustering show that similar words can be grouped together. There seems to be a lot of interesting, suggestive data in the statistics DB; its contents are more-or-less unexplored.
- Collect statistics on word-sense usage vs. parse rule employed. (As of Jan 2009, a large database of link-grammar disjuncts tagged with word-net word senses has been collected).
The goal here is to discover which parse rules are associated with which word senses. It is very clear that certain word senses are used only in certain grammatical settings. However, the database appears to contain a lot of interesting relations that have not been discovered, in part because of a lack of appropriate clustering s/w.
Here is a rather long blog entry explaining how things are being done, and what has been found: http://brainwave.opencog.org/2009/01/12/determining-word-senses-from-grammatical-usage/
Yuret's (1998) algo tries to create a dependency tree by computing mutual information (MI) between word pairs. The tree is discovered by computing the maximum spanning tree of the MI between all word pairs. There is an alternate approach: true, hierarchical clustering. In hierarchical clustering, one creates an MI-based metric (MI alone is not a metric), and applies cluster analysis techniques. To get hierarchy, one also looks for and computes metric measures between clusters, i.e. one computes the MI between a third word and a word-pair as a whole, instead of between the third word and just the head-word of the pair phrase.
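The maximum-spanning-tree step described above can be sketched with Prim's algorithm. This is a plain maximum spanning tree over pairwise MI scores, not a reproduction of Yuret's actual procedure; the function name, the `frozenset`-of-indices keying, and the example scores are assumptions made for illustration.

```python
def max_spanning_tree(words, mi):
    """Prim's algorithm, maximizing total MI.

    `words` is the sentence as a list; `mi` maps frozenset({i, j}) of word
    positions to an MI score. Missing pairs are treated as -infinity.
    Returns a list of (word_i, word_j, score) edges.
    """
    if not words:
        return []
    in_tree = {0}  # start from the first word
    edges = []
    while len(in_tree) < len(words):
        best = None
        for i in in_tree:
            for j in range(len(words)):
                if j in in_tree:
                    continue
                score = mi.get(frozenset((i, j)), float("-inf"))
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        in_tree.add(j)
        edges.append((words[i], words[j], score))
    return edges

# Hypothetical MI scores for a three-word sentence.
words = ["the", "motorcycle", "crash"]
mi = {
    frozenset((0, 1)): 2.0,
    frozenset((1, 2)): 7.0,
    frozenset((0, 2)): 0.5,
}
tree = max_spanning_tree(words, mi)
```

Keying by word position rather than by word string keeps repeated words in a sentence distinct. The hierarchical-clustering variant would differ in that the nodes being linked can themselves be clusters.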
There has been a considerable amount of work done in this area; googling "minimum spanning tree parser" will provide hits to some of this work. These kinds of parsers typically have among the best scores for automatically learning new languages. Also try googling "unsupervised dependency parser".
The phrases "X wrote Y" and "X is the author of Y" are more or less synonymous. Lin and Pantel (Lin2001) describe a method for automatically discovering such synonymous phrases. The goal of this task is to implement the method described in that paper.
- (Lin2001) Dekang Lin and Patrick Pantel. 2001. "Discovery of Inference Rules for Question Answering." Natural Language Engineering 7(4):343-360.
- This is not a valid GSOC project. It's just a note to myself.
Consider the word pairs "motorcycle crash" (mutual info MI=7.27962) and "motorcycle accident" (MI=7.86758). Compare this to MI(motorcycle, [crash or accident]) = 7.66015, which is higher than the average (MI(motorcycle crash) + MI(motorcycle accident))/2 = 7.57361. So, clustering together the two words, "crash, accident", raises the MI just a tad over the average.
By contrast, this doesn't hold for morphology, so MI(motorcycle, [club or clubs]) < (MI(motorcycle club) + MI(motorcycle clubs))/2. Hmmmm. That's odd ...
- Experiment was performed 4 Sept 2008, dead end. Source code in directory lexical-attr/src/cluster.
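The comparison in the motorcycle example can be checked with a few lines of arithmetic. The three MI values are the ones quoted above. The `pooled_mi` helper is only a guess at how an MI against a merged word class might be computed (by pooling counts and using log base 2); it is an assumption for illustration, not the code in lexical-attr/src/cluster.

```python
import math

# MI values quoted in the text above.
mi_crash = 7.27962
mi_accident = 7.86758
mi_merged = 7.66015

average = (mi_crash + mi_accident) / 2
gain = mi_merged - average  # positive: merging beats the plain average

def pooled_mi(n_pair_a, n_pair_b, n_left, n_right_a, n_right_b, n_total):
    """MI (bits) between a left word and the merged class {a, b}.

    Hypothetical helper: counts for the two right-hand words are simply
    added together before the usual MI formula is applied.
    """
    p_joint = (n_pair_a + n_pair_b) / n_total
    p_left = n_left / n_total
    p_right = (n_right_a + n_right_b) / n_total
    return math.log2(p_joint / (p_left * p_right))
```

The same check could be repeated for the "club"/"clubs" case, where the text reports that the merged MI falls below the average.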