Word sense disambiguation

In OpenCog, word sense disambiguation (WSD) is one of the tasks performed as part of the natural language processing pipeline. Currently, OpenCog implements a minor variant of the Mihalcea all-words graph algorithm (reference needed). The code is in the opencog/nlp/wsd directory, and the opencog/nlp/wsd/README file provides additional details.

Current implementation

The current algorithm builds what is more or less a Markov chain and solves it with the PageRank algorithm of Page and Brin. In brief: every word in a sentence is tagged with every possible word sense. At this time, these senses are taken from WordNet. Next, pairs of senses are linked together by bonds that represent their affinity for each other. The strength of each bond is computed using the WordSense::Similarity package, in the manner described by Mihalcea et al. This results in a graph in which the word senses are the vertices and the bonds are weighted edges. The graph has a natural interpretation as a Markov chain and can be solved as such. Once it is solved, the highest-ranked sense of each word is assumed to be the correct one.
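The steps above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the code in opencog/nlp/wsd: the sense labels, the toy similarity table, the default similarity of 0.05, the damping factor and the function names are all assumptions made for this example, and a real system would take its senses from WordNet and its scores from a similarity package.

```python
from itertools import combinations

# Hypothetical similarity scores between senses; unlisted pairs get a small default.
TOY_SIMILARITY = {
    frozenset(("bank#finance", "deposit#money")): 0.9,
    frozenset(("bank#finance", "loan#money")): 0.8,
    frozenset(("deposit#money", "loan#money")): 0.7,
    frozenset(("bank#river", "deposit#sediment")): 0.3,
}

def similarity(sense_a, sense_b):
    return TOY_SIMILARITY.get(frozenset((sense_a, sense_b)), 0.05)

def rank_senses(candidates, similarity, damping=0.85, iterations=50):
    """Score every (word, sense) vertex with a weighted PageRank iteration."""
    vertices = [(w, s) for w, senses in candidates.items() for s in senses]
    edges = {v: {} for v in vertices}
    for a, b in combinations(vertices, 2):
        if a[0] == b[0]:
            continue  # senses of the same word compete, so they are not bonded
        weight = similarity(a[1], b[1])
        if weight > 0:
            edges[a][b] = weight
            edges[b][a] = weight
    n = len(vertices)
    score = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Each neighbour u passes on its score in proportion to the bond weight.
        score = {
            v: (1 - damping) / n + damping * sum(
                score[u] * w / sum(edges[u].values())
                for u, w in edges[v].items())
            for v in vertices
        }
    return score

def pick_senses(candidates, score):
    """Keep the highest-scoring sense of each word."""
    return {w: max(senses, key=lambda s: score[(w, s)])
            for w, senses in candidates.items()}

candidates = {
    "bank": ["bank#finance", "bank#river"],
    "deposit": ["deposit#money", "deposit#sediment"],
    "loan": ["loan#money"],
}
scores = rank_senses(candidates, similarity)
chosen = pick_senses(candidates, scores)
print(chosen)
```

In this toy sentence the financial senses reinforce one another through their strong mutual bonds, so they collect more of the score than the river and sediment senses.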

At this time (2009), the Mihalcea algorithm represents more or less the state of the art in WSD processing.

Planned extensions

The graphical algorithm can be readily extended in a variety of ways, to further improve accuracy. Some of these extensions are given below.

Sense from syntax

Sometimes, the sense of a word can be determined, or at least, narrowed down, based on its syntactic usage. If this has been done, then new vertices are created to represent these "preferred senses". They can be linked by strong bonds to the possible word senses, thus providing additional votes. In addition, one may choose to remove the non-preferred senses from the graph, or, better yet, create strong "anti-" bonds to the non-preferred senses, so that they repel linked senses.
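A minimal sketch of the first option, an extra vertex bonded strongly to the preferred sense, might look as follows. Everything here is hypothetical: the sense labels, the bond weights, the "syntax-hint" vertex and the helper names are invented for the illustration, and removal of senses or repelling "anti-" bonds is not shown.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank on an undirected graph {vertex: {neighbour: weight}}."""
    n = len(edges)
    score = {v: 1.0 / n for v in edges}
    for _ in range(iterations):
        score = {
            v: (1 - damping) / n + damping * sum(
                score[u] * w / sum(edges[u].values())
                for u, w in neighbours.items())
            for v, neighbours in edges.items()
        }
    return score

def add_bond(edges, a, b, weight):
    """Add a symmetric weighted bond between two vertices."""
    edges.setdefault(a, {})[b] = weight
    edges.setdefault(b, {})[a] = weight

def add_syntactic_preference(edges, sense, weight=2.0):
    """Create a 'preferred sense' vertex that votes for the given sense."""
    hint = ("syntax-hint", sense)
    add_bond(edges, hint, sense, weight)
    return hint

BIRD, CROUCH, POND = "duck#bird", "duck#crouch", "pond#water"

def build_graph():
    edges = {}
    add_bond(edges, BIRD, POND, 0.5)    # assumed similarity scores
    add_bond(edges, CROUCH, POND, 0.4)
    return edges

# Without syntactic information, the context word favours the bird sense.
before = pagerank(build_graph())

# Suppose a parser reports that "duck" is used as a verb: vote for that sense.
graph = build_graph()
add_syntactic_preference(graph, CROUCH)
after = pagerank(graph)

print(before[BIRD] > before[CROUCH], after[CROUCH] > after[BIRD])
```

The hint vertex has no other neighbours, so all of the score it receives flows back to the preferred sense, which is what allows it to outweigh the slightly stronger bond between the context word and the competing sense.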

Reference resolution, anaphora resolution

Input from anaphora resolution algorithms can be used to create connections between references within a sentence or between sentences. The idea is to constrain meanings so that a given collection of references all carry the same meaning.
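One very simple way to impose such a constraint is sketched below. It is a stand-in for, not an implementation of, the graph connections described above: instead of adding bonds between coreferent mentions, it pools the sense scores already computed for each mention and assigns the winning sense to the whole group. The mention identifiers, the scores and the function name are assumptions made for the example.

```python
from collections import defaultdict

def pool_coreferent_senses(mention_scores, cluster):
    """Give every mention in a coreference cluster the same sense.

    mention_scores: {mention_id: {sense: score}}
    cluster: list of mention ids that refer to the same thing
    """
    pooled = defaultdict(float)
    for mention in cluster:
        for sense, score in mention_scores[mention].items():
            pooled[sense] += score
    best = max(pooled, key=pooled.get)
    return {mention: best for mention in cluster}

# Hypothetical per-mention scores for "bank" in two different sentences.
mention_scores = {
    ("sentence-1", "bank"): {"bank#finance": 0.30, "bank#river": 0.10},
    ("sentence-2", "bank"): {"bank#finance": 0.15, "bank#river": 0.20},
}
cluster = [("sentence-1", "bank"), ("sentence-2", "bank")]
resolved = pool_coreferent_senses(mention_scores, cluster)
print(resolved)
```

Taken alone, the second mention would be assigned the river sense; once the evidence from both mentions is pooled, both are assigned the financial sense.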

Bond generalization

There is no reason that the bonds connecting senses need to be just plain weighted edges. They can be replaced or supplemented by far more complex structures, such as the results of ontological reasoning. That is, if certain word senses can be ruled out, or confirmed, by means of reasoning, additional links can be added to vote strongly for the preferred, most likely senses.