Unsupervised Language Learning Experimentation


A general approach to unsupervised learning of language (grammar and aspects of semantics) from large text corpora was given in the paper http://arxiv.org/abs/1401.3372 by Linas Vepstas and Ben Goertzel.

This page gathers some notes on experimentation along the lines of that paper. Originally written by Ben Goertzel, March 2015.


Some Partially-Baked Thoughts, March 2015

Here are some more recent musings on approaches to unsupervised language learning, in the spirit of the above-linked document on the topic by Linas Vepstas and myself, but a bit different in some of the details.

This is all still very blue-sky-ish at this point, so it’s hard to tell which detailed approach will work best.

The approach I will sketch here has four stages, which may be repeated:

  • Stage 1: Automated learning of tags via clustering words based on feature vectors
  • Stage 2: Automated learning of link grammar entries via MOSES
  • Stage 3: Automated learning of RelEx type rules via pattern mining
  • Stage 4: Automated learning of RelEx2Logic type rules via pattern mining relative to a corpus of logical interpretations of example sentences, obtained via a nonlinguistic source

Each of stages 2-4 can be used to generate new, enriched feature vectors for words, thus allowing Stage 1 to be re-done and the subsequent stages to be run again.

STAGE 1: Automated learning of tags via word clustering

A reasonable stab at this first phase may be obtained by imitating the paper

http://www.aclweb.org/anthology/P10-2040

The basic idea is to alternate k-means clustering with dimension reduction. They use SVD for dimension reduction, which is not a terrible idea, but it may not matter much which dimension-reduction algorithm one uses (I guess ultimately both SVD and OpenCog's dimensional embedding algorithm should be experimented with).
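
To make this a bit more concrete, here is a rough Python sketch of alternating SVD-based dimension reduction with k-means, using scikit-learn. The co-occurrence features and the way cluster labels are fed back into the features are simplifications chosen for illustration; the function names and parameter values are placeholders, not taken from the paper.

  # Rough sketch: alternate SVD dimension reduction with k-means clustering.
  # The feature construction and feedback scheme here are illustrative choices,
  # not the exact procedure from the paper.

  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.decomposition import TruncatedSVD

  def build_cooccurrence(sentences, vocab):
      """Count, for each word, which words occur immediately before/after it."""
      index = {w: i for i, w in enumerate(vocab)}
      mat = np.zeros((len(vocab), 2 * len(vocab)))
      for sent in sentences:
          for i, w in enumerate(sent):
              if w not in index:
                  continue
              if i > 0 and sent[i - 1] in index:              # left neighbor
                  mat[index[w], index[sent[i - 1]]] += 1
              if i + 1 < len(sent) and sent[i + 1] in index:  # right neighbor
                  mat[index[w], len(vocab) + index[sent[i + 1]]] += 1
      return mat

  def induce_tags(sentences, vocab, n_tags=50, n_dims=100, n_rounds=3):
      """Alternate dimension reduction (SVD) with k-means, feeding the cluster
      assignments back into the feature matrix on each round."""
      features = build_cooccurrence(sentences, vocab)
      labels = None
      for _ in range(n_rounds):
          svd = TruncatedSVD(n_components=min(n_dims, features.shape[1] - 1))
          reduced = svd.fit_transform(features)
          labels = KMeans(n_clusters=n_tags, n_init=10).fit_predict(reduced)
          features = np.hstack([features, np.eye(n_tags)[labels]])
      return labels  # one induced tag (cluster id) per vocabulary word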

A Variation

Another thought I had (based on some discussions with Aaron) is that, instead of building context vectors based on which words occur IMMEDIATELY before and after word w in a sentence, maybe we should try building the vectors based on which words occur ANYWHERE before and after word w in a sentence. This may coincide better with link grammar, conceptually. In this case the feature vector characterizing word w would be the concatenation of Vw1 and Vw2, where

Vw1 = a vector in which the entry for word v indicates P(v occurs before w in sentence S | w occurs in sentence S)

Vw2 = a vector in which the entry for word v indicates P(v occurs after w in sentence S | w occurs in sentence S)

Note that

P(v occurs before w in sentence S | w occurs in sentence S) = probability that v is a potential target for a left-going link from w

P(v occurs after w in sentence S | w occurs in sentence S) = probability that v is a potential target for a right-going link from w

To get fancy we could weight these and do something like

Vw1 = a vector in which the entry for word v indicates: 
SUM_k{ W(k) * P(v occurs k words before w in sentence S | w occurs in sentence S) }

for appropriately normalized decreasing W(k)

Vw2 = similar

The Stage 1 procedure, which uses k-means clustering on dimensionally-reduced vectors, could then be applied to these feature vectors.
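
As a rough illustration, here is a Python sketch of how the Vw1 / Vw2 feature vectors, with the optional distance weighting W(k), might be computed from a tokenized corpus. It approximates the per-sentence conditional probabilities by per-occurrence averages, and the names are just placeholders.

  # Sketch of the "anywhere before / anywhere after" feature vectors Vw1, Vw2,
  # with the optional distance weighting W(k).  Per-sentence conditional
  # probabilities are approximated by per-occurrence averages for simplicity.

  from collections import defaultdict

  def anywhere_vectors(sentences, vocab, weight=lambda k: 1.0):
      """Return, per word w, the concatenation of Vw1 (words before w) and
      Vw2 (words after w); pass e.g. weight=lambda k: 1.0 / k to downweight
      distant co-occurrences."""
      index = {v: i for i, v in enumerate(vocab)}
      n = len(vocab)
      before = defaultdict(lambda: [0.0] * n)   # accumulators for Vw1
      after = defaultdict(lambda: [0.0] * n)    # accumulators for Vw2
      occurrences = defaultdict(int)

      for sent in sentences:
          for i, w in enumerate(sent):
              if w not in index:
                  continue
              occurrences[w] += 1
              for j, v in enumerate(sent):
                  if j == i or v not in index:
                      continue
                  k = abs(i - j)                # distance in words
                  if j < i:
                      before[w][index[v]] += weight(k)
                  else:
                      after[w][index[v]] += weight(k)

      return {w: [x / c for x in before[w]] + [x / c for x in after[w]]
              for w, c in occurrences.items()}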

STAGE 2: Automated learning of link grammar entries via MOSES

Now I’m going to explain how I believe we can use MOSES for learning link grammar dictionary entries, given a tagged corpus. Initial experimentation on this could be done with a corpus tagged via some standard, available tagger. Then further experimentation could be done using tags that were learned via unsupervised learning (e.g. clustering).

FIRST PASS

In the simplest approach, there would be a separate MOSES learning run for each tag.

The MOSES program trees would have

  • leaf nodes denoting syntactic category labels
  • internal nodes denoting LG-disjunction or (plain old) conjunction, with costs associated with them (assume a cost of 0 is a good match, and a cost of 1 is maximal)
  • a fixed “conjunct of disjuncts” format

We would perhaps need to make custom Reduct rules for the LG-disjunction, to take into account the fact that it’s noncommutative.

The fitness function would comprise a behavioral score and an Occam penalty. We should experiment with both multiplicative and additive penalties.

The behavioral score would assess the average, over all sentences S in which the tag T occurs, of the degree to which the program tree P finds a match in S. The degree to which P finds a match in S is calculated as

1 - discord(P,S)

where

discord(P,S) = the p-power average, over all conjuncts C in P, of discord(C,S)

[as p gets larger, this approaches the maximum discord over any conjunct]

where

discord(C,S) = the minimum cost associated with any term t in C that has a match in the sentence S

The Occam penalty term would be zero for a program of infinite size, and a positive constant c for a program of minimal size (2). Both linear and sigmoidal penalties may be worth experimenting with.
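
To pin the fitness function down a bit, here is a Python sketch of the first-pass behavioral score and discord calculation. The program representation (a list of conjuncts, each a list of (term, cost) alternatives) and the term_matches predicate are assumptions made for illustration; they are not an existing MOSES interface.

  # Sketch of the first-pass fitness.  A program is taken to be a list of
  # conjuncts, each conjunct a list of (term, cost) alternatives; term_matches
  # is a caller-supplied predicate saying whether a term can find a target in
  # a sentence.  All of this representation is assumed, not an existing API.

  def conjunct_discord(conjunct, sentence, term_matches):
      """discord(C,S): minimum cost over terms in C that match S (1.0 if none match)."""
      costs = [cost for term, cost in conjunct if term_matches(term, sentence)]
      return min(costs) if costs else 1.0

  def discord(program, sentence, term_matches, p=4.0):
      """p-power average of per-conjunct discords; large p approaches the maximum."""
      ds = [conjunct_discord(c, sentence, term_matches) for c in program]
      if not ds:
          return 1.0
      return (sum(d ** p for d in ds) / len(ds)) ** (1.0 / p)

  def behavioral_score(program, tag, tagged_sentences, term_matches):
      """Average of 1 - discord(P,S) over the sentences S containing the tag."""
      relevant = [s for s in tagged_sentences if tag in s]
      if not relevant:
          return 0.0
      return sum(1.0 - discord(program, s, term_matches) for s in relevant) / len(relevant)

  def fitness(program, tag, tagged_sentences, term_matches, c=0.1):
      """Behavioral score plus an additive Occam term: roughly c for a minimal
      program (size 2), tending to zero as the program grows."""
      size = max(2, sum(len(conj) for conj in program))
      return behavioral_score(program, tag, tagged_sentences, term_matches) + c * 2.0 / size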

SECOND PASS

Once the first pass has been done and we have dictionary entries learned for every word, then we can try a second pass.

Here, things are done similarly to the first pass, but the behavioral score of the program P for the tag T assesses the average, over all sentences S in which the tag T occurs, of the degree to which the program tree P finds a match in S, where matching is now calculated differently. Here matching is calculated in terms of the cost of the minimum-cost legal parse found for the sentence, using the program P as the dictionary entry for the tag T, and using the first-pass dictionary entries for the other tags in the sentence.

Note that the second pass could be tested initially by using the hand-coded link grammar dictionary in place of a learned first-pass dictionary; this would test the second-pass learning approach on its own.
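
A very rough sketch of the second-pass behavioral score is below. The min_cost_parse function stands in for a link-grammar parser that returns the cost of the cheapest legal parse under a given dictionary (or None if no legal parse exists); it is a hypothetical interface, not an existing API.

  # Sketch of the second-pass behavioral score.  min_cost_parse is a
  # hypothetical stand-in for a link-grammar parser returning the cost of the
  # cheapest legal parse under a given dictionary, or None if none exists.

  def second_pass_score(program, tag, tagged_sentences, first_pass_dict, min_cost_parse):
      """Average match degree over sentences containing the tag, with matching
      derived from the cost of the cheapest legal parse that uses `program` as
      the entry for `tag` and first-pass entries for all other tags."""
      relevant = [s for s in tagged_sentences if tag in s]
      if not relevant:
          return 0.0
      total = 0.0
      for s in relevant:
          trial_dict = dict(first_pass_dict)
          trial_dict[tag] = program                 # substitute the candidate entry
          cost = min_cost_parse(s, trial_dict)
          total += 0.0 if cost is None else 1.0 / (1.0 + cost)  # assumed cost-to-match mapping
      return total / len(relevant)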

THIRD PASS

A third pass could be done, identical to the second pass but using the first-pass dictionary entries as the basis for match calculation.

VARIATION: RULES FOR WORDS

The above process could be done for each word rather than for each tag. The rules learned (the programs P evolved by MOSES) would still have leaves involving tags, but a separate program P would be learned for each word.

Potentially, the rule learned for the tag T to which the word W has been assigned could be used to seed the learning in the MOSES run for learning W’s rule. In this way W’s rule would be learned as a variation on the rule for the tag T.

VARIATION: RULES INVOLVING WORDS AS FEATURES

In this variation, when assessing the match of a program P for a word W in a sentence S, attention is paid to the other words in the sentence S, not just to the tags of the other words in the sentence S.

For this variation, one would expand the MOSES program trees in use, allowing them to have words as well as tags at the leaves.

One would then seed the MOSES run for a word, using the best model learned for that word using only tags as features. Depending on the Occam penalty and other parameters, new models may be learned that include specific words as well as tags as features.

FURTHER FEEDBACK

Of course, once one has a grammar, one can associate words with feature vectors whose entries are (link type, word) pairs drawn from parses of a corpus. One can then cluster these feature vectors, obtaining new word categories and a new tagging. Given this new tagging, all the above stages may be repeated. Then new feature vectors may be obtained again, etc.
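
A small Python sketch of this feedback step: build per-word feature vectors whose entries are (link type, word) pairs seen in parses, which can then be handed back to the Stage 1 clustering. The triple-based parse format is an assumed representation, not an existing OpenCog API.

  # Sketch of the feedback step: per-word features over (link type, word) pairs
  # from a parsed corpus.  Each parse is assumed to be a list of
  # (left word, link type, right word) triples; this format is illustrative.

  from collections import Counter, defaultdict

  def link_feature_vectors(parsed_sentences):
      """Map each word to a Counter over (link type, other word) features."""
      feats = defaultdict(Counter)
      for parse in parsed_sentences:
          for left, link, right in parse:
              feats[left][(link, right)] += 1
              feats[right][(link, left)] += 1
      return feats  # vectorize (e.g. sklearn DictVectorizer) and re-cluster as in Stage 1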

STAGE 3: Automated learning of RelEx type rules via pattern mining

Given a collection of link parses formed by parsing a corpus via a link grammar dictionary, one can automate the formation of relationships that loosely resemble the current RelEx output. Basically one can perform pattern mining on the corpus of parsed sentences and find common combinations of linkages (e.g. x -S- y -O- z). These combinations can then be assigned feature vectors based on what tags or words tend to link to them. If two different combinations, each involving k words, tend to have very similar feature vectors, then they can be associated with the same k-ary relationship. In this manner RelEx-type relationships can be extracted unsupervised from a link parse corpus.
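
Here is a rough Python sketch of this mining step. It counts only chains of two links (x -l1- y -l2- z) for simplicity, and the triple-based parse format is again an assumed representation; the resulting slot-filler profiles are the feature vectors one would compare between combinations.

  # Sketch of Stage 3 mining: count chains of two links (a -l1- b -l2- d) in a
  # parsed corpus and record which words fill each chain's slots.  Only
  # length-2 chains are mined here, and the parse format is assumed.

  from collections import Counter, defaultdict

  def mine_link_combinations(parsed_sentences, min_count=5):
      combo_counts = Counter()
      slot_fillers = defaultdict(Counter)
      for parse in parsed_sentences:
          for (a, l1, b) in parse:
              for (c, l2, d) in parse:
                  if b == c and (a, l1, b) != (c, l2, d):   # chain a -l1- b -l2- d
                      combo = (l1, l2)
                      combo_counts[combo] += 1
                      slot_fillers[combo].update([a, b, d])
      return {combo: slot_fillers[combo]
              for combo, n in combo_counts.items() if n >= min_count}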

Of course, the RelEx-type relationships associated with a word or tag can then be used as features for that word or tag, to be fed back into (Stage 1) clustering.

STAGE 4: Automated learning of RelEx2Logic type rules

To learn the mapping of RelEx-type relationships into PLN-type relationships, as RelEx2Logic accomplishes currently (in limited cases), it seems that pure information extraction from linguistic corpus data may not be enough. For this stage it may be necessary to correlate simple linguistic expressions with logical expressions obtained from some non-linguistic source.

Given a corpus of (phrase or sentence, logical expression) pairs, one can perform pattern mining by searching over all small combinations C of RelEx relations and, for each such combination C, mining the set E of logical expressions corresponding to phrases or sentences containing C, to see what the common elements of E are. These common elements can be taken as the output of the RelEx2Logic rule for C.
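
A rough Python sketch of this mining step follows. Each corpus item is taken to be a pair (relex_relations, logic_elements), both represented as sets of strings, and only singleton and pair combinations are enumerated; the representation and the thresholds are illustrative assumptions, not a fixed design.

  # Sketch of the Stage 4 mining step over (RelEx relations, logic elements) pairs.
  # Representations and thresholds are illustrative assumptions.

  from collections import defaultdict

  def mine_relex2logic_rules(pairs, min_support=3, min_share=0.8):
      """For each small combination C of RelEx relations (singletons and pairs
      here), collect the logical expressions of items containing C and keep the
      elements shared by at least min_share of them."""
      by_combo = defaultdict(list)
      for relations, logic in pairs:
          rels = sorted(relations)
          for i, r in enumerate(rels):
              by_combo[(r,)].append(logic)
              for s in rels[i + 1:]:
                  by_combo[(r, s)].append(logic)

      rules = {}
      for combo, logics in by_combo.items():
          if len(logics) < min_support:
              continue
          counts = defaultdict(int)
          for logic in logics:
              for element in logic:
                  counts[element] += 1
          common = {e for e, n in counts.items() if n / len(logics) >= min_share}
          if common:
              rules[combo] = common   # candidate output of the RelEx2Logic rule for combo
      return rules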

Of course, the PLN-type relationships associated with a word or tag can then be used as features for that word or tag, to be fed back into (Stage 1) clustering.