Learning Semantic Similarities Between Syntactic Constructs via Pattern Mining and Statistical Analysis

From OpenCog


A problem in natural language (NL) comprehension is recognizing similar meanings among sentences phrased in different ways. Prepositions are a particularly tricky case here. For instance,

Bob grieves for his wife.
Bob cries over his wife.
Bob grieves his wife.

have somewhat similar meanings, whereas

Bob jumps over the lake.
Bob grieves over the lake.

don't have similar meanings at all.

One way to grapple with this problem is to use statistical analysis of an interpreted (via RelEx2Logic) text corpus to identify words and syntactic structures that tend to have similar meanings.

This process is a subset of the unsupervised language learning algorithm described in http://arxiv.org/abs/1401.3372 . However, it seems of value on its own, apart from the context of unsupervised language learning, simply as a method of creating categories and SimilarityLinks to be used in reasoning about Atoms derived from text via RelEx2Logic.

A Method for Deriving Semantic Similarities Between Syntactic Constructs

The method I'm thinking of is, roughly, as follows.

Define a linguistic-semantic (LS) pattern as a sub-hypergraph of the Atomspace whose nodes are all either: a) variables, or b) ConceptNodes referring to WordNodes, or to senses of WordNodes.

A simple example would be

EvaluationLink
    PredicateNode [1]
    ConceptNode [2]
    VariableNode [3]

ReferenceLink [1] (WordNode "eat")
ReferenceLink [2] (WordNode "person")

This LS pattern denotes, conceptually: Things that people eat.
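
As a rough illustration of the definition above, an LS pattern can be mimicked outside the Atomspace with plain Python objects. This sketch is not OpenCog code: the names Var, Word, LSPattern and match are hypothetical, and flattening a relation to a (predicate, argument, argument) tuple of strings is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class Var:
    """A variable slot (stands in for a VariableNode)."""
    name: str


@dataclass(frozen=True)
class Word:
    """A slot tied to a word (stands in for a node with a ReferenceLink to a WordNode)."""
    lemma: str


Slot = Union[Var, Word]


@dataclass(frozen=True)
class LSPattern:
    predicate: Slot
    args: Tuple[Slot, ...]

    def match(self, relation: Tuple[str, ...]) -> Optional[Dict[str, str]]:
        """Return the variable bindings if the relation fits the pattern, else None."""
        slots = (self.predicate,) + self.args
        if len(slots) != len(relation):
            return None
        bindings: Dict[str, str] = {}
        for slot, value in zip(slots, relation):
            if isinstance(slot, Word):
                if slot.lemma != value:
                    return None
            elif bindings.setdefault(slot.name, value) != value:
                # The same variable was already bound to a different value.
                return None
        return bindings


# "Things that people eat": predicate and first argument fixed, second argument free.
things_people_eat = LSPattern(Word("eat"), (Word("person"), Var("$X")))

print(things_people_eat.match(("eat", "person", "apple")))
print(things_people_eat.match(("eat", "gramophone", "apple")))
```

The first call binds $X to "apple"; the second returns None because the first argument is not "person".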

We can mine frequent LS patterns from the Atomspace using the frequent subhypergraph miner. For instance, in an Atomspace derived from parsing a typical corpus, a "Things that people eat" pattern will be frequent, whereas a "Things that gramophones eat" pattern will not.
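
To make the idea of frequency concrete, here is a toy stand-in for the mining step. It is not the frequent subhypergraph miner: it only handles flat (predicate, subject, object) tuples, generalizes one position at a time, and keeps patterns above a support threshold. The function names, the "$X" placeholder and the sample relations are all assumptions made for illustration.

```python
from collections import Counter

VAR = "$X"  # placeholder marking the variable slot (an assumption of this sketch)


def generalizations(relation):
    """Yield each pattern obtained by replacing exactly one position with a variable."""
    for i in range(len(relation)):
        yield relation[:i] + (VAR,) + relation[i + 1:]


def mine_frequent_patterns(relations, min_support):
    """Count how many relations instantiate each one-variable pattern."""
    counts = Counter()
    for relation in relations:
        for pattern in generalizations(relation):
            counts[pattern] += 1
    return {p: n for p, n in counts.items() if n >= min_support}


# Hand-written toy relations, not real RelEx2Logic output.
relations = [
    ("eat", "person", "bread"),
    ("eat", "person", "soup"),
    ("eat", "person", "apple"),
    ("eat", "gramophone", "needle"),
]

frequent = mine_frequent_patterns(relations, min_support=2)
print(frequent)
```

With this input only ("eat", "person", "$X") reaches the support threshold; the gramophone pattern is seen once and is dropped.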

We can then use these patterns to

  • divide words, and LS subgraphs, into semantic categories
  • estimate similarities between words, and between LS subgraphs

This would work as follows:

CATEGORIZATION

  • Assign two ConceptNodes to the same category if they occur surprisingly often in the same argument-slot of the same LS pattern, or of LS patterns belonging to the same category
  • Assign two LS patterns to the same category if they, surprisingly often, have ConceptNodes from the same category in a particular one of their argument slots (or in a particular two of their argument slots, etc.)
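
One possible reading of "surprisingly often" is a lift score: the number of argument slots two words share, divided by the number expected if the words picked their slots independently. The sketch below applies that reading to the first rule only (grouping words), and merges qualifying pairs with a small union-find. The lift measure, the thresholds and the toy observations are assumptions, not something the method above prescribes; the second rule would be the same computation with the roles of words and patterns swapped.

```python
from collections import defaultdict
from itertools import combinations


def slot_contexts(observations):
    """Map each word to the set of (pattern, slot) contexts it has filled."""
    ctx = defaultdict(set)
    for pattern, slot, word in observations:
        ctx[word].add((pattern, slot))
    return ctx


def lift(a, b, ctx, n_contexts):
    """Shared contexts divided by the number expected under independence."""
    shared = len(ctx[a] & ctx[b])
    expected = len(ctx[a]) * len(ctx[b]) / n_contexts
    return shared / expected


def categorize(observations, min_lift=1.5, min_shared=2):
    ctx = slot_contexts(observations)
    n_contexts = len(set().union(*ctx.values()))
    parent = {w: w for w in ctx}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(sorted(ctx), 2):
        shared = len(ctx[a] & ctx[b])
        if shared >= min_shared and lift(a, b, ctx, n_contexts) >= min_lift:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for w in ctx:
        groups[find(w)].add(w)
    return sorted(sorted(g) for g in groups.values())


# (pattern, argument slot, filler word): hand-written toy data.
observations = [
    ("grieve_for", 1, "husband"), ("grieve_for", 1, "wife"),
    ("cry_over", 1, "husband"), ("cry_over", 1, "wife"),
    ("jump_over", 1, "fence"), ("jump_over", 1, "river"),
    ("hop_over", 1, "fence"), ("fly_over", 1, "river"),
]

print(categorize(observations))
```

Here "husband" and "wife" share two of the five contexts, well above the 0.8 expected by chance, so they are merged; "fence" and "river" share only one context and stay apart under the min_shared threshold.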

SIMILARITY ESTIMATION

  • Increase the degree of IntensionalSimilarity between two ConceptNodes if they occur surprisingly often in the same argument-slot of the same LS pattern, or similar LS patterns
  • Increase the degree of ExtensionalSimilarity between two LS patterns to the extent that they have intensionally similar Atoms in one (or more) of their argument slots
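
The two rules can be sketched with simple set arithmetic. In the code below, word similarity is the Jaccard overlap of the patterns a word appears in, and pattern similarity averages, over the fillers of each pattern, the best word similarity found among the fillers of the other. Both formulas are stand-ins chosen for brevity rather than the PLN definitions of IntensionalSimilarity and ExtensionalSimilarity, and the data is hand-written.

```python
def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 for two empty sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def word_similarity(w1, w2, contexts):
    """Words are similar to the extent that they fill slots of the same patterns."""
    return jaccard(contexts[w1], contexts[w2])


def pattern_similarity(p1, p2, fillers, contexts):
    """Patterns are similar to the extent that their fillers are similar words."""
    def directed(src, dst):
        best = [
            max(word_similarity(w, v, contexts) for v in fillers[dst])
            for w in fillers[src]
        ]
        return sum(best) / len(best)

    return 0.5 * (directed(p1, p2) + directed(p2, p1))


# word -> patterns whose argument slot it fills (toy data)
contexts = {
    "husband": {"X complains", "grieve for X", "cry over X"},
    "wife": {"X whines", "grieve for X"},
    "fence": {"jump over X", "hop over X"},
}

# pattern -> words seen in its argument slot (toy data)
fillers = {
    "grieve for X": {"husband", "wife"},
    "cry over X": {"husband"},
    "jump over X": {"fence"},
}

print(word_similarity("husband", "wife", contexts))
print(pattern_similarity("grieve for X", "cry over X", fillers, contexts))
print(pattern_similarity("cry over X", "jump over X", fillers, contexts))
```

In a real system these scores would be accumulated as evidence over many observations rather than computed once from fixed sets.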

Both categorization and similarity estimation seem valuable here.


Conceptual Example

For example, suppose we have a large corpus containing sentences like:

She grieved her husband.
The woman grieved for her husband.
Bob grieved for his wife.
Husbands complain a lot.
Wives whine a tremendous amount.
Old ladies complain constantly.
Old ladies whine incessantly.
Sally cried over her husband.
Sally cried for her husband.
She is grieving for her dead cat.
She is crying over her dead cat.
He is jumping over the fence.
He jumped over the river.
He jumped over her dead frog.  
He hopped over the fence.
He flew over the river.
The pig ate the frog.
The goat ate the cat.
Ben loves pigs.
Ben really really loves goats.
…


You would expect the LS patterns corresponding to "grieved for" and "cried over" to receive a high similarity to each other, and potentially to be assigned to the same category, because they tend to take similar arguments (people tend to grieve for, and cry over, the same things). For example, "Sally cried over her husband" and "Bob grieved for his wife" would provide evidence for the similarity of "cried over" and "grieved for", because "husband" and "wife" would be recognized as similar. That similarity in turn comes from "Husbands complain" and "Wives whine", which count as evidence because "complain" and "whine" are themselves similar: old ladies both complain and whine, and so on.

On the other hand, one would expect that the LS pattern for "jumped over" would not get a high similarity to the LS pattern for "cried over." But there may be a little similarity, because you can both jump over a dead frog and cry over a dead cat; and frogs and cats are similar because they are both eaten by things that Ben loves.
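
The chain of evidence running from "complain"/"whine" through "husband"/"wife" to "cried over"/"grieved for" can be reproduced with a SimRank-style propagation on a bipartite graph of contexts and filler words. SimRank is swapped in here purely as an illustration; it is not the statistical analysis this page proposes. The (context, word) pairs are extracted by hand from a few of the corpus sentences, and the dead frog and dead cat sentences are left out, so in this toy "jump over" ends up with no link at all to "cry over".

```python
from collections import defaultdict
from itertools import product


def bipartite_simrank(pairs, decay=0.8, iterations=5):
    """Propagate similarity back and forth between contexts and the words filling them."""
    words_of = defaultdict(set)     # context -> filler words
    contexts_of = defaultdict(set)  # word -> contexts it fills
    for context, word in pairs:
        words_of[context].add(word)
        contexts_of[word].add(context)

    def identity(nodes):
        return {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}

    def update(neighbours, other_sim):
        # Similarity of a and b is the decayed mean similarity of their neighbours.
        new = {}
        for a, b in product(neighbours, repeat=2):
            if a == b:
                new[a, b] = 1.0
                continue
            total = sum(other_sim[x, y] for x in neighbours[a] for y in neighbours[b])
            new[a, b] = decay * total / (len(neighbours[a]) * len(neighbours[b]))
        return new

    word_sim = identity(contexts_of)
    ctx_sim = identity(words_of)
    for _ in range(iterations):
        # Both sides are updated from the previous round's values.
        word_sim, ctx_sim = update(contexts_of, ctx_sim), update(words_of, word_sim)
    return word_sim, ctx_sim


# Hand-extracted (context, filler) pairs; an assumption, not parser output.
pairs = [
    ("X complains", "husband"), ("X whines", "wife"),
    ("X complains", "old lady"), ("X whines", "old lady"),
    ("cry over X", "husband"), ("grieve for X", "wife"),
    ("jump over X", "fence"), ("hop over X", "fence"),
]

word_sim, ctx_sim = bipartite_simrank(pairs)
print(ctx_sim["X complains", "X whines"])
print(word_sim["husband", "wife"])
print(ctx_sim["cry over X", "grieve for X"])
print(ctx_sim["cry over X", "jump over X"])
```

"Cry over" and "grieve for" share no filler word in this data, yet they end up with a positive similarity after a few rounds, because the similarity of "complains" and "whines" first makes "husband" and "wife" similar. Adding the frog and cat sentences would open a weaker path between "jump over" and "cry over".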