Grammatical correction - GSoC 2008

Abstract

This proposal is to further develop RelEx and its use of Link Grammar. Currently, RelEx's interpretation of ungrammatical sentences (such as those often seen in chat rooms) is not entirely satisfactory. This project will investigate how Link Grammar can be better used to parse ungrammatical and idiosyncratic sentences, and thereby improve RelEx's analysis of such sentences.


Status

A second script is being developed to extract certain guessed words from Link Parser output and add them to a dictionary.

Problems noticed

1. The parser doesn't recognise: words that are supposed to be capitalised but aren't; abbreviations and slang that are not in the dictionary; and words from which apostrophes or necessary hyphens have been left out.

Examples:

whats the opposite of mercy

But noone knows the word milton ftw


2. Incomplete sentences are difficult, particularly when parts (like the subject!) are assumed from context. One-word sentences are also a problem.

Examples:

milton ftw

Brought her to the bowels of the ITEE building to study

okay


3. Long, run-on sentences

Example:

Three huge pillars of flame + 10-14 year old pyromaniacs + 10-14 year olds afraid of fire + one 12 year old who claimed to be afraid of fire but couldn't be pulled away from it and had a really scary look in her eyes = BAD, no matter the challenge


Generally good examples of parsing problems

These are some example sentences which are not grammatically complete, though they may be quite close to being so. Link Parser output is given for each.

whats the opposite of mercy

No complete linkages found.
Found 1 linkage (1 had no P.P. violations) at null count 6
        Unique linkage, cost vector = (UNUSED=7 DIS=0 AND=0 LEN=0)

[whats] [the] [opposite] [of] [mercy]


milton ftw

Found 1 linkage (1 had no P.P. violations)
        Unique linkage, cost vector = (UNUSED=0 DIS=0 AND=0 LEN=2)

    +----Wi----+----Os----+
    |          |          |
LEFT-WALL milton[?].v ftw[?].n

Although the parser does make guesses about the words and the link between them, the result is quite wrong: 'milton' is a noun, and 'ftw' ("for the win") is, I think, used somewhat like a verb.


and kept a horde of 10-14 year old girls from setting themselves on fire yesterday evening

No complete linkages found.
Found 8 linkages (8 had no P.P. violations) at null count 3
        Linkage 1, cost vector = (UNUSED=3 DIS=6 AND=0 LEN=35)

           +-----------------------Wdc----------------------+
           |                +---------------AN--------------+
           |                |           +---------AN--------+---------
           |                |           |      +-----AN-----+
    +--Wc--+                |           |      |     +---A--+--Mp--+--
    |      |                |           |      |     |      |      |
LEFT-WALL and [kept] [a] horde.n [of] 10-14 year.n old.a girls.n from


--------Sp-----------------+
    +--------K--------+    |
Mgp-+-----O----+      |    +--MVpn-+----TD---+
    |          |      |    |       |         |
setting.v themselves on fire.v yesterday evening.i

None of the linkages found uses the words 'kept' and 'a', and they all contain similar linking flaws.


Three huge pillars of flame + 10-14 year old pyromaniacs + 10-14 year olds afraid of fire + one 12 year old who claimed to be afraid of fire but couldn't be pulled away from it and had a really scary look in her eyes = BAD, no matter the challenge

No complete linkages found.
Found 61594 linkages (9 of 1000 random linkages had no P.P. violations) at null count 4
        Linkage 1, cost vector = (UNUSED=4 DIS=7 AND=1 LEN=92)

                 +---------------------------------------------------Sp------
                 |                   +---------------------------------------
                 |                   +--------------Rn--------------+
                 |                   |     +-----------AN-----------+
  +------Dmc-----+      +-----Jp-----+     |      +--------AN-------+
  |      +---A---+--Mp--+    +---AN--+     |      |     +-----A-----+-----Sp-
  |      |       |      |    |       |     |      |     |           |
three huge.a pillars.n of flame.n +[?].n 10-14 year.n old.a pyromaniacs[!].n


---------------------------------------------+
Bs--------------------------------------+    |
                                        |    |      +----------------Bs-------
   +---------Op---------+               |    |      +-------------R-----------
   |     +------Ah------+               |    |      +----------Ma---------+
---+     |      +---AN--+----Ma---+--OF-+    +--Os--+    +----NSn---+-Ytm-+
   |     |      |       |         |     |    |      |    |          |     |
+[?].v 10-14 year.n olds[!].n afraid.a of fire.v +[?].n one [12] year.p old.a


--------+
-+      |
 |      |      +------------------------------------SFsx-------------------
 +--RS--+--Ce--+-Ix+--Pa--+--OF-+-Jp-+
 |      |      |   |      |     |    |
who claimed.v to be.v afraid.a of fire.n but [couldn't] be.v pulled.v away


                +----------------------I*j---------------------+    +---------
                +-----------Os----------+                      |    |  +------
----------------+   +---------Ds--------+    +----Jp----+      |    |  |   +--
                |   |    +--EA--+---A---+-Mp-+   +--Dmc-+      +-Os-+  |   |
                |   |    |      |       |    |   |      |      |    |  |   |
from it [and] had.v a really scary.a look.n in her.d eyes.n =[?].v BAD , no.d


-----MX-------------+
------Xd------------+
--------Ds----------+
    +-------AN------+-----Xc----+
    |               |           |
matter.n [the] challenge.n RIGHT-WALL


                 +---------------------------------------------------Sp------
                 |                   +---------------------------------------
                 |                   +--------------Rn--------------+
                 |                   |     +-----------AN-----------+
  +------Dmc-----+      +-----Jp-----+     |      +--------AN-------+
  |      +---A---+--Mp--+    +---AN--+     |      |     +-----A-----+-----Sp-
  |      |       |      |    |       |     |      |     |           |
three huge.a pillars.n of flame.n +[?].n 10-14 year.n old.a pyromaniacs[!].n


---------------------------------------------+
Bs--------------------------------------+    |
                                        |    |      +----------------Bs-------
   +---------Op---------+               |    |      +-------------R-----------
   |     +------Ah------+               |    |      +----------Ma---------+
---+     |      +---AN--+----Ma---+--OF-+    +--Os--+    +----NSn---+-Ytm-+
   |     |      |       |         |     |    |      |    |          |     |
+[?].v 10-14 year.n olds[!].n afraid.a of fire.v +[?].n one [12] year.p old.a


--------+
-+      |      +------------------------------------SFsx-------------------
 |      |      |                                                 +----MVp--
 +--RS--+--Ce--+--------------------Ix--------------------+--Pv--+--MVp-+
 |      |      |                                          |      |      |
who claimed.v to be.v afraid.a of fire.n but [couldn't] be.v pulled.v away


                +----------------------I*j---------------------+    +---------
----------------+-----------Os----------+                      |    |  +------
--+             |   +---------Ds--------+    +----Jp----+      |    |  |   +--
  +-J-+         |   |    +--EA--+---A---+-Mp-+   +--Dmc-+      +-Os-+  |   |
  |   |         |   |    |      |       |    |   |      |      |    |  |   |
from it [and] had.v a really scary.a look.n in her.d eyes.n =[?].v BAD , no.d


-----MX-------------+
------Xd------------+
--------Ds----------+
    +-------AN------+-----Xc----+
    |               |           |
matter.n [the] challenge.n RIGHT-WALL

The parsing is not particularly improved by changing + to 'and' and = to 'makes'. (Of course, it sometimes takes a second reading for a human to parse sentences like this.)


Note: all examples are actual ones, largely from instant messaging conversations, not made up or altered in any way.

Ideas

Because dealing with many of these problems may well slow down or hinder the parsing of simpler, grammatically correct sentences, I expect it will be best to provide a sort of 'informal mode', which can be turned off when informal writing is not going to be processed.

New dictionary entries will need to be added for common slang and abbreviations. How these are to be dealt with is something that needs further consideration.

If a word is not found in the dictionary in its exact form, a case-insensitive lookup should be attempted. Something similar could be done for words that should have internal punctuation, but it may be more efficient to include the unpunctuated forms in the dictionary, perhaps marked as non-standard.
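
The case-insensitive fallback could look something like the sketch below. The word list is a toy stand-in for the real dictionary, and the function names `build_case_index` and `lookup` are hypothetical; nothing here reflects how Link Grammar actually stores or searches its dictionary.

```python
def build_case_index(dictionary_words):
    """Map each lower-cased form to the dictionary spellings that share it."""
    index = {}
    for word in dictionary_words:
        index.setdefault(word.lower(), []).append(word)
    return index


def lookup(word, dictionary_words, case_index):
    """Return candidate dictionary spellings for a word.

    The exact form is tried first; the case-insensitive fallback is only
    consulted when the exact form is missing.
    """
    if word in dictionary_words:
        return [word]
    return case_index.get(word.lower(), [])


# Toy word list standing in for the real dictionary (an assumption).
words = {"Milton", "mercy", "the"}
index = build_case_index(words)

print(lookup("milton", words, index))  # falls back to the capitalised entry
print(lookup("ftw", words, index))     # still unknown: needs a new entry
```

Trying the exact form first means that correctly written words pay no extra cost, which fits the idea of keeping the handling of informal text separate from normal parsing.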

When a subject is left out: if the entire conversation is being parsed, it may be possible to keep track of the main subjects of sentences and insert the previous one when a sentence lacks a subject; alternatively, a dummy subject could be used. The difficult part is telling when a subject has been left out.

Long, run-on sentences: split them into manageable chunks. Punctuation and conjunctions may be useful for this.
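
As a rough illustration of the chunking idea, the sketch below splits a sentence at commas, semicolons, the "+" and "=" symbols seen in the example above, and a few coordinating conjunctions. The choice of split points and the name `split_run_on` are assumptions of mine; note also that this naive version throws the conjunctions away, which a real implementation would probably want to keep.

```python
import re

# Split points: some punctuation, the "+" and "=" symbols, and a few
# coordinating conjunctions matched as whole words. This list is an assumption.
SPLIT_PATTERN = re.compile(r"\s*(?:[,;+=]|\b(?:and|but|or)\b)\s*")


def split_run_on(sentence):
    """Split a long sentence into chunks at punctuation and conjunctions."""
    chunks = SPLIT_PATTERN.split(sentence)
    # Drop empty chunks, e.g. when the sentence starts with a conjunction.
    return [chunk for chunk in chunks if chunk]


sentence = ("Three huge pillars of flame + 10-14 year old pyromaniacs "
            "= BAD, no matter the challenge")
for chunk in split_run_on(sentence):
    print(chunk)
```

Each chunk could then be passed to the parser separately, although chunks such as "BAD" would still be one-word sentences, with the problems noted earlier.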

Punctuation correction

Several very common words are frequently mis-punctuated in informal typing. The pronoun "I" is often left uncapitalised, and apostrophes are often left out of contractions and possessives.

A basic script has been written that will correct certain of these issues in a given text.

What the script corrects

In choosing which issues to correct, the aim has been to start with those that are most common and most harmful to parsing when unrecognised, and to correct only things that are unambiguously wrong. The initial corrections are therefore to such mis-punctuations as "i" for "I", "dont" for "don't" and "theyre" for "they're". I consider pronoun-verb contractions (especially with the verb "to be", such as "they're") and "-n't" verb contractions (such as "aren't") to be the contractions that are most commonly mis-punctuated and that create significant parsing problems, so these are the ones that have been dealt with first.

There is plenty of room and scope to add additional words for the script to correct, whether individual words or new categories of words (such as other contractions, or perhaps certain words and prefixes that require hyphenation).

What the script does not correct

There are several particular words, and several classes of words, that have deliberately been left uncorrected. These are issues that would be better dealt with by the parser, as they require knowledge of word type and context.

Words such as "she'll", "we're" and "she'd", when written without the apostrophe, cannot be reliably identified as mis-punctuated contractions by a script such as this, as they are then spelt identically to other words ("shell", "were", "shed").

No words that should end in "'s" are corrected (this includes both possessives and contractions of "is"), as most of these could not be reliably told apart from plurals (or, in the case of "it's", from the possessive "its"). There are some cases where a missing apostrophe could be identified with certainty, mostly for words that pluralise irregularly (e.g. "childs" could only be a mis-punctuation of "child's"), but these have not yet been dealt with. There are other words where a "-'s" form is very common but a plural is uncommon (e.g. "whats", where "what's" is far more commonly intended than the plural "whats"). Whether a common mistake should be corrected at the expense of creating a new, uncommon mistake is an interesting question.

Mis-capitalisation is a similarly interesting issue. It would be fairly easy to ensure that no letter other than the initial one is capitalised, but this would make abbreviations and proper nouns that do contain internal capitals incorrect. Some thought has been given to this issue, and I've decided that for now the script will not correct capitalisation other than for the word "I", though that may change at some point.

Example sentences

Here are some examples of how corrections made by the script can improve Link Parser output.

Original sentence: im still here

No complete linkages found.
Found 1 linkage (1 had no P.P. violations) at null count 1
        Unique linkage, cost vector = (UNUSED=1 DIS=0 AND=0 LEN=4)

    +---Wi---+------MVp-----+
    |        |              |
LEFT-WALL im[?].v [still] here

Corrected sentence: I'm still here

Found 2 linkages (2 had no P.P. violations)
        Linkage 1, cost vector = (UNUSED=0 DIS=0 AND=0 LEN=4)

 +-SX+-Pa-+--MVp-+
 |   |    |      |
I.p 'm still.a here

Original phrase: so i dont want to practise

No complete linkages found.
Found 2 linkages (2 had no P.P. violations) at null count 2
        Linkage 1, cost vector = (UNUSED=2 DIS=1 AND=0 LEN=9)

                  +---------MVi---------+
    +------Wi-----+---Os--+             +---I--+
    |             |       |             |      |
LEFT-WALL [so] i[?].v dont[?].n [want] to practise.v

Corrected phrase: so I don't want to practise

Found 1 linkage (1 had no P.P. violations)
        Unique linkage, cost vector = (UNUSED=0 DIS=0 AND=0 LEN=6)

    +--Wc--+Wd+Sp*i+--I*d-+-TO-+---I--+
    |      |  |    |      |    |      |
LEFT-WALL so I.p don't want.v to practise.v

Known issues

There are two known cases in which correcting a very frequent mistake will occasionally introduce a new, much rarer mistake. They are as follows:

  • Where "im" or "Im" is used, and is intended to be an abbreviation, such as of "instant messaging", it will be incorrectly changed to "I'm".
  • Where "i" is not intended to be the pronoun, but is instead a mathematical or other kind of variable, it will be incorrectly interpreted as the pronoun.

In both cases, the mis-punctuation causes severe parsing issues, and it is so common (and the alternative so uncommon) that I believe the changes are worth making. However, the capitalisation of "i" will cause more problems than it fixes in certain contexts, such as more formal writing, where the pronoun will generally already be capitalised, and highly mathematical writing.

It may be the case that other words I have interpreted as mis-punctuated contractions are in fact words or abbreviations unknown to me (I did check the dictionary for any that looked like they might be words, but abbreviations are harder to determine).

Script structure

The script is written in Python. It currently takes an input text file and produces a new, corrected text file; this will change once the best way to use it has been determined.

The currently corrected word types are:

  • "-n't" contractions (or "- not", e.g. "isn't", "shouldn't"). Deals with all such words I could find in the English dictionary the link parser uses.
  • "-'d" contractions (of "- would" or "- had", e.g. "he'd", "what'd"). Only deals with pronouns (personal and relative/interrogative) so far.
  • Contractions of "to be" (e.g. "they're") and of "will" (e.g. "they'll"). "I'm" is dealt with, but specially. Only deals with pronouns (personal and relative/interrogative) so far.
  • Contractions of "to have" (i.e. "they've" etc.). Contractions of "had" are dealt with along with other "-'d" contractions. Contractions of "has", like all "-'s" contractions, are not dealt with.
  • Capitalisation of various forms of "I" (e.g. "I'm" etc.).
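
The overall approach can be illustrated with a table-driven sketch like the one below. This is not the actual script: the table holds only a handful of entries chosen for illustration, the names `CONTRACTIONS`, `correct_word` and `correct_text` are hypothetical, and it works on a string rather than on input and output files.

```python
import re

# Small illustrative table; the real script's word lists are not reproduced here.
CONTRACTIONS = {
    "dont": "don't",
    "isnt": "isn't",
    "arent": "aren't",
    "shouldnt": "shouldn't",
    "theyre": "they're",
    "theyve": "they've",
    "im": "I'm",  # handled specially: the result is always capitalised
    "i": "I",
}

# Runs of letters only, so "i'm" is seen as "i" followed by "'m".
WORD_PATTERN = re.compile(r"[A-Za-z]+")


def correct_word(match):
    """Return the corrected form of one matched word, or the word unchanged."""
    word = match.group(0)
    fixed = CONTRACTIONS.get(word.lower())
    if fixed is None:
        return word
    # Preserve a leading capital, e.g. "Dont" -> "Don't".
    if word[0].isupper():
        fixed = fixed[0].upper() + fixed[1:]
    return fixed


def correct_text(text):
    """Apply the table of corrections to every word in the text."""
    return WORD_PATTERN.sub(correct_word, text)


print(correct_text("so i dont want to practise"))
print(correct_text("im still here"))
```

Ambiguous forms such as "whats", "shell" or "were" are simply absent from the table, so they pass through unchanged. Like the real script, this sketch capitalises every standalone "i", including the rare cases where it is not the pronoun.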


Guessed dictionary script

A simple script has been written which identifies guessed words in Link Parser output, and adds certain of these words to a dictionary.

Guessed word selection

Not all words which the Link Parser has guessed a type for will be added to the dictionary. A word guess will only be kept if throughout the entire given Link Parser output that word is always guessed to be the same type of word. If a word occurs only once, it will be kept. If it occurs twice, once guessed as a noun, and once guessed as a verb (for example), it will not be kept.
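
The selection rule can be sketched as follows. The sketch assumes that guessed words appear in the output in the form `word[?].x`, as in the examples on this page, where `x` is a one-letter type such as `n` or `v`; the pattern and the name `consistent_guesses` are my own, and symbols such as `+[?].n` are deliberately skipped.

```python
import re
from collections import defaultdict

# A guessed word as it appears in the linkage diagrams, e.g. "ftw[?].n".
# Only tokens that start with a letter are considered (an assumption).
GUESS_PATTERN = re.compile(r"([A-Za-z][\w'-]*)\[\?\]\.([a-z])")


def consistent_guesses(parser_output):
    """Return {word: type} for words that are always guessed as the same type."""
    seen = defaultdict(set)
    for word, word_type in GUESS_PATTERN.findall(parser_output):
        seen[word].add(word_type)
    # Keep a word only if every guess for it agreed on a single type.
    return {word: next(iter(types))
            for word, types in seen.items()
            if len(types) == 1}


sample_output = """
LEFT-WALL milton[?].v ftw[?].n
LEFT-WALL milton[?].n rocks.v
LEFT-WALL im[?].v [still] here
"""
print(consistent_guesses(sample_output))
```

In this made-up sample, "milton" is guessed once as a verb and once as a noun, so it is dropped, while "ftw" and "im" each occur once and are kept.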

Script structure

This script is also written in Python. It takes a text file of Link Parser output, and creates or adds to a dictionary file specifically for guessed words.