# Language learning

Learning a natural language, without supervision, is one of the AGI challenges. A proposal for how this could be done is available on ArXiv, as Learning Language from a Large (Unannotated) Corpus. There is a currently running project (in 2014) to implement this proposal in OpenCog. This page describes the current implementation and status. We hope the details below are sufficient for you to understand and contribute to this project in a meaningful way. Feel free to submit questions to the OpenCog mailing list.

The code is located in the directory /opencog/nlp/learn

Understanding the 'big picture' requires extensive reading. The following is a reading list. Pretty much everything on it is mandatory, and it is probably best to read in the order given below.

Understanding Link Grammar is central and primary. The Wikipedia article provides a reasonable introduction. The system documentation provides some additional detail. There is also the primary website. However, the original articles are the most important to read:
• Daniel Sleator and Davy Temperley. 1991. Parsing English with a Link Grammar. Carnegie Mellon University Computer Science technical report CMU-CS-91-196, October 1991. PDF
• Daniel Sleator and Davy Temperley. 1993. Parsing English with a Link Grammar. Third International Workshop on Parsing Technologies. (This is a shorter but more up-to-date version of the technical report above. It contains a better introduction to link grammars, and gives a more detailed description of the relationship between link grammar and other formalisms.) PDF
The take-away concepts: Link Grammar performs parsing by starting with a dictionary or lexis containing lists of words. Each such collection of words is associated with a collection of disjuncts. In turn, each disjunct is a sequence of connectors, with the connectors determining how a word may connect, or link, to another word. Disjuncts should be visualized as jigsaw-puzzle pieces, with a word printed on the puzzle piece. A sentence parses if and only if the jigsaw pieces snap together. The jigsaw concept appears in other places: for example, this pop-sci article.
The use of the word "quantum" is not accidental; it is related to type theory and linear logic; see "supplemental reading" below.
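The jigsaw metaphor can be made concrete with a toy sketch. This is not the real Link Grammar algorithm, and the stack-based matching below is a simplification of how connectors actually mate; the words, labels and the function name are invented for illustration. Each word carries one disjunct, i.e. an ordered list of labelled connectors, and a sentence "snaps together" if every connector finds a partner without any links crossing:

```python
# Toy "jigsaw" check: each word has one disjunct, a list of connectors
# like "S+" (must link rightward) or "S-" (must link leftward).
# Treating links like nested brackets gives a simple stack-based
# planarity check -- a simplification of real Link Grammar matching.

def snaps_together(disjuncts):
    """disjuncts: one connector list per word, left to right."""
    stack = []  # right-pointing connectors not yet satisfied
    for connectors in disjuncts:
        for c in connectors:
            label, direction = c[:-1], c[-1]
            if direction == '-':
                # must mate with the most recent open right-connector
                if not stack or stack[-1] != label:
                    return False
                stack.pop()
            else:
                stack.append(label)
    return not stack  # every connector must be used

# "the dog ran":  the --D-- dog --S-- ran
print(snaps_together([['D+'], ['D-', 'S+'], ['S-']]))   # True
print(snaps_together([['D+'], ['S+', 'D-'], ['S-']]))   # False: bad order
```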

### Mutual Information and MST Parsing

A basic grammar can be discovered by computing the mutual information between word-pairs, and then using a Maximum Spanning Tree (MST) to obtain an un-labelled parse. There are many different papers and resources on these topics, but perhaps the best and most readable is Deniz Yuret's original PhD thesis. Understanding the ideas here is centrally important.
Secondary reading involves studying various maximum entropy principles.
The Yuret MST parse of a sentence is a dependency parse, and thus resembles a link-grammar parse, except that the links are unlabelled. A big part of the current project is the automated conversion of these unlabelled MST parses into labelled disjuncts. A big part of the "magic" is that this extraction improves the accuracy; MST parses are well-known for being OK, but not great. Anyway, Yuret does NOT explain this, because he did not know; the disjunct extraction process is new and specific to this project.

### Synonyms, Antonyms

Given a dependency parse, Dekang Lin explains how to discover synonyms. Poon and Domingos generalize the process to the discovery of synonymous phrases.
• Dekang Lin and Patrick Pantel. Dirt: Discovery of inference rules from text. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’01), pages 323–328. ACM Press, 2001.
• Hoifung Poon and Pedro Domingos. Unsupervised Semantic Parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1–10, Singapore, August 2009. Association for Computational Linguistics.
Other writings by Lin can inform a bit more.

### Semantics from the MTT perspective

The ultimate goal is to extract meaning from text. MTT, or Meaning Text Theory (Wikipedia) offers an insightful framework for this. Aside from the DSyntR structures, and the concepts of "rheme" and "theme", another very important concept is that of the lexical function.
• James Steele, editor. Meaning-Text Theory: Linguistics, Lexicography, and Implications. University of Ottawa Press, 1990.
Then there is this 410-page monster; it is supplementary reading; I haven't read it myself.

### Semantics from other perspectives

While MTT gives a general structural means to think about meaning, there are other illuminating ways to think about it as well. One is given below:
The above is particularly interesting and important if one realizes that WordNet and PageRank, which are central to the above, can be replaced by something like PLN: one performs short-distance reasoning between phrases, N-grams or sentence fragments, then re-assembles these local reasoning snippets, and selects the most likely global interpretation for a sentence. It is here that the concepts of logic, reasoning, memory and knowledge start becoming important.

• Type theory is important, because the Link Grammar connectors, links and disjuncts are best understood as types.
• The first 2-3 chapters of the Homotopy Type Theory book give a good overview of types, and what they are.
• John Baez, Mike Stay, Physics, Topology, Logic and Computation: A Rosetta Stone (2009) gives a fairly direct overview of category theory and monoidal categories. It does not talk about natural language, but it does provide the bridge as to why "quantum" sometimes shows up in natural language. The thing to look for here is the "tensor monoid" or monoidal category. This category just-so happens to have the same structure as Link Grammar, in that the disjuncts occurring in the lexical entries of the Link Grammar dictionary are described by linear logic. There are additional connections to proof theory, modal logic, and intuitionistic logic.
• The pregroup grammar provides the most direct, immediate and concrete articulation of the connection between monoidal categories and natural language. Note that pregroup grammars are more-or-less the same thing as Link Grammar, modulo details that kind-of don't matter. Some of the papers cited there provide more.

The current software system requires understanding the following computer technologies:
• It's critical to understand scheme, because all of the processing code is written in scheme.
• It's absolutely critical to understand what Atoms are, what the AtomSpace is, and the RelEx OpenCog_format, which details how words and sentences are represented with atoms.
• The data is stored in an SQL database, but it's not important to know that, as the database is not accessed directly. It could be slightly useful to know C++, because OpenCog is written in it; but this is not really important, because these parts hopefully won't need any new hacking.

There are diaries, status reports and READMEs specific to this project.
• The README explains installation, setup, operation.
• The sheaves PDF describes an important theoretical foundation for much of this work.

## Steps to learning

The process of learning requires a number of processing steps. These are reviewed below, but also discussed in greater detail in this README file.

• Create one or more Dockerfiles that automate the process below. Status: not started. You must do the steps by hand.
• Download and cleanup corpus. Mostly, wikipedia is used. The raw wikipedia pages need to be cleaned and scrubbed. There are scripts here (part of relex) that do this, for English, French, Polish, Lithuanian, Tagalog, Chinese. Status: Create one for your language as needed.
• Setup infrastructure:. Download, compile, setup OpenCog. Install, setup PostgreSQL. Hook up Postgres to OpenCog. Download, compile Link Grammar. Download, compile RelEx. Set them up. Status: there are existing Docker files that can sort-of help with some of this. More documentation is needed.
• Process the corpus: for English-language learning: Run the text corpus through the Link Grammar parser, via RelEx. Scripts for this can be found here (in opencog). The processing parses sentences with the ANY language, which generates completely random parse trees. These random parse trees are used to collect statistics. Status: The scripts work, but may require fiddling.
• Morphology processing: for non-English languages with concatenative morphologies: Run the text corpus through the Link Grammar parser, via RelEx, using either the ADY or the AMY (but not the ANY) "language". The ADY parser randomly splits words into pairs of morphemes; the AMY parser splits words into two or three random morphemes; both then create a random parse tree. Thus, it is similar to the above, except that the morpheme boundaries are not yet known. The ADY and AMY parsers are defined here (part of link-grammar) and here. Note that the AMY parser is not entirely stable, and is undergoing active development now (Feb 2017).
• Word-pair counting: collect statistics for nearby pairs of words. This is needed to get an idea of what words typically are near each other, and what order they typically come in. Status: Done, and works well. The pipeline file, together with the scripts, automates this.
• Morpheme-pair counting: almost identical to the above. But not quite. Status: Not started. Needs some design work.
• Compute word-pair mutual information (MI). This gives an idea of 'how strong' a word-pair is. For example, "Northern Ireland" has a high MI, because when these words occur, they often come together. By contrast "and if" has a very low MI, because the two words "and", "if" are rarely in each-others company. MI is needed for the next step. Status: done, and works painfully slowly. The compute-MI file does this.
• Morpheme discovery: Morpheme boundaries are most likely to occur at those places where the MI is the lowest. This is because words articulate at morpheme boundaries. Status: not started. Contact Linas for mathematical and procedural details. This step requires a lot of data-mining, experimental work, data analysis, theoretical thought and creation of new code. This is a major step. GSOC candidate project! Here is a PDF showing a worked example of the discovery algorithm: File:Morfo-worked-example.pdf.
• Re-parsing with learned morphemes: it almost surely does not make sense to go to the next stage until a morphology has been learned. Once morphemes are known, they can be treated like word boundaries; text is re-parsed and treated as if each morpheme were a word. Status: Not started. Infrastructure needs to be created.
• Maximum Spanning Tree (MST) parsing: parse text, using the word-pairs that have been learned. This is the first "realistic" parse of the input text. It is "well-known" in computational linguistics that MST trees closely resemble the correct dependency parse of a sentence. The primary limitation of MST parsing is that it does not provide part-of-speech (POS) information, and doesn't provide grammatical (subject, object, etc) information. That is learnt in a later step. Status: Done, and works well. Implemented in this file (part of opencog).
• Discover word types and link types. As a result of MST parsing, every word-pair gets its own unique link type. But we know, from general linguistics principles, that this is not needed, and that words can be grouped together into classes (parts of speech), and that link-types, such as subject, object, are fairly generic, linking different parts of speech. The goal of type discovery is to discover these types (kinds, classes). Note: in English, there are hundreds or more "fine-grained" parts of speech, and not just half-a-dozen. The goal here is to simplify, but not too much. Status: GSOC candidate project!
• Disjunct extraction, parse ranking. Given an MST parse, with labelled link-types between words, link-grammar disjuncts can be easily extracted. These need to be collected, so that one can build a dictionary or lexis for Link-grammar: that is, a lookup table that, given a word, will provide the candidate disjuncts that can be used with that word. In addition, a tally must be made of how often these occur in text, so that a parse-ranking score can be computed (i.e. the MI of each disjunct is that disjunct's 'cost'). Status: GSOC candidate project!
• Link parsing. In order to make use of the lexis that has just been learned, it has to somehow be fed into the actual link-parser. The easiest/best way to do this seems to be to put everything into a database, and pull from there. Status: Straight-forward hacking needed to make this work.
• Morphology. Most natural languages have a non-trivial morphology. It, too, can be learned, following steps that are quite similar to the above, with the exception that, instead of splitting a sentence into words, a word is split into candidate morphemes. Status: GSOC candidate project! This is a nice stand-alone project, with less dependency on the rest of the code.

## Processing Details

The below expands on the overview above.

### Word-Pair counting

Learning starts with a blank slate. The goal of the first stage of learning is to simply learn a collection of words, and get some general idea of how often they occur near one-another, and what order they occur in. This is achieved simply by observing raw text.

In principle, it's very easy to count word pairs: a few hours to write some perl scripts will do the trick. We don't do that here; we go through the full link-grammar-to-relex-to-opencog pipeline, for two reasons: 1) we'll need this pipeline later, anyway, and 2) it's easy, if a bit slow. Again, the whole point is to have opencog learn -- so we have to feed opencog with data it understands.

The observation is done by parsing Wikipedia articles with the Link Grammar "ANY" language. This is a trivial language that randomly parses any sequence of words presented to it. The resulting parse is a valid Link-Grammar parse (i.e. no links cross, etc.) with word-pairs randomly connected by the ANY link. So for example:

                                     +------ANY------+
+---ANY--+--ANY--+--ANY--+--ANY--+--ANY-+        |
|        |       |       |       |      |        |
LEFT-WALL some[!] text[!] הכלב[!] שרץ[!] была[!] доктор[!]


This is used as a source of random-but-nearby word-pairs, that opencog can collect statistics on. The RelEx server is able to pass such text to link-grammar, and convert the resulting links back to OpenCog atoms. For example:

java $VM_OPTS $RELEX_OPTS -classpath $CLASSPATH relex.Server --lang any -n 99 --link
telnet localhost 4444

(EvaluationLink (stv 1.0 1.0)
   (WordInstanceNode "הכלב@7327ae70-beb2-4063-bbd9-99f5caa53ac4")
   (WordInstanceNode "была@7aae5978-2d80-4dc7-a8af-f36d893c206e")
)
(ReferenceLink
   (WordInstanceNode "была@7aae5978-2d80-4dc7-a8af-f36d893c206e")
   (WordNode "была")
)


and so-on.

The resulting OpenCog atoms are automatically passed to the CogServer, where they are counted. The counts are stored in the CogServer database backend. Scripts are provided to automate all of this.

See /opencog/nlp/learn/README for details on how to set this all up. This stage is fully implemented and works well. Not terribly fast, but fast enough.
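In spirit, the counting stage reduces to the following sketch: a plain sliding-window pair counter, standing in for the full ANY-parse pipeline described above. The window size and the function name are invented for illustration; the real pipeline observes pairs via random parses rather than a fixed window.

```python
from collections import Counter

def count_pairs(sentences, window=6):
    """Count ordered word pairs that co-occur within `window` words of
    each other.  A toy stand-in for the ANY-parse pipeline: any nearby
    pair is observed, with word order preserved."""
    pair_counts = Counter()
    word_counts = Counter()
    for sent in sentences:
        words = sent.split()
        word_counts.update(words)
        for i, w1 in enumerate(words):
            for w2 in words[i + 1 : i + window]:
                pair_counts[(w1, w2)] += 1  # (left-word, right-word)
    return word_counts, pair_counts

words, pairs = count_pairs(["the dog ran", "the dog barked"])
print(pairs[("the", "dog")])   # 2
```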

### Computing the Mutual Information

Compute word-pair mutual information (MI). This gives an idea of 'how strong' a word-pair is. For example, "Northern Ireland" has a high MI, because when these words occur, they often come together. By contrast "and if" has a very low MI, because the two words "and", "if" are rarely in each-others company. MI is needed for maximum spanning-tree parsing.

MI values typically range from -5 to +20, and very rarely higher or lower. Any MI value above 4 indicates that the pair of words are fairly strongly associated with one-another.

The initial counts of words and word pairs are held in CountTruthValues. This truth value has three fields: one is used to hold the raw count, and another is used to hold either the logarithm of the normalized frequency (the log probability) or the MI.

Word pairs are represented in the following format:

  EvaluationLink   (countTV #log-probability #raw-count)
      WordNode "some-word"
      WordNode "other-word"


These are (automatically) saved to the SQL database, so that the cogserver can be stopped and started without "forgetting" what has been learned.

The script to compute the MI is in /opencog/nlp/learn/compute-mi.scm. This stage is fully implemented and it works. Caution: this is currently quite slow, bordering on unbearably slow. Something needs to be done about this.
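The computation itself can be sketched as follows. This is a minimal Python stand-in for compute-mi.scm, assuming the pairwise (fractional) MI is the log2 of the joint pair probability over the product of the left and right marginals; the exact normalization in the actual script may differ in detail.

```python
from collections import Counter
from math import log2

def pair_mi(pair_counts):
    """MI(l,r) = log2( p(l,r) / (p(l,*) p(*,r)) ) for every observed
    (left-word, right-word) pair, with marginals taken over the
    observed pairs.  A sketch, not the production compute-mi.scm code."""
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (l, r), n in pair_counts.items():
        left[l] += n    # p(l,*) marginal
        right[r] += n   # p(*,r) marginal
    return {
        (l, r): log2(n * total / (left[l] * right[r]))
        for (l, r), n in pair_counts.items()
    }

mi = pair_mi({("a", "b"): 2, ("a", "c"): 2})
print(mi[("a", "b")])   # 0.0 -- "b" follows "a" no more than chance predicts
```

A strongly-associated pair like "Northern Ireland" would score high because its joint count is large relative to the marginal counts of each word alone.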

### Maximum Spanning Tree Parsing

Maximum Spanning Tree (MST) parsing: parse text, using the word-pairs that have been learned. This is the first "realistic" parse of the input text. It is "well-known" in computational linguistics that MST trees closely resemble the correct dependency parse of a sentence. The primary limitation of MST parsing is that it does not provide part-of-speech (POS) information, and doesn't provide grammatical (subject, object, etc) information. That is learnt in a later step.

Given a string of words, these can be connected together into a tree in many different ways. An MST parse is the tree for which the sum total of mutual information (MI) between word pairs is maximized. The result resembles the ANY parse, above, with two very important differences: 1) the link types are no longer "ANY", and 2) the parse is NOT random (the links are the ones that maximize the sum of the MI)! Borrowing the previous diagram to illustrate this, an MST parse might look like this:

                            +------L42------+
+--LQ6--+--LP3--+--LR5--+--LW9-+        |
|       |       |       |      |        |
some[!] text[!] הכלב[!] שרץ[!] была[!] доктор[!]


The link types in the example above are unique for the given word-pair. That is, LQ6 is the link type that connects together only two words: some and text; it cannot connect together any other words. Note that the word-pair MI is associated with each link type: the links that were chosen were the ones that maximize the sum of the MI's.

This stage is fully implemented and it works well. The code is in /opencog/nlp/learn/mst-parser.scm
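The idea can be sketched with a toy Prim's-algorithm version in Python. Unlike mst-parser.scm, this sketch does not enforce the planarity (no-crossing) constraint of a proper dependency parse, and the MI values are invented for illustration:

```python
def mst_parse(words, mi):
    """Grow a maximum spanning tree over the words of one sentence,
    scored by word-pair MI.  Plain Prim's algorithm; the real parser
    additionally keeps the tree projective (no crossing links)."""
    def score(a, b):
        # unseen pairs get a very low score
        return mi.get((a, b), mi.get((b, a), -100.0))
    in_tree = {0}
    links = []
    while len(in_tree) < len(words):
        best = max(
            ((i, j) for i in in_tree
                    for j in range(len(words)) if j not in in_tree),
            key=lambda ij: score(words[ij[0]], words[ij[1]]),
        )
        links.append(best)
        in_tree.add(best[1])
    return links

mi = {("the", "dog"): 3.0, ("dog", "ran"): 4.5, ("the", "ran"): -1.0}
print(mst_parse(["the", "dog", "ran"], mi))   # [(0, 1), (1, 2)]
```

The link (0, 2) is rejected because connecting "the" and "ran" directly would lower the total MI of the tree.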

### Type discovery

In the above example, every word-pair got its own unique link type. But we know, from general linguistics principles, that this is not needed, and that words can be grouped together into classes (parts of speech), and that link-types, such as subject, object, are all fairly generic. The goal of type discovery is to discover these types (kinds, classes). Note: in English, there are hundreds or more "fine-grained" parts of speech, and not just half-a-dozen. The goal here is to simplify, but not too much.

Type discovery can be done using the principle of maximum entropy. Consider, for example, a sentence very similar to the above example:

                            +------L42------+
+--LZZ--+--LP3--+--LR5--+--LW9-+        |
|       |       |       |      |        |
more[!] text[!] הכלב[!] שרץ[!] была[!] доктор[!]


It differs from the above in that the LZZ link connects more instead of the LQ6 link connecting some. This suggests that there should be a word type ${\displaystyle \gamma =\{more,some\}}$ that holds the two words more and some. The two link types LZZ and LQ6 should be eliminated, in favor of a single type that connects the word class ${\displaystyle \gamma }$ to text. Maximum entropy suggests that two words (classes) should be merged if the following inequality holds:

${\displaystyle {\frac {2}{N}}\lesssim h_{PAIR}-h_{PAIR}^{reduced}}$

where N is the number of word-pairs observed, and ${\displaystyle h_{PAIR}}$ is the entropy

${\displaystyle h_{PAIR}=-\sum _{w_{1},w_{2}\in \alpha }p(w_{1},w_{2})\log _{2}p(w_{1},w_{2})}$

Here, ${\displaystyle w_{1}}$ and ${\displaystyle w_{2}}$ are words, and the sum is taken over all word-pairs ${\displaystyle (w_{1},w_{2})}$, and ${\displaystyle p(w_{1},w_{2})}$ is the probability of having observed the word-pair ${\displaystyle (w_{1},w_{2})}$. The entropy ${\displaystyle h_{PAIR}^{reduced}}$ is defined likewise, except that this time, two words, A and C are removed, and replaced by a type ${\displaystyle \gamma =\{A,C\}}$. That is, A and C are the two words that might be grouped into one type (more and some in the example above). The probabilities sum, so that

${\displaystyle p(w_{1},\gamma )=p(w_{1},A)+p(w_{1},C)}$

and likewise for the left-right reversed case. Most of the sums above cancel, leaving the much smaller sum over words, instead of word-pairs:

${\displaystyle {\begin{aligned}{\frac {2}{N}}\lesssim -\sum _{w\in \alpha \backslash A,C}&\left[p(A,w)\log _{2}p(A,w)+p(C,w)\log _{2}p(C,w)-p(\gamma ,w)\log _{2}p(\gamma ,w)\right]\\-\sum _{w\in \alpha \backslash A,C}&\left[p(w,A)\log _{2}p(w,A)+p(w,C)\log _{2}p(w,C)-p(w,\gamma )\log _{2}p(w,\gamma )\right]\end{aligned}}}$

Basically, the formula says that if two words A and C can be found that satisfy the inequality, they should be merged into one word-type.
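The reduced inequality can be evaluated directly from pair counts, as in this sketch (Python, with invented counts; pairs whose both members are A or C are ignored, matching the sums over ${\displaystyle \alpha \backslash A,C}$ above):

```python
from math import log2

def plogp(p):
    # one entropy term; zero probability contributes zero
    return -p * log2(p) if p > 0 else 0.0

def should_merge(A, C, pair_counts):
    """Evaluate the maximum-entropy merge criterion for candidate words
    A and C: merge if 2/N <= h_PAIR - h_PAIR_reduced.  Only the terms
    that change are summed; all other terms cancel, as in the text."""
    N = sum(pair_counts.values())
    p = {pair: n / N for pair, n in pair_counts.items()}
    vocab = {w for pair in pair_counts for w in pair} - {A, C}
    delta = 0.0
    for w in vocab:
        # right-hand neighbours: pairs (A, w) and (C, w)
        pa, pc = p.get((A, w), 0.0), p.get((C, w), 0.0)
        delta += plogp(pa) + plogp(pc) - plogp(pa + pc)
        # left-hand neighbours: pairs (w, A) and (w, C)
        pa, pc = p.get((w, A), 0.0), p.get((w, C), 0.0)
        delta += plogp(pa) + plogp(pc) - plogp(pa + pc)
    return 2.0 / N <= delta

# "more" and "some" share the right-neighbour "text", so merging them
# saves more entropy than the 2/N threshold.
pairs = {("more", "text"): 5, ("some", "text"): 5, ("text", "dog"): 5}
print(should_merge("more", "some", pairs))   # True
```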

Status: This stage is going to be implemented "real soon now". This is a potential GSOC 2014 project!

Open Questions: How to represent this in the atom space? Need to have some way of writing 'word is member of word-type'.

### Disjunct extraction, parse ranking

In the example sentence,

                           +------L42------+
+--LZZ--+--LP3--+--LR5--+--LW9-+        |
|       |       |       |      |        |
more[!] text[!] הכלב[!] שרץ[!] была[!] доктор[!]


The word text is associated with a disjunct, which can be directly read off of the diagram above: it is LZZ- & LP3+, denoting that, in order for the word text to be usable in a parse, it must connect to the left with an LZZ link, and it must connect to the right with an LP3 link.

These disjuncts need to be computed, for each MST parse, and collected up, as they represent the valid uses of the word text in the various different sentences in which it can appear.
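The read-off can be sketched as follows (Python; the link-triple representation and the connector-ordering rule are simplifications invented for illustration, the real Link Grammar ordering conventions are subtler, and words are assumed distinct within a sentence):

```python
def extract_disjuncts(words, links):
    """Read each word's disjunct off a labelled parse.
    links: (left_index, right_index, label) triples.  Leftward
    connectors get '-', rightward ones '+', ordered by the position
    of the word at the other end of the link."""
    conns = {i: [] for i in range(len(words))}
    for l, r, label in links:
        conns[l].append((r, label + '+'))  # rightward connector on left word
        conns[r].append((l, label + '-'))  # leftward connector on right word
    return {
        words[i]: ' & '.join(c for _, c in sorted(cs))
        for i, cs in conns.items() if cs
    }

# The example parse: more --LZZ-- text --LP3-- dog
links = [(0, 1, 'LZZ'), (1, 2, 'LP3')]
print(extract_disjuncts(['more', 'text', 'dog'], links)['text'])  # LZZ- & LP3+
```

Tallying these disjuncts over many MST parses, and computing the MI of each, would then supply the per-disjunct cost mentioned below.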

Status: Partly implemented. Finishing this is a valid GSOC 2014 project.

Currently, the code at the end of /opencog/nlp/learn/mst-parser.scm extracts the disjuncts. However, these are not saved anywhere. They need to be saved. Also need to collect statistics on them: how often does each one get used? It will also be important to compute the mutual information for each, as this will provide the link-grammar parse-cost (parse ranking). All of this should probably be done after(?) not before(?) the type-discovery stage, above.

The previous two steps, disjunct extraction and type learning, provide everything that is needed for a link-grammar dictionary to be created. That is, the various word-types are the link-grammar dictionary entries, and the list of disjuncts are the 'syntactic definitions' of those words. The mutual information associated with a disjunct is the 'cost' of that disjunct, and is needed for parse ranking.

The current Link Grammar dictionaries are plain-text files. This is not practical from a systems view-point; there needs to be a way of pulling dictionary entries from a database. This would allow OpenCog to dynamically update the database as it learns, and link-grammar to always use the latest dictionaries as it parses.

Status: Partly implemented. This is probably NOT a GSOC-2014 project, unless you happen to be an excellent programmer, and are willing to write very clean, high-quality code.

### Morphology

Most natural languages have a non-trivial morphology. It, too, can be learned, following steps that are quite similar to the above, with the exception that, instead of splitting a sentence into words, a word is split into candidate morphemes.
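The MI-minimum heuristic for finding morpheme boundaries can be sketched as follows (Python, with invented counts; the actual discovery algorithm in the worked-example PDF is more involved):

```python
from math import log2

def best_split(word, counts, total):
    """Split a word at the boundary where the mutual information
    between the left and right halves is lowest -- the heuristic that
    morpheme boundaries sit at MI minima.  `counts` maps candidate
    morphemes (and the whole word) to corpus counts; splits with
    unseen halves are skipped."""
    def p(s):
        return counts.get(s, 0) / total
    best, best_mi = None, float('inf')
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if p(left) == 0 or p(right) == 0:
            continue
        mi = log2(p(word) / (p(left) * p(right)))
        if mi < best_mi:
            best, best_mi = (left, right), mi
    return best

# Hypothetical counts: "walk" and "ed" are both common, so the MI
# across the walk|ed boundary is low compared to walke|d or w|alked.
counts = {"walked": 10, "walk": 40, "ed": 60,
          "walke": 2, "d": 90, "w": 5, "alked": 1}
print(best_split("walked", counts, 1000))   # ('walk', 'ed')
```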

Here is a PDF showing a worked example of the discovery algorithm: File:Morfo-worked-example.pdf.

Status: Not started. This could be a fairly stand-alone project, and is suitable for a GSOC 2014 project.

### Transfer Learning

Once the above is done, the learned word and link-types can be compared to the existing word and link-types, for English and for Russian. This can help clarify the actual syntactic roles of the learned groups and types. That is, the result of learning will be messy link-type names, such as LASDF, which could very possibly be more or less identical to the existing S (subject) link ... but we have no way of knowing this, a-priori. By comparing the learned parser to the existing parser, clearer labels could be extracted.

Status: cannot be started until all of the stages above are done.

## Conclusion

Completing the above is a four-step process: 1) writing the code to do things, 2) running the code on the dataset, 3) verifying and pondering the results, to make sure that the code is working correctly, and that the things being learned are what we expect to happen (more or less), and finally, 4) tuning the code (as the first draft will probably be buggy, glacially slow, and be somewhat mis-designed.)

The above 4 steps, combined, are probably a bit more than what can be accomplished during GSOC, but good progress is possible.

Once the above are done, one can then move on to the later stages as described in the ArXiv pre-print. Detailing those is a future task.