Language learning


Learning a natural language, without supervision, is one of the AGI challenges. An early draft proposal for how this could be done is available on ArXiv, as Learning Language from a Large (Unannotated) Corpus. The currently running project can be found in the github repo.

This page describes the implementation and status as it was in the 2014-2017 time-frame. As of 2021, the project is still (very) active, although the details and the theoretical foundations have shifted a bit as we've gotten smarter. The best and most accurate status can be found in the github repo. The general survey below is still more-or-less accurate.

We hope the details below are sufficient for you to understand and contribute to this project in a meaningful way. Feel free to submit questions to the OpenCog mailing list or the Link Grammar mailing list.

Reading List

Understanding the 'big picture' requires extensive reading. Pretty much everything on the list below is mandatory, and it is probably best to read it in the order given. Since there is a lot to get through, here is a shortened selection:

  • The Link Grammar wiki page describes Link Grammar and its generalization.
  • The README.md project README file describes the project in general.
  • The opencog/sheaf github repo and the sequence of PDFs in the opencog/sheaf/docs subdirectory establish the theoretical foundation of representing a knowledge graph in terms of the sections of a sheaf.

Link Grammar

Understanding Link Grammar is central and primary. The Wikipedia article provides a reasonable introduction. The system documentation provides some additional detail. There is also the primary website. However, the original articles are the most important to read:
  • Daniel Sleator and Davy Temperley. 1991. Parsing English with a Link Grammar. Carnegie Mellon University Computer Science technical report CMU-CS-91-196, October 1991. PDF
  • Daniel Sleator and Davy Temperley. 1993. Parsing English with a Link Grammar. Third International Workshop on Parsing Technologies. (This is a shorter but more up-to-date version of the technical report above. It contains a better introduction to link grammars, and gives a more detailed description of the relationship between link grammar and other formalisms.) PDF
The take-away concepts: Link Grammar performs parsing by starting with a dictionary or lexis containing lists of words. Each such collection of words is associated with a collection of disjuncts. In turn, each disjunct is a sequence of connectors, with the connectors determining how a word may connect, or link, to another word. Disjuncts should be visualized as jigsaw-puzzle pieces, with a word printed on each puzzle piece. A sentence parses if and only if the jigsaw pieces snap together. The jigsaw concept appears in other places as well, for example in various pop-sci articles.
The use of the word "quantum" is not accidental; it is related to type theory and linear logic; see "Supplementary reading" below.
These concepts have been abstracted into Atomese Atom types, specifically the Section, the ConnectorSeq, the Connector and the ConnectorDir.
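
For concreteness, one such jigsaw piece is stored in the AtomSpace roughly as follows. This is a minimal illustrative sketch (the word and its connectors are made up; depending on the processing stage, connectors may name individual words or word-classes): the word "saw", connecting to "she" on its left and to "it" on its right.

(Section
   (WordNode "saw")
   (ConnectorSeq
      (Connector (WordNode "she") (ConnectorDir "-"))
      (Connector (WordNode "it") (ConnectorDir "+"))))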

Mutual Information and MST Parsing

A basic grammar can be discovered by computing the mutual information between word-pairs, and then using a Maximum Spanning Tree (MST) to obtain an un-labelled parse. There are many different papers and resources on these topics, but perhaps the best and most readable is Deniz Yuret's original PhD thesis. Understanding the ideas here is centrally important.
Secondary reading involves studying various maximum entropy principles.
The Yuret MST parse of a sentence is a dependency parse, and thus resembles a link-grammar parse, except that the links are unlabelled. A big part of the current project is the automated conversion of these unlabelled MST parses into labelled disjuncts. A big part of the "magic" is that this extraction improves the accuracy; MST parses are well-known for being OK, but not great. Yuret does not explain this step, because he did not know of it; the disjunct-extraction process is new and specific to this project.

Synonyms, Antonyms

Given a dependency parse, Dekang Lin explains how to discover synonyms. Poon and Domingos generalize the process to the discovery of synonymous phrases.
  • Dekang Lin and Patrick Pantel. DIRT: Discovery of Inference Rules from Text. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01), pages 323–328. ACM Press, 2001.
  • Hoifung Poon and Pedro Domingos. Unsupervised Semantic Parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1–10, Singapore, August 2009. Association for Computational Linguistics.
Other writings by Lin can inform a bit more.

Semantics from the MTT perspective

The ultimate goal is to extract meaning from text. MTT, or Meaning-Text Theory (see Wikipedia), offers an insightful framework for this. Aside from the DSyntR structures, and the concepts of "rheme" and "theme", another very important concept is that of the lexical function.
  • James Steele, editor. Meaning-Text Theory: Linguistics, Lexicography, and Implications. University of Ottawa Press, 1990.
Then there is a 410-page monster of a book; it is supplementary reading; I haven't read it myself.

Semantics from other perspectives

While MTT gives a general structural means to think about meaning, there are other illuminating ways to think about it as well. One is given below:
The above is particularly interesting and important if/when one realizes that WordNet and PageRank, which are central to the above, can be replaced by something like PLN, so that one performs short-distance reasoning between phrases, N-grams or sentence fragments, and then re-assembles these local reasoning snippets, selecting the most likely global interpretation for a sentence. It is here that concepts of logic, reasoning, memory and knowledge start becoming important.

Supplementary reading

Additional reading and background.
  • Type theory is important, because the Link Grammar connectors, links and disjuncts are best understood as types.
  • The first 2-3 chapters of the Homotopy Type Theory book give a good overview of types, and what they are.
  • John Baez, Mike Stay, Physics, Topology, Logic and Computation: A Rosetta Stone (2009) gives a fairly direct overview of category theory and monoidal categories. It does not talk about natural language, but it does provide the bridge as to why "quantum" sometimes shows up in natural language. The thing to look for here is the "tensor monoid" or monoidal category. This category just-so happens to have the same structure as Link Grammar, in that the disjuncts occurring in the lexical entries of the Link Grammar dictionary are described by linear logic. There are additional connections to proof theory, modal logic, and intuitionistic logic.
  • The pregroup grammar provides the most direct, immediate and concrete articulation of the connection between monoidal categories and natural language. Note that pregroup grammars are more-or-less the same thing as Link Grammar, modulo details that kind-of don't matter. Some of the papers cited there provide more.
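
To make the pregroup/Link Grammar correspondence concrete, here is the textbook example (illustrative only; S and O are the conventional Link Grammar names for the English subject and object links):

    transitive verb, pregroup type:    n^r \cdot s \cdot n^l
    transitive verb, Link Grammar:     S- & O+

The pregroup type says the verb expects a noun on its left (the right adjoint n^r) and a noun on its right (the left adjoint n^l), yielding a sentence s; the disjunct says exactly the same thing: connect to a subject on the left and an object on the right.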

Technical Reading

The current software system requires understanding the following computer technologies:
  • It's critical to understand Scheme, because all of the processing code is written in Scheme.
  • All data is represented in terms of highly sparse vectors and matrices. The AtomSpace provides a generic API for working with these: the opencog/matrix API (in the AtomSpace github repo). This API allows any arbitrary collection of pairs of things (pairs of atoms) to be interpreted as a matrix, and thus provides a toolkit for working with that matrix and the row/column vectors in it. This includes assorted marginal sums, conditional probabilities, mutual information, dot-products, Jaccard and Hamming distances, and so on.
  • The knowledge graph is represented in terms of "sheaves". These are generalizations of the core concepts found in Link Grammar. The details of this generalization are given in the AtomSpace github repo and, in particular, in the sequence of PDFs in the opencog/sheaf/docs subdirectory. In brief, all data is in the form of "lexical entries" (the germs of a sheaf) and the connectivity information is represented with a Section. Each Section is a "pair of things": a word or word-class, and a ConnectorSeq of the possible ways in which that word can connect to other neighboring words. See the Link Grammar wiki page for a rapid overview.

Project-specific reading

There are diaries, status reports and READMEs specific to this project.

Steps to learning

The general entry point to the learning software is the project README in the opencog/learn github repo. It describes two basic flows: in one case, one attempts to learn by examining a large natural-language corpus. This flow is "uncalibrated", in that one has no a priori technique to assess the accuracy of the results. The second flow starts with an "artificial" or "fake" language, randomly generated so as to resemble a real language. This allows calibration of the pipeline, as the learned language can be directly compared to the known input language.

A quick summary of the general process is given below. It's a sketch; refer to the github project for more precise and up-to-date information.


  • Setup infrastructure: download, compile and set up the AtomSpace and all needed tools.
  • Download and clean up a corpus. Wikipedia is not recommended; it has very few action verbs (run, jump, hit, ...) that would be common in ordinary speech, and has far too many geographical place names, product names and technical jargon. For the calibrated pipeline, an artificial language and corpus is generated.
  • Word-pair counting: collect statistics for nearby pairs of words. This is needed to get an idea of what words are typically near each other, and what order they typically come in. Status: Done, and works well. The pipeline file, together with the scripts, automates this.
  • Morphology processing: for non-English languages with concatenative morphologies, run the text corpus through the Link Grammar parser, using either the ADY or the AMY (but not the ANY) "language". The ADY parser randomly splits words into pairs of morphemes; the AMY parser splits words into two or three random morphemes; both then create a random parse tree. Thus, it is similar to the above, except that the morpheme boundaries are not yet known. The ADY and AMY parsers are defined as part of link-grammar. Note that the AMY parser is not entirely stable.
  • Morpheme-pair counting: almost identical to word-pair counting. But not quite. Status: Not started. Needs design work.
  • Compute word-pair mutual information (MI). This gives an idea of 'how strong' a word-pair is. For example, "Northern Ireland" has a high MI, because when these words occur, they often come together. By contrast "and if" has a very low MI, because the two words "and", "if" are rarely in each-others company. MI is needed for the next step. Status: done, and works painfully slowly. The compute-MI file does this.
  • Morpheme discovery: Morpheme boundaries are most likely to occur at those places where the MI is the lowest. This is because words break apart (articulate) at morpheme boundaries. Status: not started. Contact User:Linas for mathematical and procedural details. This step requires a lot of data-mining, experimental work, data analysis, theoretical thought and creation of new code. This is a major step. Here is a PDF showing a worked example of the discovery algorithm: File:Morfo-worked-example.pdf.
  • Re-parsing with learned morphemes: it almost surely(???) does not make sense to go to the next stage, until a morphology has been learned. Once morphemes are known, they can be treated like word boundaries; text is re-parsed and treated as if each morpheme was a word. Status: Not started. Infrastructure needs to be created.
  • Maximum Spanning Tree (MST) parsing: parse text, using the word-pairs that have been learned. This is the first "realistic" parse of the input text. It is "well-known" in computational linguistics that MST trees closely resemble the correct dependency parse of a sentence. The primary limitation of MST parsing is that it does not provide part-of-speech (POS) information, and doesn't provide grammatical (subject, object, etc) information. That is learnt in a later step. Status: Done, and works well. Implemented in mst-parser.scm (atomspace/sheaf).
  • Disjunct extraction, parse ranking. Given an MST parse, with labelled link-types between words, disjuncts can be easily extracted. These are then gathered and counted, so that relative frequency counts become available. This forms the next-level candidate dictionary for a language: the disjuncts capture how likely a word is to appear in the context of other neighboring words. In this sense, disjuncts resemble the "skip-gram" from artificial neural nets. The mutual information between a word and its collection of disjuncts provides a measure of the likelihood or correctness of that grammatical form. Status: Implemented in make-section.scm with a utility wrapper in mst-parser (learn project).
  • Classification and clustering aka discovery of word types and link types. The collection of disjuncts obtained above describes linkages between individual words. Of course, it is known from general linguistics principles, that words can be grouped together into classes (parts of speech), and that link-types, such as subject, object, are fairly generic, linking different parts of speech. The goal of type discovery is to discover these types (kinds, classes). Note: in English, there are hundreds or more "fine-grained" parts of speech, and not just half-a-dozen. The goal here is to simplify, but not too much. Status: Implemented across multiple files in the learn project scm directory having "gram" in their name. The general process is to interpret the word-disjunct pairings as a "vector" attached to the word. The similarity of two different words can be obtained with any one of several vector-similarity metrics. The similarity can then be used to make clustering decisions as to which words are "grammatically similar" to one another.
  • Dictionary export. After grammatical classification has been performed, the resulting dictionary needs to be placed into a format that the link-grammar parser understands. Status: Done in export-disjuncts.scm.
  • Evaluation. The results of learning need to be evaluated for accuracy. Status: in progress. A jumble of tools and ideas is in the 'step five-compare' directory.
  • Morphology. Most natural languages have a non-trivial morphology. It, too, can be learned, following steps that are quite similar to the above, with the exception that, instead of splitting a sentence into words, a word is split into candidate morphemes. Status: Barely started. A smidgen of code and infrastructure has been created.

Processing Details

The below expands on the overview above.

Word-Pair counting

Learning starts with a blank slate. The goal of the first stage of learning is to simply learn a collection of words, and get some general idea of how often they occur near one-another, and what order they occur in. This is achieved simply by observing raw text.

In principle, it's very easy to count word pairs: a few hours to write some perl scripts will do the trick. We don't do that here; we use the link-grammar parser to generate planar trees, and count the word-pairs occurring in the tree. The primary reason for doing it this way is to store the pair-data in the AtomSpace. By storing all data in the AtomSpace, assorted system-integration issues are avoided. The flip side is that it is slow, perhaps 10x or 100x slower than carefully-designed custom code might be. C'est la vie.
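
To make the contrast concrete, the "easy" version is just window-based counting. Below is a minimal Guile Scheme sketch of that naive approach; it is not part of the actual pipeline, and the procedure name and data layout are made up for illustration.

(use-modules (srfi srfi-1))

; Count ordered word pairs (left . right) within a window of 'width' words.
; 'words' is a list of strings; returns an alist of ((left . right) . count).
(define (count-pairs words width)
  (let ((table (make-hash-table)))
    (let loop ((ws words))
      (unless (null? ws)
        (for-each
          (lambda (right)
            (let ((key (cons (car ws) right)))
              (hash-set! table key (1+ (or (hash-ref table key) 0)))))
          (take (cdr ws) (min width (length (cdr ws)))))
        (loop (cdr ws))))
    (hash-map->list cons table)))

; Example: (count-pairs (list "some" "text" "goes" "here") 2)

The actual pipeline obtains equivalent counts by a different route: parsing with the ANY language, described next, and storing the resulting pairs as Atoms in the AtomSpace.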

The observation is done by parsing with the Link Grammar "ANY" language. This is a trivial language that randomly parses any sequence of words presented to it. The resulting parse is a valid Link-Grammar parse (i.e. no links cross, etc.) with word-pairs randomly connected by the ANY link. So for example:

                                     +------ANY------+
    +---ANY--+--ANY--+--ANY--+--ANY--+--ANY-+        |
    |        |       |       |       |      |        |
LEFT-WALL some[!] text[!] הכלב[!] שרץ[!] была[!] доктор[!] 

The resulting parses are represented in Atomese, in the sentence representation format. For example:

(EvaluationLink (stv 1.0 1.0)
  (LinkGrammarRelationshipNode "ANY")
  (ListLink
     (WordInstanceNode "הכלב@7327ae70beb2-4063-bbd9-99f5caa53ac4")
     (WordInstanceNode "была@7aae5978-2d80-4dc7-a8af-f36d893c206e")
  )
)
(ReferenceLink (stv 1.0 1.0)
  (WordInstanceNode "была@7aae5978-2d80-4dc7-a8af-f36d893c206e")
  (WordNode "была")
)

and so-on. The resulting word-pairs are counted. The implementation is in link-pipeline.scm.

Computing the Mutual Information

Compute word-pair mutual information (MI). This gives an idea of 'how strong' a word-pair is. For example, "Northern Ireland" has a high MI, because when these words occur, they often come together. By contrast "and if" has a very low MI, because the two words "and", "if" are rarely in each-others company. MI is needed for maximum spanning-tree parsing.
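
Presumably the MI computed here is the standard (fractional) pointwise mutual information of the observed pair counts; schematically,

    \operatorname{MI}(w_l, w_r) = \log_2 \frac{p(w_l, w_r)}{p(w_l, *)\, p(*, w_r)}

where p(w_l, w_r) is the observed frequency of the ordered pair, and p(w_l, *) and p(*, w_r) are the left and right marginals (wild-card sums over all right-hand and left-hand partners, respectively).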

MI values typically range from -5 to +20, and very rarely higher or lower. Any MI value above 4 indicates that the pair of words are fairly strongly associated with one-another.

The initial counts of words and word pairs are held in CountTruthValues. This truth value has three fields: one is used to hold the raw count, and another is used to hold either the logarithm of the normalized frequency (probability), or the MI.

Word pairs are represented in the following format:

  EvaluationLink   (countTV #log-probability #raw-count)
     LinkGrammarRelationshipNode "ANY"
     ListLink
        WordNode "some-word"
        WordNode "other-word"

These are (automatically) saved to the database, so that the counting process can be stopped and started without "forgetting" what has been learned.

The script to compute the MI is in /scm/compute-mi.scm. This stage is fully implemented and it works. It's a bit slow.

Maximum Spanning Tree Parsing

Maximum Spanning Tree (MST) parsing: parse text, using the word-pairs that have been learned. This is the first "realistic" parse of the input text. It is "well-known" in computational linguistics that MST trees closely resemble the correct dependency parse of a sentence. The primary limitation of MST parsing is that it does not provide part-of-speech (POS) information, and doesn't provide grammatical (subject, object, etc) information. That is learnt in a later step.

Given a string of words, these can be connected together into a tree in many different ways. An MST parse is that tree for which the sum total of mutual information (MI) between word pairs is maximized. The result resembles the ANY parse, above, with two very important differences: 1) the link types are no longer "ANY", and 2) the parse is NOT random: the chosen links are exactly those that maximize the total MI. Borrowing the previous diagram to illustrate this, an MST parse might look like this:

                            +------L42------+
    +--LQ6--+--LP3--+--LR5--+--LW9-+        |
    |       |       |       |      |        |
 some[!] text[!] הכלב[!] שרץ[!] была[!] доктор[!] 

The link types in the example above are unique for the given word-pair. That is, LQ6 is the link type that connects together only two words: some and text; it cannot connect together any other words. Note that the word-pair MI is associated with each link type: the links that were chosen were the ones that maximize the sum of the MI's.
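
Stated as a formula (a sketch; the projectivity constraint is the same no-links-cross condition used in any Link Grammar parse), the MST parse of the word sequence w_1 ... w_n is the tree T that solves

    T^{*} = \arg\max_{T} \sum_{(w_i, w_j) \in T} \operatorname{MI}(w_i, w_j)

where the maximum runs over all trees whose links connect the words of the sentence without any links crossing.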

This stage is fully implemented and it works well. The code is in /opencog/sheaf/mst-parser.scm

Type discovery

In the above example, every word-pair got its own unique link type. But we know, from general linguistics principles, that this is not needed, and that words can be grouped together into classes (parts of speech), and that link-types, such as subject, object, are all fairly generic. The goal of type discovery is to discover these types (kinds, classes). Note: in English, there are hundreds or more "fine-grained" parts of speech, and not just half-a-dozen. The goal here is to simplify, but not too much.

Type discovery can be done using the principle of maximum entropy. Consider, for example, a sentence very similar to the above example:

                            +------L42------+
    +--LZZ--+--LP3--+--LR5--+--LW9-+        |
    |       |       |       |      |        |
 more[!] text[!] הכלב[!] שרץ[!] была[!] доктор[!] 

It differs from the above in that the LZZ link connects more instead of the LQ6 link connecting some. This suggests that there should be a word type that holds the two words more and some. The two link types LZZ and LQ6 should be eliminated, in favor of a single type that connects the word class to text. Maximum entropy suggests that two words (classes) A and C should be merged if the information lost by merging them is small enough; schematically, if the inequality

    N (H - H_T) < \epsilon

holds, where \epsilon is a (tunable) threshold, N is the number of word-pairs observed, and H is the entropy

    H = - \sum_{(a,b)} p(a,b) \log_2 p(a,b)

Here, a and b are words, the sum is taken over all observed word-pairs (a,b), and p(a,b) is the probability of having observed the word-pair (a,b). The entropy H_T is defined likewise, except that this time the two words A and C are removed, and replaced by a type T. That is, A and C are the two words that might be grouped into one type (more and some in the example above). The probabilities sum, so that

    p(T,b) = p(A,b) + p(C,b)

and likewise for the left-right reversed case, p(b,T) = p(b,A) + p(b,C). Most of the terms in the sums above cancel, leaving a much smaller sum over words, instead of word-pairs:

    H - H_T = \sum_{b} \big[ p(T,b) \log_2 p(T,b) - p(A,b) \log_2 p(A,b) - p(C,b) \log_2 p(C,b) \big] + (\text{the same sum, with the pair order reversed})

Basically, the formula says that if two words A and C can be found that satisfy the inequality, they should be merged into one word-type.
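
As a minimal sketch of the cancellation (plain Guile Scheme, illustrative only; the real implementation works on AtomSpace data through the matrix API, and the procedure names and alist data layout here are made up), the right-going contribution to H - H_T can be computed directly from the pair probabilities of A and C:

(use-modules (srfi srfi-1))

; p * log2(p), with the convention 0 * log 0 = 0.
(define (plog p)
  (if (> p 0) (* p (/ (log p) (log 2))) 0))

; Right-going contribution to H - H_T when words A and C are merged into T.
; pairs-A and pairs-C are alists mapping a right-hand word b to p(A,b), p(C,b).
(define (entropy-drop pairs-A pairs-C)
  (let ((rights (delete-duplicates
                  (append (map car pairs-A) (map car pairs-C)))))
    (fold
      (lambda (b acc)
        (let* ((pa (or (assoc-ref pairs-A b) 0))
               (pc (or (assoc-ref pairs-C b) 0))
               (pt (+ pa pc)))
          ; p_T log p_T - p_A log p_A - p_C log p_C, for this word b
          (+ acc (- (plog pt) (plog pa) (plog pc)))))
      0 rights)))

; Example:
; (entropy-drop '(("text" . 0.010) ("words" . 0.002))
;               '(("text" . 0.015)))

The left-going contribution is computed the same way, and the two are added to obtain the full H - H_T.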


Status: The above describes the ideas as they were in 2014. See the github repo for the current status. Basically, it has been implemented and is being tuned in various ways.

Disjunct extraction, parse ranking

In the example sentence,

                           +------L42------+
   +--LZZ--+--LP3--+--LR5--+--LW9-+        |
   |       |       |       |      |        |
more[!] text[!] הכלב[!] שרץ[!] была[!] доктор[!] 

The word text is associated with a disjunct, which can be directly read off of the diagram above: it is LZZ- & LP3+, denoting that, in order for the word text to be usable in a parse, it must connect to the left with an LZZ link, and it must connect to the right with an LP3 link.

These disjuncts need to be computed, for each MST parse, and collected up, as they represent the valid uses of the word text in the various different sentences in which it can appear.
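
In Atomese, this extracted disjunct becomes a Section, with the observation count accumulated on it. A sketch (the count value and the truth-value shorthand are illustrative):

(Section (ctv 1 0 42)   ; observed 42 times
   (WordNode "text")
   (ConnectorSeq
      (Connector (WordNode "more") (ConnectorDir "-"))
      (Connector (WordNode "הכלב") (ConnectorDir "+"))))

Here the connectors name the words at the far end of each link; at the MST stage this is equivalent to naming the link types, since each link type is unique to its word-pair.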

Status: Done, see above.

Live link-grammar parsing

The previous two steps, disjunct extraction and type learning, provide everything that is needed for a link-grammar dictionary to be created. That is, the various word-types are the link-grammar dictionary entries, and the lists of disjuncts are the 'syntactic definitions' of those words. The mutual information associated with a disjunct is the 'cost' of that disjunct, and is needed for parse ranking.
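
For example, an exported dictionary fragment for the toy example above might look roughly like this (standard Link Grammar dictionary syntax; the grouping of "more" and "some" into one entry, and the cost value, are made up for illustration):

    % word-class containing "more" and "some"
    more some: LZZ+;

    % "text", with its single learned disjunct; the bracketed cost comes from the MI
    text: [LZZ- & LP3+]0.3;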

The current Link Grammar dictionaries are plain-text files. This is not practical from a systems view-point; there needs to be a way of pulling dictionary entries from a database. This would allow OpenCog to dynamically update the database as it learns, and link-grammar to always use the latest dictionaries as it parses.

Status: Partly implemented. Dictionaries can be exported in batch mode to LG. But the parser itself cannot directly suck dictionaries out of the atomspace.

Morphology

Most natural languages have a non-trivial morphology. It, too, can be learned, following steps that are quite similar to the above, with the exception that, instead of splitting a sentence into words, a word is split into candidate morphemes.

Here is a PDF showing a worked example of the discovery algorithm: File:Morfo-worked-example.pdf.

Status: Not started. This could be a fairly stand-alone project, and is suitable for a GSOC 2014 project.

Transfer Learning

Once the above is done, the learned word and link-types can be compared to the existing word and link-types, for English and for Russian. This can help clarify the actual syntactic roles of the learned groups and types. That is, the result of learning will be messy link-type names, such as LASDF, which could very possibly be more or less identical to the existing S (subject) link ... but we have no way of knowing this a priori. By comparing the learned parser to the existing parser, clearer labels could be extracted.

Status: cannot be started until all of the stages above are done.

Conclusion

The basic outline described above, excluding morphology, has been implemented. The pipeline is currently being evaluated for accuracy, and tuned as appropriate. We hope to hit a milestone "real soon now", which would then enable the next steps: automatic learning of reference resolution, and of deep semantic structure in general.
