Link Grammar

Link Grammar is a specific theory of natural language that has heavily influenced the design of the AtomSpace and other parts of OpenCog.

Overview

The Wikipedia article on Link Grammar provides a reasonable overview of the theory. In short, it is a kind of dependency grammar that indicates how the words in a sentence are related to one another. It encodes the syntax of natural language and, due to its nature, it also encodes semantics to a shallow degree. In this last sense, it provides a natural bridge to the ideas of deep syntax and semantics, as described by Meaning-Text Theory (MTT).

The basic concepts used in Link Grammar are those of "connectors" and "disjuncts". Connectors define how two words may connect to one another, while a disjunct is a sequence of connectors, all of which must be satisfied in order to obtain a valid parse of a sentence. A lexical entry in Link Grammar associates a collection of disjuncts to each natural-language word. Disjuncts thus act as a kind of fine-grained "part of speech". For example, transitive verbs require both a subject and an object: Link Grammar defines connector types for subjects and objects, and assigns to transitive verbs a disjunct that contains both connectors, as the dictionary fragment below illustrates.
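
For concreteness, here is a tiny dictionary fragment written in the notation used by the Link Grammar dictionary files. The fragment is illustrative, and is not copied from the shipped English dictionary. The "+" and "-" suffixes indicate whether a connector mates to the right or to the left, and "&" joins the connectors of a disjunct:

  % "chased" is a transitive verb: the disjunct "S- & O+" says it
  % needs a subject connector to its left and an object connector
  % to its right.
  chased: S- & O+;

  % Determiners connect rightwards to a noun.
  the a: D+;

  % A noun takes a determiner on the left, and then serves either
  % as a subject (linking right to a verb) or as an object
  % (linking left to a verb).
  dog cat: D- & (S+ or O-);

With this fragment, a sentence such as "the dog chased a cat" receives a complete parse: every connector on every word finds a mate.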

Generalization

Deeper exploration indicates that the concept of connectors, connector sequences and lexical entries are generically applicable to graphs and graph theory. Link Grammar links correspond to the edges of a graph. Connectors correspond to "half-edges": the possibility of creating a full edge by connecting together two half-edges. A connector sequence corresponds to a vertex in a graph. These basic concepts can be further shown to obey the defining axioms of a sheaf, as in sheaf theory. The page on Connectors and Sections reviews how this generalization is broadly applicable to computing and graphs.
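
Stated as code, the mating rule implicit in this picture can be sketched in Scheme as follows. This toy predicate is illustrative only: it hard-codes the "+"/"-" convention, whereas Link Grammar's actual matching rules, and the generalized constraints discussed below, are richer:

  ; Two half-edges (connectors) can join into a full edge when they
  ; carry the same type label and point in opposite directions.
  (define (can-mate? type-a dir-a type-b dir-b)
    (and (string=? type-a type-b)
         (or (and (string=? dir-a "+") (string=? dir-b "-"))
             (and (string=? dir-a "-") (string=? dir-b "+")))))

  ; Example: a subject connector on a noun mates with one on a verb.
  (can-mate? "S" "+" "S" "-")   ; => #t
  (can-mate? "S" "+" "O" "-")   ; => #f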

These core concepts have been reified into AtomSpace Atom types; a short Atomese sketch follows the list. The types are:

  • Connector corresponds to the idea of a Link Grammar connector. It generalizes the connector by splitting it into two parts: an identifier for the connection type, and a connection constraint, the SexNode.
  • The SexNode generalizes the idea of the "+" and "-" directions in a Link Grammar connector. The SexNode can hold arbitrary user-defined connection constraints.
  • The ConnectorSeq generalizes the idea of a Link Grammar "disjunct". It is an ordered sequence of connectors.
  • The Section generalizes the idea of a dictionary entry in the Link Grammar dictionary. It consists of two parts: the "germ" (corresponding to the word of the lexical entry) and the associated ConnectorSeq (the associated disjunct).
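
Putting these together, the transitive-verb entry from the dictionary fragment above might be written in Atomese roughly as follows. This is a minimal sketch: the choice of ConceptNode for the connector-type identifier, and of "+" and "-" as the SexNode names, are assumptions made for illustration; in principle, any Atoms could be used in those positions:

  ; A Section for the transitive verb "chased", playing the role
  ; of the Link Grammar dictionary entry "chased: S- & O+;".
  (Section
     ; The "germ": the word this lexical entry is for.
     (WordNode "chased")
     ; The ConnectorSeq plays the role of the disjunct.
     (ConnectorSeq
        ; Subject connector, mating to the left.
        (Connector (ConceptNode "S") (SexNode "-"))
        ; Object connector, mating to the right.
        (Connector (ConceptNode "O") (SexNode "+"))))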

Because the AtomSpace is typed, it is useful to define some subsidiary types. These aren't really central concepts, but they do provide a set of base classes from which other key Atom types are derived. These are:

  • LexicalNode is a base class for anything that can be looked up in a dictionary. It is the base class for WordNode.
  • BondNode corresponds to the Link Grammar concept of a "link". Unfortunately, the word "link" is already heavily used in the AtomSpace, so it is renamed to "bond". This is provisional; this atom type is currently not used very much.

The details of this generalization are reviewed in the github sheaf directory, and specifically in a collection of PDFs that can be found in the sheaf/docs subdirectory.

Parser

Link Grammar has a reference implementation, which can be found in the Link Grammar github repo. That implementation contains source code for a parser, hand-crafted dictionaries for English and Russian, and prototypes/proofs-of-concept for nine other natural languages (Kazakh, Turkish, Persian, Arabic, Lithuanian, German, Vietnamese, Indonesian, Hebrew). The main Link Grammar web page provides more information. The intro web page provides a detailed introduction to Link Grammar, specifically.

Within OpenCog, the parser and the dictionaries can be accessed directly. The LgDictNode provides a method for looking up word expressions in a given dictionary and importing them into the AtomSpace (as Atomese). The LgParseLink provides an interface to the parser: given a sentence, it will invoke the Link Grammar parser and place the parse results into the AtomSpace as Atomese.
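
A minimal sketch of driving this from Scheme, patterned on the examples in the lg-atomese repo; the module names and the exact argument conventions are assumptions that may vary between versions:

  (use-modules (opencog) (opencog exec)
               (opencog nlp) (opencog nlp lg-parse))

  ; Parse a sentence with the English dictionary, asking for at
  ; most four linkages. The parse results are deposited into the
  ; AtomSpace as Atomese, where they can be examined with queries.
  (cog-execute!
     (LgParseLink
        (PhraseNode "this is a test")
        (LgDictNode "en")
        (NumberNode 4)))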

Generation

Given a dictionary and/or a collection of Sections, one might want to generate grammatically valid sentences, based on the syntax rules encoded in that dictionary. There are more than a half-dozen different efforts to do this, in various states of development and disrepair. Two that are in active development are:

  • The link-generator command that is available with the main Link Grammar source code (or rather, will be available in version 5.9.0, once the "generate" branch in github is merged). This command is in active development. In its current form, it allows a sequence of wild-cards to be specified, and it will "fill in the blanks" with any words that would be grammatically valid in those locations. For example, given the input "This is a * idea", it will insert all possible adjectives for the star.
  • The opencog/generate framework, using the generalized sheaf concepts sketched above. The goal is to be able to assemble Sections into sheaves, in the general sense. So far, a proof-of-concept has been implemented. It can expand simple Link Grammar dictionaries into valid sentences. It can also expand more general graph specifications into all possible graphs that match the graph constraints.

Two more systems are in stasis. They were developed, and they more-or-less "work", for an appropriate definition of "work". They are not being actively maintained at this time, and they might be bit-rotting. They are Surface Realization (SuReal) and Microplanner. The general idea is that one can assemble a collection of concepts, and they will be converted into sentences. SuReal is responsible for generating individual sentences, while the Microplanner is responsible for generating paragraphs.

My (User:Linas) personal impression is that SuReal and Microplanner are stagnant because of the inadequate development of a theory of "deep syntax" and semantics. In particular, these two are not properly founded on MTT, and fail to make use of MTT lexical functions. Basically, the input into these two is not sufficiently sophisticated to allow these systems to be articulate the way a human might be. Reaching that level of sophistication is one of the project goals.

A general overview can be found in the wiki page for natural language generation.

Learning

The English and Russian dictionaries that are provided with Link Grammar are hand-crafted. It is clear that hand-crafting dictionaries is not a viable long-term strategy for AGI. Practical experience shows that the hand-crafted dictionaries are just barely suitable for use as stepping-stones or building blocks. Thus, there is a need to automatically learn grammar and syntax, in an unsupervised fashion. This is effected in the unsupervised language learning project.