Natural language generation

OpenCog has implemented multiple natural-language generation modules over the years. Two that are in current development are described on the Link Grammar wiki page. Another, which was in active development in 2015, combines the microplanner, which generates paragraphs, with the sureal surface-realization module, which generates individual sentences.

Natural language generation is part of the natural language processing subsystem.

Generation stages

The process of language generation involves a series of stages, which may be defined in various ways, such as the following (a toy code sketch of the whole pipeline appears after the list):

  • Content determination: figuring out what needs to be said in a given context.
  • Discourse planning: overall organization of the information to be communicated.
  • Lexicalization: assigning words to concepts.
  • Reference generation: linking words in the generated sentences using pronouns and other kinds of reference.
  • Syntactic and morphological realization: the generation of sentences via a process inverse to parsing, representing the information gathered in the above phases.
  • Phonological or orthographic realization: turning the above into spoken or written words, complete with timing (in the spoken case), punctuation (in the written case), etc.
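
To make the division of labor concrete, here is a toy sketch of these stages chained as a pipeline. Every function name and data shape below is hypothetical and deliberately trivial; nothing here corresponds to an actual OpenCog API, and the reference-generation stage is omitted for brevity.

  # A toy, purely hypothetical sketch of the stages above, chained as a
  # pipeline. None of these functions exist in OpenCog; each stage is
  # deliberately trivial, just enough to show how data might flow.

  def determine_content(context):
      """Content determination: pick the facts worth expressing."""
      return [f for f in context["facts"] if f["salient"]]

  def plan_discourse(facts):
      """Discourse planning: order the facts for presentation."""
      return sorted(facts, key=lambda f: f["order"])

  def lexicalize(fact):
      """Lexicalization: map concept identifiers to words."""
      lexicon = {"cat": "the cat", "chase": "chased", "mouse": "a mouse"}
      return [lexicon[c] for c in fact["concepts"]]

  def realize_syntax(words):
      """Syntactic/morphological realization: here, a fixed S-V-O order."""
      return " ".join(words)

  def realize_orthography(sentence):
      """Orthographic realization: capitalization and punctuation."""
      return sentence[0].upper() + sentence[1:] + "."

  context = {"facts": [
      {"salient": True, "order": 0, "concepts": ["cat", "chase", "mouse"]},
  ]}

  for fact in plan_discourse(determine_content(context)):
      print(realize_orthography(realize_syntax(lexicalize(fact))))
  # prints: The cat chased a mouse.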

All of these stages are important, and there is a nontrivial amount of feedback among them. However, there is also a significant amount of autonomy, such that it makes sense to analyze each one separately and then tease out its interactions with the other stages. Here we focus on the single stage of syntactic and morphological realization, which we refer to for simplicity as “sentence generation” (taking a slight terminological liberty, as “sentence fragment generation” is also included here).

One of the leading theories of sentence generation is Mel’čuk's Meaning-Text Theory (MTT). It is of interest in the present case for two reasons. First, the shallow syntactic layers of MTT resemble Link Grammar. Second, it becomes apparent that Link Grammar is not just a theory of natural language, but extends very naturally to a theory of graphs and networks, and to a theory of learning and knowledge acquisition. Link Grammar-inspired ideas have thus percolated into the deeper reaches of the AtomSpace design and the implementation of Atomese. It is therefore natural to push in the opposite direction: not only to learn new knowledge networks, but also to serialize those networks into a time-ordered sequence of sentences (or robot actions).

Stumbling blocks

The current stumbling blocks to good-quality language generation appear to be an inadequately sophisticated representation of the knowledge graph being fed into the generation system, together with an inadequate syntactic description of natural language. In particular, the current implementation fails to leverage the MTT concept of a lexical function.
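
For a sense of what is being missed: an MTT lexical function maps a word to the word that conventionally realizes a given semantic relation with it. For instance, the intensifier Magn yields Magn(rain) = heavy, so a generator aware of it says "heavy rain" rather than "big rain". The sketch below encodes two lexical functions as lookup tables; the collocations are ordinary English ones, while the Python representation itself is purely illustrative and not an OpenCog interface.

  # A minimal, illustrative encoding of two MTT lexical functions as
  # lookup tables. The collocations are ordinary English examples; the
  # data structure itself is hypothetical.

  LEXICAL_FUNCTIONS = {
      # Magn: the intensifier conventionally paired with a word.
      "Magn": {"rain": "heavy", "smoker": "heavy", "applause": "thunderous"},
      # Oper1: the light verb that turns a noun into a predication.
      "Oper1": {"decision": "make", "attention": "pay", "walk": "take"},
  }

  def apply_lf(name, word):
      """Look up the value of lexical function `name` at `word`."""
      return LEXICAL_FUNCTIONS[name].get(word)

  print(apply_lf("Magn", "rain"))       # heavy -> "heavy rain"
  print(apply_lf("Oper1", "decision"))  # make  -> "make a decision"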

An effort to obtain an improved syntactic description of language is being pursued in the unsupervised language learning project.

SegSim

But how might this work? Our approach, which we label SegSim, is relatively simple and proceeds as follows (a condensed code sketch appears after the list):

  • The system stores a large set of pairs of the form (semantic structure, syntactic/morphological realization)
  • When it is given a new semantic structure to express, it first breaks this semantic structure into natural parts, using a set of simple syntactico-semantic rules
  • For each of these parts, it then matches the part against its memory to find relevant pairs (which may be full or partial matches), and uses these pairs to generate a set of syntactic realizations (which may be sentences or sentence fragments)
  • If the above step generated multiple fragments, they are pieced together
  • Finally, a “cleanup” phase is conducted, in which correct morphological forms are applied, and articles and certain other “function words” are inserted
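
Here is a highly condensed sketch of that flow, under two simplifying assumptions: semantic structures are given as (predicate, role, argument) triples, and matching against memory is exact lookup rather than the full or partial matching described above. Every name below is hypothetical.

  # A condensed, hypothetical sketch of the SegSim flow. Semantic
  # structures are modeled as collections of (predicate, role, argument)
  # triples; the stored (semantic structure, realization) pairs live in
  # a dict, and matching is exact lookup. A real implementation would
  # use partial, similarity-based matching against a large store.

  MEMORY = {
      frozenset({("chase", "agent", "dog"), ("chase", "patient", "cat")}):
          "dog chase cat",
      frozenset({("sleep", "agent", "cat")}):
          "cat sleep",
  }

  def segment(semantics):
      """Break a semantic structure into natural parts: here, one part
      per head predicate, standing in for syntactico-semantic rules."""
      parts = {}
      for (head, role, arg) in semantics:
          parts.setdefault(head, set()).add((head, role, arg))
      return [frozenset(p) for p in parts.values()]

  def match(part):
      """Match a part against memory to obtain a realization fragment."""
      return MEMORY.get(part, "")

  def cleanup(fragments):
      """Piece fragments together. A real cleanup pass would also fix
      morphology and insert articles; here we only join, capitalize,
      and punctuate."""
      text = " and ".join(f for f in fragments if f)
      return text.capitalize() + "." if text else ""

  semantics = [
      ("chase", "agent", "dog"), ("chase", "patient", "cat"),
      ("sleep", "agent", "cat"),
  ]
  print(cleanup(match(p) for p in segment(semantics)))
  # prints: Dog chase cat and cat sleep.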

Obsolete implementations

As of 2015, the following implementations are obsolete:

  • NLGen (March 2009) sought to implement a version of the SegSim algorithm, using RelEx to produce semantic structures from sentences. See the NLGen launchpad page. See [1] to browse the source.
  • NLGen2 (2010) was a simplified, streamlined language-generation system; it took a rather different approach, but was still based on RelEx and Link Grammar as the underlying technology. See the NLGen2 launchpad page. See [2] to browse the source.

NLGen demos

Just for fun, here are some YouTube video demos of NLGen, circa 2009: