The process of language generation involves a series of stages, which may be defined in various ways, such as:
- Content determination: figuring out what needs to be said in a given context
- Discourse planning: overall organization of the information to be communicated
- Lexicalization: assigning words to concepts
- Reference generation: linking words in the generated sentences using pronouns and other kinds of reference
- Syntactic and morphological realization: the generation of sentences via a process inverse to parsing, representing the information gathered in the above phases
- Phonological or orthographic realization: turning the above into spoken or written words, complete with timing (in the spoken case), punctuation (in the written case), etc.
All of these stages are important, and there is a nontrivial amount of feedback among them. However, there is also a significant amount of autonomy, such that it makes sense to analyze each one separately and then tease out its interactions with the other stages. Here we focus on the single stage of syntactic and morphological realization, which we refer to for simplicity as “sentence generation” (taking a slight terminological liberty, as “sentence fragment generation” is also included here).
Sentence generation may be achieved in many ways; the variety of algorithms in the literature surely covers only a small part of the space of possibilities. However, when one thinks about how sentence generation might most feasibly be achieved in a brain or brainlike system, the range of possibilities narrows somewhat. Brains are not particularly good at carrying out complex serial algorithms requiring explicit exploration of large search trees. On the other hand, they are very good at carrying out approximate similarity matching of target items against large sets of remembered items. Thus, the most neurally feasible sentence generation approach would be one that eschews complex algorithmic search processes in favor of massively parallel similarity matching.
But how might this work? Our approach, which we label SegSim, is relatively simple and proceeds as follows:
- The system stores a large set of pairs of the form (semantic structure, syntactic/morphological realization)
- When it is given a new semantic structure to express, it first breaks this semantic structure into natural parts, using a set of simple syntactico-semantic rules
- It then matches each of these parts against its memory to find relevant pairs (which may be full or partial matches), and uses these pairs to generate a set of syntactic realizations (which may be sentences or sentence fragments)
- If the above step generated multiple fragments, they are pieced together
- Finally, a “cleanup” phase is conducted, in which correct morphological forms are inserted, and articles and certain other “function words” are inserted
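The five steps above can be sketched in a few dozen lines of Python. This is a toy illustration under stated assumptions, not the NLGen implementation. The relation triples are loosely modelled on RelEx-style output such as _subj(eat, cat). The names StoredPair, MEMORY, segment, realize, expand, cleanup and generate, the one-part-per-head segmentation rule, the Jaccard similarity measure, and the tiny lexicon tables are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

# (relation name, head word, dependent word); an assumed representation.
Relation = tuple[str, str, str]


@dataclass(frozen=True)
class StoredPair:
    relations: frozenset  # relations over slot names, e.g. ("_subj", "V", "S")
    template: str         # realization over the same slots, e.g. "{S} {V} {O}"


# Step 1: a toy memory of (semantic structure, realization) pairs.
MEMORY = [
    StoredPair(frozenset({("_subj", "V", "S"), ("_obj", "V", "O")}), "{S} {V} {O}"),
    StoredPair(frozenset({("_subj", "V", "S")}), "{S} {V}"),
    StoredPair(frozenset({("_amod", "N", "A")}), "{A} {N}"),
]

# Hypothetical lexicon tables used only by the cleanup phase.
INFLECTED = {"eat": "eats"}
NOUNS = {"cat", "fish"}
ADJECTIVES = {"small"}


def segment(relations):
    """Step 2: split the structure into parts, one per head word (a toy rule)."""
    parts = {}
    for rel in relations:
        parts.setdefault(rel[1], []).append(rel)
    return parts


def similarity(part, pair):
    """Jaccard overlap between the relation names of a part and a stored pair."""
    a = {name for name, _, _ in part}
    b = {name for name, _, _ in pair.relations}
    return len(a & b) / len(a | b)


def realize(part):
    """Step 3: pick the most similar stored pair and fill its template."""
    best = max(MEMORY, key=lambda pair: similarity(part, pair))
    by_name = {name: (head, dep) for name, head, dep in part}
    bindings = {}
    for name, head_slot, dep_slot in best.relations:
        if name in by_name:  # a partial match leaves some slots unbound
            bindings[head_slot], bindings[dep_slot] = by_name[name]
    words = [bindings.get(tok.strip("{}"), "") for tok in best.template.split()]
    return " ".join(w for w in words if w)


def expand(head, fragments):
    """Step 4: splice fragments together (assumes the relations are acyclic)."""
    out = []
    for word in fragments[head].split():
        if word != head and word in fragments:
            out.append(expand(word, fragments))
        else:
            out.append(word)
    return " ".join(out)


def cleanup(text):
    """Step 5: insert inflected forms and a definite article per noun phrase."""
    out, in_np = [], False
    for word in text.split():
        if word in ADJECTIVES or word in NOUNS:
            if not in_np:
                out.append("the")
            in_np = word in ADJECTIVES  # a noun closes the noun phrase
        else:
            in_np = False
        out.append(INFLECTED.get(word, word))
    return " ".join(out)


def generate(relations):
    fragments = {head: realize(part) for head, part in segment(relations).items()}
    dependents = {dep for _, _, dep in relations}
    roots = [head for head in fragments if head not in dependents]
    return cleanup(" ".join(expand(root, fragments) for root in roots))


semantics = [("_subj", "eat", "cat"), ("_obj", "eat", "fish"), ("_amod", "fish", "small")]
print(generate(semantics))                  # the cat eats the small fish
print(generate([("_obj", "eat", "fish")]))  # eats the fish
```

The second call shows a partial match: only the object relation is present, so the subject slot of the best-matching template stays empty and the result is a sentence fragment rather than a full sentence. A realistic system would need a far larger memory and a structural similarity measure rather than a comparison of relation names.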
The codebase called NLGen (as of March 2009) seeks to implement a version of the SegSim algorithm, using RelEx to produce semantic structures from sentences. See the NLGen Launchpad page.