Concept formation

From OpenCog

Some experimental work for concept formation was done in 2009, and is described below. That work was within the context of the NLP pipeline. The opencog/nlp/chatbot-old/seme/README file provides more detail. The below is a cut-n-paste of a (possibly outdated) version of the README file. For the latest info, consult the README directly.

Although the work described below took place a long time ago, many of the ideas described below remain generally valid, even if the actual code-base is no longer in use. This wiki page does describe a fairly straight-forward mechanism for representing conceptual, linguistic knowledge in the AtomSpace.


This directory contains some notes and experimental code for extracting conceptual entities or "semes" from English text. An example of a "conceptual entity" would be the "Great Southern Railroad": a business, a railway, that existed at a certain point in space and time. This task encompasses what is commonly called "named entity extraction". This task is meant to be more general, encompassing what is called "reference resolution" (including pronoun resolution) as well.

The primary challenges of seme extraction are:

  1. Constructing a data representation that is amenable to reasoning, and to question answering.
  2. Recognising when two different semes refer to the same concept, (so that they can be merged)
  3. Recognizing when one seme (incorrectly) refers to distinct concepts, and should be split apart.
  4. Learning new conceptual classifications and relations; so, for example, when encountering a new word, to determine that it is, for example, a color.

Why semes?

The notion of a seme is introduced here to solve a "simple" technical problem in knowledge representation. The goal of using "semes" is to move away from using words and/or word-instances to represent "things". Thus, "Mary's shoes" and "Tom's shoes" would be two different semes, because they are two different "things" (although both things are related to the word "shoe").

By contrast, "Mary had some shoes. The shoes were red." has two distinct word-instances: the word-instance "shoe" in the first sentence, and the word instance "shoe" in the second sentence. However, these two words stand for the same concept: "Mary's red shoes". The goal of using semes is to collapse these two word-instances into one seme, without getting tangled with the fact that these two different word-instances of "shoe" are involved in different syntactic relations in different sentences (... sentences that were possibly uttered by different individuals at different times, even!)

Thus, "semes", as defined here, are meant to be an abstraction that behaves much like "concepts" (whatever that may be). Yet, they are also meant to be fairly closely related to "words": they are only one small step towards the general goal of "conceptualization". In particular, semes are meant to be sufficiently word-like that they can be used in most relations that words are used in. so, for example, if there's a RelEx2Frame structure that connects two words, then one could have exactly the same structure connecting two semes.

Thus, "semes" are removed by only a small step from linguistic usage; they provide a needed abstraction on the road to true "concepts" and are just flexible enough to support basic tasks, such as (basic) entity identification, reference resolution, and (basic) reasoning.

What is a Concept?

So far, almost all the processing described in nlp/triples/README has been in terms of graph modifications performed on individual sentences, containing WordInstanceNodes, and links to WordNodes. In order to promote text input into concepts, and to reason with concepts, we need to define what a concept is, and where its boundaries extend to. At this point, the goal is not to define a super-abstract notion of a concept that passes all epistomological tests, but rather a practical if flawed data structure that is adequate for representing data learned by reading, learned through linguistic corpus analysis. The emphasis here is "flawed but practical": it should be just enough to take us to the next level of abstraction.

Naively, a concept would seem to have the following parts:

  • a SemeNode, to serve as an anchor (could be ConceptNode)
  • a linguistic expression complex.
  • a WordNet sense tag (optional)
  • a DBPedia URI tag (optional)
  • an OpenCyc tag ... etc. you get the idea.
  • basic ontological links -- is-a, has-a, part-of, etc.
  • prepositional relations (next-to, inside-of, etc)
  • Context tag(s). See section "Context" below.

What is a "linguistic expression complex"? It deals with the idea that most concepts are not expressible as single words: for example, "Mary had a red baloon". The head concept here is "baloon": it is an instance of the class of all baloons, and specifically, this instance is red. Thus, a "linguistic expression complex" would consist of:

  • a head WordNode, to give single, leading name. Possibly several WordNodes to give it multiple names?
  • dependent modifier tags (e.g. "red")
  • a part-of-speech tag, to provide a rough linguistic categorization
  • a collection of disjunct tags, representing possible linguistic use of the WordNode to represent this concept.

Promoting Words to Concepts

Consider the task of promoting word-relations to concepts. Consider the following relationships:

  is_a(bark, sound)
  part_of(bark, tree)

We know that these two relations refer to different senses of the word "bark". Yet, if these two are deduced by reading, how should the system recognize that two different concepts are at play? How should the self-consistency of a set of relations be assessed? Assuming that the input text is not intentionally lying, then, under what circumstances do a set of conflicting assertions require that the underlying word be recognized as embodying two different concepts?

One possible approach is to assign tentative WordNet-based word-senses using either the Mihalcea algorithm, or table-lookup from syntax-tagged senses (see the wsd-post/README for details). One nice aspect of WordNet tagging is that the built-in WordNet ontology can be used to double-check, strengthen certain sense assignments: this, for example:

   bark%1:20:00:: has part-holonym tree_trunk%1:20:00::


   bark%1:11:02:: has direct hypernym noise%1:11:00::
   and inherited hypernym sound%1:11:00::

Thus, triples that have been read in, and tagged with WordNet senses, can be verified against the WordNet ontology for the correctness of sense assignments. While this is a reasonable starting point, and gives an easy leg-up, it does not solve the more general problem of distinguishing and refining concepts.

Another approach is to use part-of-speech tags, and disjunct tags, as stand-ins for word senses. That is, the parser has already identified different word-instances according to their part-of-speech, and so at least a rough word-sense classification is available from that. That is, it is safe to assume that a noun and a verb never represent the same concept (at a certain level...). It has also been seen (see wsd-post/README) that the disjunct used during parsing has a high correlation with the word-sense; the disjunct used during parsing can be considered to be a very fine-grained part-of-speech tag. Thus, instead of using Wordnet sense tags as concept "nucleation centers", the disjuncts could be used as such.

Two distinct processes are at play: 1) recognizing that two different word instances refer to the same concept, and 2) recognizing that a previously learned concept should be refined into two distinct concepts. (For example, having learned the properties of a "pencil", one must recognize at some point that a "mechanical pencil" and a "wooden pencil" have many incompatbile properties, and thus the notion of a pencil must be split into these two new concepts).

The most direct route to either of these processes is by means of "consistency checking": using forward and backward chaining to determine whether two distinct statements are compatible with each other. When they are, then the two different word-instances can be assumed to refer to the same concept; relationships can then be merged.


Almost all facts are contextual. You can't just say "John has a red ball" and promote that to a fact. You must presume a context of some sort: "Someone said during an IRC chat that John has a red ball", or, "While reading Emily Bronte, I learned that John had a red ball." The context is needed for two reasons:

1) When obtaining additional info within the same context, it is simpler/safer to deduce references, e.g. that the John in the second sentence is the same John as in the first sentence.

2) When obtaining additional info within a different context, it is simpler/safer to assume that references are distinct: that, for example, "John" in an Emily Dickinson novel is not the same "John" in an Emily Bronte novel.

Thus, it makes sense to tag recently formed SemeNodes with a context tag.

A priori vs. Deduced Knowledge

Consider the following:

  capital_of(Germany, Berlin)

This triple references a lot of a-priori knowledge. We know that capitals are cities; thus there is a strong temptation to write a processing rule such as "IF ($var0,capital) THEN ($var0,city)". Similarly, one has a-priori knowledge that things which have capitals are political states, and so one is tempted to write a rule asserting this: "IF (capital_of($var0, $var1)) THEN political_state($var1)".

A current working assumption of what follows is that the various rules will/should encode a minimum of a-priori "real-world" knowledge. Instead, the goal here is to create a system that can learn, deduce such "real-world" knowledge.

Definite vs. Indefinite

There is a subtle semantic difference between triples that describe definite properties, vs. triples that describe generic properites, or semantic classes. Thus, for example, "color_of(sky,blue)" seems unambiguous: this is because we know that the sky can only ever have one color (well, unless you are looking at a sunset). Consider "form_of(clothing, skirt)": this asserts that a skirt is a form_of clothing, and not that clothing is always a skirt. The form_of indicates a semantic category. Similarly, "group_of(people, family)" asserts that a family is a group_of people, and not that groups of people are families.

The distinction here seems to be whether or not the modifier was definite or indefinite: "THE color of ...." vs. "A form of.." or "A group of..."

XXX This is a real bug/hang-up in the triples processing code: being unaware of this distinction seems to cause some triples to come out "backwards" (i.e. that clothing is always a skirt). Caution to be used during seme formation!

Learning Semantic Categories

Consider the category of "types of motion". Currently, the RelEx frame rules include an explicit list of category members:


This list clearly encodes a-priori knowledge about locomotion. It would be better if the members of this category could be deduced by reading. There are three ways in which this might be done. One might someday read a sentence that asserts "Crawling is a type of locomotion". This seems unlikely, as this is common-sense knowledge, and common-sense knowledge is not normally encoded in text. A second possibility is to learn the meaning of the word "crawl" the way that children learn it: to have someone point at a centipede and say "gee, look at that thing crawl!" Such experiential, cross-sensory learning would indeed be an excellent way to gain new knowledge. However, there are two snags: 1) It presumes the existence of a teacher who already knows how to use the word "crawl", and 2) It is outside of the scope of what one person (i.e. me) can acheive in a limited amount of time. A third possibility is statistical learning: to observe a large number of statements containing the word "crawl", and, based on these, deduce that it is a type of locomotion.

In the following, the third approach is presumed. This is because the author has in hand both the statistical and the linguistic tools that would allow such observation and deduction to be made.

Consistency Checking

Consider the following three sentences:

  Aristotle is a man.
  Men are mortal.
  Aristotle is mortal.


  Berlin is the capital of Germany.
  Capitals are cities.
  Berlin is a city.

Assume the first two sentences were previously determined to be true, with a high confidence value. How can we determine that the third sentence is plausible, i.e. consistent with the first two sentences?

Upon reading the third sentence, it could be turned into a hypothetical statement, and suggested as the target of the PLN backward chainer. If the chainer is able to deduce that it is true, then the confidence of all three statements can increase: they form a set of mutually self-supporting statements.

So, for example, the above generate:

  isa(city, capital)
  isa(city, Berlin)

The prepositional construction XXX_of(A,B) allows the deduction that isa(XXX,A) (a deduction which can be made directly from the raw sentence input, and does not need to be processed from the prepositional form. (Right??) Certainly this is true for kind_of and capital_of, is this true for all prepositional uses of "of"?

Normally, a country can have only one capital; thus we need an exclusion rule:

  if capital_of(X,Y) and different(Y,Z) then not capital_of(X,Z)

There are potentially lots of such unique relations, so the above should be formulated as

  if R(X,Y) and uniq_grnd_relation(R) and different(Y,Z) then not R(X,Z)

Thus, we have a class of uniquely-grounded relations, of which capital_of is one. Part of the learning process is to somehow discover rules of the above form.

Using triples for input

Other problems: Consider the sentences: "A hospital is a place where you go when you are sick."

One may deduce that "A hospital is a place", but one must be careful in making use of such knowledge....

Concrete Data Representation

Let's now look at how to represent some of these ideas concretely, in terms of OpenCog hypergraphs.

First, a SemeNode will be used as the main anchor point. A SemeNode is used, instead of a ConceptNode, so as to leave ConceptNode open for other uses; the goal here is to minimize confusion/cross-talk between this and other parts of OpenCog.

Initially, when first creating a SemeNode, it should probably be given a name that is a copy of the WordInstanceNode that inspired it: "John threw a red ball" leads to

   SemeNode "ball@634a32ebc"

A basic name is needed for the concept, and so, in complete analogy betwen WordInstanceNodes and WordNodes, we create:

      SemeNode "ball@634a32ebc"
      WordNode "ball"

although the lemmatized form should probably appear here (i.e. the LemmaLink would be used to obtain the source WordNode). The idea that it's red would then be:

      DefinedLinguisticRelationNode "amod"
         SemeNode "ball@634a32ebc"
         WordNode "red"

At some point, we need to convert the above to:

      SemanticRelationNode "color_of@6543"
         SemeNode "ball@634a32ebc"
         SemeNode "red@a47343df"

This would need to work by recognition that "amod" together with "red" implies that "red" is a color. Could probably be done with a rule.

  IF amod($X,$Y) ^ is-a($X, object) ^ is-a($Y, color)
  THEN color_of($X,$Y) ^ &delete_link(amod($X,$Y))

How do we bootstrap to there? Via upper-ontology-like statments: "Red is a color" and "A ball is an object". At some later, more abstract stage, one must ask: "Is a ball the kind of object that can have a color?"; but at first, we shall start naively, and assume that it is.

Contextual representation

A given SemeNode can be relevent to one or more contexts. The relationship is indicated with a link to a named context. Say, for example that, during IRC chat, that JaredW stated that "The ball is red". We'd then have a

      ContextNode "# IRC:JaredW"
      SemeNode "ball@634a32ebc"

At this time, the creation/naming of ContextNodes would be ad-hoc, on a case-by-case basis. All input from the MIT ConceptNet project would be marked with with something like

   ContextNode "# MIT ConceptNet dump 20080605"

A SemeNode, once determined to be sufficiently general, might belong to several concepts. There might be a hierarchy of ConceptNode inclusions: so, for example, concepts in "common-sense" contexts, such as ConceptNet, would be judged to be sufficiently universal to also hold in IRC contexts, Project Gutenberg contexts, Wikipedia contexts, etc.

The only reason for using ContextLinks instead of

        DefinedRelationshipNode "Context"
            ContextNode "# IRC:JaredW"
            SemeNode "ball@634a32ebc"

Is to save a bit of RAM storage; there will be at least one ContextLink for every SemeNode.


A key step in concept formation is determining if/when two distinct instances are really the same concept. This is to be accomplished by comparing two concepts, and returning a (simple) truth value indicating the likelihood that they are the same. Many algos are possible. The simplest might be the following:

Take a weighted average of link-comparisons, comparing:

  • WordNode. A mismatch here means that it is highly unlikely that the concepts are identical, unless the WordNode is a pronoun.
  • ContextNode. A mismatch here means it's highly unlike that the concepts are identical, unless the ContextNode is one of the base "common-sense" contexts.
  • Compare modifiers. A modifier present in one, but absent in the other, is "neutral". Conflicting modifiers suggest a conceptual mis-match: If the current sentence calls a ball "green", while a previous one called it "red", then the two references are probably to two different balls. Ditto for big, small, light, heavy, etc.
  • Compare relations, e.g. capital_of, next_to, etc. Much like comparing modifiers.


  • Read sentence from some source.
  • Process sentence for triples
  • Create initial SemeNodes that match key word instances.
  • Add (temporary?) ContextLinks to indicate source.
  • Scan existing SemeNodes for possible match.
  • Merge SemeNodes if a plausible match is found. ("clustering")

Current Implementation Status

The current implementation does just about none of the whiz-bang stuff discussed above. So far, just the most basic scaffolding has been set up.

All "seme promoters" function by accepting a word-instance, and returning a corresponding seme. The various promoters are of different levels of sophistication, and use different algoorithms. They all create an InheritenceLink relating the original word instance to the seme, like so:

     WordInstanceNode hello@123
     SemeNode  greeting@789

Most/all of these also create a link to the lemmatized word form i.e. "the English-language word" that corresponds to the seme:

     SemeNode  greeting@789
     WordNode  hello

All of these promoters could be, and eventually should be re-implemented as ImplicationLinks, so that all promotion runs entirely withing OpenCog. For now, they are implemented in scheme.

There are currently three promoters


Creates a new, unique SemeNode for *every* input WordInstanceNode. That is, no two words are ever assumed to refer to the same seme.


Creates a new SemeNode only if there isn't one already having the same lemma as the word instance. That is, it assumes that any given word always refers to the same seme. This is the "opposite* behaviour from the trivial promoter.


Re-uses an existing seme only if it has a superset of the modifiers of the word-instance. Otherwise, it creates a new seme.


Suppose we have a set of (consistent) prepositional relationships. What can we do with them? For example, can we deduce that a certain verb is a type of locomotion, based on its use with regard to prepositions?

Hmm. Time to write some rules, and experiment and see what happens. Not clear how unambiguous the copulas and preps will be.


Create the following new atoms types: ContextNode SemanticRelationshipNode

ToDo: Reification of triples ... Re-examine Markov Logic Networks ... Phrasal verbs vs. prepositional phrases


[FEAT] Feature extraction. See

Alexander Yates and Oren Etzioni. Methods for Determining Object and Relation Synonyms on the Web. Journal of Artificial Intelligence Research 34, March, 2009, pages 255-296.

Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum SOFIE: A Self-Organizing Framework for Information Extraction WWW 2009 Madrid!