AtomSpace

From OpenCog
(Redirected from Atomspace)

The OpenCog AtomSpace is a knowledge representation (KR) database and the associated query/reasoning engine to fetch and manipulate that data, and perform reasoning on it. Data is represented in the form of graphs, and more generally, as hypergraphs; thus the AtomSpace is a kind of graph database, the query engine is a general graph re-writing system, and the rule-engine is a generalized rule-driven inferencing system. The vertices and edges of a graph, known as Atoms, are used to represent not only "data", but also "procedures"; thus, many graphs are executable programs as well as data structures. These Atoms, which are permanent and immutable, can be assigned fleeting, changing Values to indicate the truth or likelihood of that atom, or to hold other kinds of transient data. The AtomSpace enables flow-based programming, where Atoms represent the pipes, and Values are what flows through the pipes.

This wiki page attempts to communicate a general flavor of what the AtomSpace is. A more principled theoretical presentation can be found in this PDF: opencog/sheaf/docs/ram-cpu.pdf (the "Graphs, Metagraphs, RAM, CPU" paper.)

Overview

The AtomSpace is a graph database that adds many features, tricks, concepts and developments not found in other systems. Here is a thumbnail sketch:

  • It adds a number of predefined relations (called Atoms) to do typical relational things. For example, InheritanceLink is the classic "is-a" relation -- x is-a y. Another is EvaluationLink, which is used for "semantic triples" -- x R y for R some arbitrary named relation R. We call R a PredicateNode, and so, for example, for "Jack owns a computer", one has x==Jack, R==owns and y==computer and so one writes (Evaluation (Predicate "owns") (List (Concept "Jack") (Concept "computer"))) In first-order logic, one writes P(x,y) instead of x R y, whence the name predicate.
That is, the Atomspace is designed to hold things like ontologies, taxonomies, and knowledge representation in general.
  • Next, a conceptual leap: one is not limited to just P(x,y), but one can have arbitrary numbers of arguments. These arguments can be other Atoms, nested arbitrarily deep, which is what makes it a "graph database". And finally, there is no force-fit schema, which is why it's not an SQL database. So, for example, "triple stores" have a force-fit schema: everything must be a triple, of the form x R y. In other words, a table with 3 columns. For the Atomspace, "anything goes". Otherwise, it would have been just SQL: it may not be obvious to some, but in fact, SQL is-a kind-of graph database (that's what "primary keys" are for!), it just forces you to pre-declare your schema, i.e. it forces you to use tables. In the AtomSpace, you can declare a schema, if you wish, but it's optional and mostly not used.
  • Next, each predicate (more generally, each Atom) has a truth-value. Classically, this is true/false (e.g. "it is true that Jack owns a computer"). The next conceptual leap is this chain of generalizations: crisp true/false -> probability -> probability+confidence -> list-of-floats -> arbitrary json struct -> arbitrary key-value-db -> arbitrary key-value-db with rapidly time-varying value-streams.
So, in this example, "Jack owns a computer" has an associated key-value database on it. One of the keys might hold the truthiness of this statement. Another key might hold its probability. Another key might hold the time-varying value of the physical distance between Jack and the computer, or maybe the pixel-values on the screen at some instant in time (e.g. "right now"). These are called "Values". The name "Value" is meant to suggest the concept of valuations in logic.
  • The distinction between Atoms and Values might seem arbitrary and pointless. It is not. Atoms define a graph, that graph can be searched with a query language. There is a real performance hit to maintaining the needed indexes to be able to perform a query: inserting and deleting atoms is slow. But at least you know how they are connected. Not so for Values: there's no index. They're not searchable. But they are fast.
The metaphor is this: imagine the AtomSpace as holding a bunch of pipes, plumbing. The Values are the water that flows through the pipes. Performance-wise, its fairly hard/slow to change the graphs, but the values can change constantly.
  • Finally, another half-dozen magic ingredients:
  • The query language is a full-fledged relational algebra, and more. This makes it far more powerful and advanced than any other query language on any other graph database (that I know of. Maybe someone else has this? I haven't seen it yet.)
  • Queries from the query language are graphs themselves. So queries can be stored in the Atomsapce. This is very unlike SQL, where you cannot store a query in the database itself. I think this is also very unlike every other graph DB (not sure). In particular, this allows you to perform reverse-queries: given an answer, find all the questions which it answers. Note that all chat-bots are in fact custom-purpose reverse-query databases (consider the I * you -> I * you too rewrite rule from AIML). The AtomSpace generalizes this.
  • Some Atoms (some graphs) are executable. For example, PlusLink knows how to actually add numbers together. PlusLink is backed by a C++ class that performs addition. This makes the AtomSpace be dual-mode: it can represent knowledge as a graph, and some graphs are executable. In particular, you can operate on Values in this way. It is a full-fledged language, and it is called Atomese. It is Turing complete -- it supports recursion, lambdas, and so on.
  • A relation P(x,y), together with it's truth-value, can be thought of as a matrix. So there is an API to access P(x,y) as if it was an actual matrix, so that you can do linear algebra with it, and so you can do probability with it (so that, for example, P(x,y) is a joint probability distribution). This may sound boring, except that the AtomSpace can naturally encode extremely sparse matrices: e.g. one-million-by-one-million. Consider, for example, the English language, with one million words and place-names. There are potentially 1M x 1M = 1 trillion word-pairs. At 4 bytes per float (to store a matrix), that would require 4 petabytes of RAM (or disk!) if it were not sparse. But, in fact it is, and the AtomSpace is ideal for storing sparse multi-dimensional data. That is, a single Atom (say, a Link) is single site, a location in a tensor; an associated numeric Value is the value. We are using this to explore neural nets, which currently are limited to vectors of dimension of less-than a few-hundred, or thousand at most.
  • A relation P(x,y,z,...) can be considered to be a parse-table entry. These can be joined together (by contracting "tensor indexes") to obtain parses. Alternately, it can be considered to be a sequent in natural deduction, and so one can perform theorem proving. This is an active area of research for the AtomSpace.

Thus, the AtomSpace brings together, in one place, a large collection of different-but-related concepts, fairly well-balanced as a software platform, and allows experimentation and research in how they fit together. The point is that all of these things are hot topics these days: graph DB's and neural nets and theorem provers and natural-language systems. You can explore them in isolation, of course, or you can explore how they all interconnect with one-another. This is what the AtomSpace provides: a place to explore the synergy between different theories of data and computation.

Position Statement

The AtomSpace can be thought of in several different ways (outlined next). Primarily, it is meant to be a graph database for storing knowledge. It's design is driven by several philosophical principles; the primary principle is that "all state should be visible to all algorithms". A variant of this principle is well-known to designers of distributed computing systems (i.e. "the cloud"): to move state from one computer to another, the state has to be where you can find it and grab it and transport it. A variant of this shows up in distributed functional languages, like Scala, or functional languages in general, such as Haskell. The AtomSpace tries to extend this principle not only to data movement, and data storage, but to general AI algorithms: learning, logical inference, data mining. In order for some algorithm to perform some reasoning about some data, that algorithm needs to be able to find that data. The place to "find it" is in the atomspace.

To summarize the philosophy: all OpenCog state is in the Atomspace. There isn't any state that isn't in the AtomSpace; it can't be found under a rock, or tucked away in some object. The AtomSpace is one giant closure.

As a database

The AtomSpace is effectively a database, and has many common database-like properties. The basic data structure is that of a graph; so the AtomSpace is a graph database, and is used primarily as an in-RAM database.

The basic structures used to represent the graph are Nodes and Links. These can be created (in RAM), manipulated and traversed without using the AtomSpace. The AtomSpace provides a database structure that is not available by using "naked" atoms:

  • Uniqueness of Atoms. An AtomSpace will contain only one single Node of a given name, and one single Link with a given outgoing set. This is analogous to a key-constraint in an ordinary database, so that, for example, employee "Jane Doe", her salary and her title is stored uniquely, once, with a unique identifier. This helps ensure data reliability, e.g. by avoiding accidentally storing two different salaries for the same person.
  • Indexes. The AtomSpace contains several indexes to provide fast access to certain classes of atoms, such as queries by atom type, by atom name, by outgoing set, and the like. Just as in ordinary databases, indexes are used to speed the search: they affect the overall performance, but are not otherwise directly visible to the user.
  • Persistence. The AtomSpace is primarily used as an in-RAM database, with all of the performance advantages of having data cached in RAM. However, sometimes you need to store that data for longer periods of time; thus, the contents of an AtomSpace can be saved-to/restored-from some storage medium. One currently-used back-end is Postgres, (a traditional SQL database).
  • Decentralized computing vs. distributed computing. The design philosophy of encoding all state as graphs, and disallowing any hidden state means that atoms are easily transported from one computer to another, or shared among network servers. This is achieved by delegating this function to the back-end. There are many strong, scalable, mature distributed computing solutions; rather than re-inventing this technology, the AtomSpace uses them via the backend API. In the current implementation, this means Postgres. See ProxyNode and Decentralized AtomSpace for details.
  • Query language. A query language provides a way for searching for and finding the data contained within them. The AtomSpace provides a sophisticated and powerful pattern search language, Atomese. It differs from SQL because AtomSpace structures are graphs, not rows in a table. It differs from Gremlin, in that it works at a higher, more abstract level, offering more powerful and refined constructs (basically, its more compact, and easier-to-use). It is partly inspired by DataLog, when the graphs encode logical expressions. We sometimes call it "pattern matching", but this name is perhaps misleading, because the programming language, Atomese, is stored in the AtomSpace itself (and thus differs from the pattern matchers built into Lisp, Haskell, and other programming languages).
  • Change notification. The AtomSpace provides signals that are delivered when an atom is added or removed, thus allowing actions to be triggered as the contents change.

As a symbol table

Once an atom is placed in an atomspace, it becomes unique. Thus, atoms are the same thing as symbols, and the atomspace is a symbol table. The unique ID of a Node is it's string name, and the unique ID of a Link is it's outgoing set.

Examples of symbol tables can be found in python and scheme. In Prolog, symbols are called atoms. While Prolog makes a distinction between atoms, variables and (compound) terms, OpenCog does not: all of these are considered to be atoms of different types (e.g. the VariableNode).

In some programming languages, such as scheme and LISP, symbols can be given properties. The analogous concept here is that of a valuation: each OpenCog atom can be given various different properties, or values.

As a tuple space or blackboard

The intent of the AtomSpace is to offer persistent manipulable storage. As such, it resembles a tuple space. Some tuple spaces take the form of "Object Spaces"; however, unlike an Object Space, the AtomSpace does not offer mutual exclusion of object access. Instead, both Atoms and Values are immutable, while the association between an Atom and a Value can be freely updated by anyone at any time.

Insofar as arbitrary knowledge can be place into the AtomSpace, it roughly follows the blackboard design pattern, that is, it resembles a form of blackboard system.

As a knowledge representation system

Knowledge representation and reasoning in the AtomSpace is primarily accomplished with a type system, formalized on type theory. The goal of formalizing the system is to avoid ad hoc design and implementation decisions. The formality makes it easier to transpose modern AI research results and algorithms into Atomese.

Atomese provides pre-defined Atoms for many basic knowledge-representation and computer-science concepts. These include Atoms for relations, such as similarity, inheritance and subsets; for logic, such as Boolean and, or, for-all, there-exists; for Bayesian and other probabilistic relations; for intuitionist logic, such as absence and choice; for parallel (threaded) synchronous and asynchronous execution; for expressions with variables and for lambda expressions and for beta-reduction and mapping; for uniqueness constraints, state and a messaging "blackboard"; for searching and satisfiability and graph re-writing; for the specification of types and type signatures, including type polymorphism and type construction (dependent types and type variables TBD).

Atomese

Because of these many and varied Atom types, constructing graphs to represent knowledge looks kind-of-like "programming"; the programming language is informally referred to as Atomese. It vaguely resembles a strange mashup of SQL (due to queriability), prolog and datalog (due to the logic and reasoning components), intermediate languages (due to graph rewriting), LISP and Scheme (due to lambda expressions), Haskell and CamL (due to the type system) and rule engines (due to the forward/backward chaining inference system).

This "programming language" is NOT designed for use by human programmers (it is too verbose and awkward for that); it is designed for automation and machine learning. That is, like any knowledge representation system, the data and procedures encoded in Atomese are meant to be accessed by other automated subsystems manipulating and querying and inferencing over the data/programs. More narrowly, it looks like an intermediate language seen inside of compilers and dynamic programming languages: it represents knowledge as abstract syntax trees, which can be manipulated, altered and optimized (that is, the optimizer inside a compiler is an example of a term rewriting system, designed to transform a source language into an optimized machine language; the intermediate language is where the optimization takes place)

Atomese has aspects of declarative programming, procedural program and functional programming (as well as query programming, of course). Viewed as a programming language for procedural or functional programs, the current implementation of Atomese is actually very slow, inefficient and not scalable, mostly because it is interpreted, rather than compiled. The API itself provides a natural way to compile and cache bytecodes; its just not implemented. Until recently, the design has been driven by the need to enable generalized manipulation of declarative networks of probabilistic data by means of rules and inferences and reasoning systems. That is, it extends the idea of probabilistic logic networks to a generalized system for automatically manipulating and managing declarative data.

As a research platform

The use of the AtomSpace, and the operation and utility of Atomese, remains a topic of ongoing research and change, as various dependent subsystems are brought online. These include machine learning, natural language processing, motion control and animation, planning and constraint solving, pattern mining and data mining, question answering and common-sense systems, and emotional and behavioral psychological systems. Each of these impose sharply conflicting requirements on the system architecture; the AtomSpace and Atomese is the current best-effort KR system for satisfying all these various needs in an integrated way. It is likely to change, as the various current short-comings, design flaws, performance and scalability issues are corrected.

The examples directory contains demonstrations of the various components of the AtomSpace, including the python and scheme bindings, the pattern matcher, the rule engine, and many of the various different atom types and their use for solving various different tasks.

Hypergraph Database Design Requirements

The AtomSpace is a container for storing (hyper-)graphs. It is optimized for multiple styles of data and algorithms. These include:

  • Working with (extremely) large sparse vectors of complex data structures, such as those found in linguistics.
  • Providing support for (symbolic) inference engines and theorem provers.
  • Provide a general query system capable of not only solving the general graph matching problem, but do so with very high performance.
  • Scale up to extremely large datasets. Current production applications can work with hundreds of millions of Atom]s; this is more than can fit in RAM (for ordinary mid-range server systems).
  • Provide network-distributed storage, and generalized processing and data-flow pipelines. The goal is to have servers that can process and transform symbolic, graphical data located in ... large sparse vectors of complex structured (and unstructured) data.

Current Implementation

This section reviews some aspects of the current implementation, including why certain design choices were made, and also some general performance bottlnecks and issues.

General Critique

Some features and issues.

  • The AtomSpace can be accessed using C++, Scheme, Python, OCaML or Haskell. Low-level access via C++ is about 4x faster than access with scheme, and 10x faster than with python; thus scheme and python are best suited for invoking high-level functions. Guile version 3.0 adds compiled bytecode via GNU Lightning, and so begins to approach C++ for performance. The primary bottleneck here is having to constantly jump between the C++ and the guile (or python) code. This is solved by moving more and more performance-critical code to C++.
  • When code is moved "from scheme/python to C++", the best practice is to design a new Atom Type, so that the type itself does the work. For example, TimesLink can (pointwise) multiply two vectors, and AccumulateLink can sum up vector components. Thus, a dot-product can be written in Atomese as (Accumulate (Times vectorA vectorB)). This allows the dot-product to be stored in the AtomSpace itself (and thus queried, manipulated, transformed ...). This is vastly superior to creating a brand-new black-box C/C++ function std::vector<double> dot_product(std::vector<double> A, std::vector<double> b) living in some shared lib. Thus, the AtomSpace, as a general rule, has relatively few "utility functions" and "toolbox libraries". Pretty much everything important is expressed in Atomese. This is why there are so many Atom Types (over 150 in Category:Atom Types). There are exceptions. The biggest and ugliest is the Unifier (in the URE github repo). Also, the matrix subsystem has not been wrapped yet. It wasn't clear how to do this until recently.
  • The static (compile-time) type system in OCaML and Haskell collides with the dynamic type system in the AtomSpace. Thus, any grand dreams of leveraging modern programming languages with fancy type systems are dashed. It turns out OCaML is all about compile-time type safety, while the AtomSpace is all about flexible representations using types. Programming languages with type erasure are the worst: the AtomSpace is all about remembering types, and not erasing them.
  • Network interfaces to the AtomSpace are provided by the CogServer. The cogserver provides direct (TCP/IP) access to Atomese: you can send and receive Atoms. This includes both a "plain" telnet interface and a WebSockets interface. There are multiple network shells: a scheme network REPL shell, so that you can execute arbitrary scheme at the prompt, a python REPL shell, a JSON REPL shell and a "bare s-expression" (sexpr) shell, providing high-speed Atomese. The Atomese server is faster because it reads text straight off the socket, and manipulates the AtomSpace directly; there are no interpreters or other layers to get in the way.
  • Things that were tried and did not work out, because they were too slow and bloated: RESTful API, ZeroMQ, Neo4J, Protocol Buffers, JSON. The primary problem in each case was that Atoms are tiny, and converting them from native Atomese to other formats is a giant waste of CPU time. Any kind of pre- or post-processing is much much slower than actual AtomSpace internals, so almost any manipulation or conversion will be ... slow, a bottleneck. The bare s-expression interface, running over plain TCP/IP sockets or WebSockets, is a lightning bolt speed daemon compared to these systems.
  • Database dreams that didn't work out. One might hope that the AtomSpace could be "easily" layered on top of existing, popular databases, and thus gain robustness, flexibility, network distributivity. This is easier said than done. The current Postgres backend "works" (and has been heavily used in production on huge datasets) but, in retrospect, was mis-designed. By contrast, the RocksStorageNode works great, its awesome! One reason for this is that, for the Rocks backend, the bare-naked s-expression strings are stored. These can be rapidly serialized/deserialized into AtomSpace (in-RAM) Atoms. They are ridiculously tiny: Rocks compresses these strings into a few dozen bytes, and so a database with 100 million Atoms in it has a very tolerable disk footprint of a few GBytes. The mistake with Postgres was to try to break up Atoms into individual SQL table pieces-parts. This made the serialization/deserialization of Atoms into SQL table rows complicated and slow.
  • Competition for RAM is a real danger. The size of an AtomSpace working set is limited by system RAM. Although an Atom is just dozens of bytes when written to disk, it blows up to about 1.5KBytes in RAM. This is because the in-RAM representation keeps multiple indexes: all Atoms by type, all Atoms in the incoming set, all Atoms in the outgoing set, and all of the Values attached to an Atom. These indexes add up. Loading 20 million Atoms into an active AtomSpace requires about 30GBytes of RAM, which is significant. The last thing that you want is some other database, e.g. Redis, an in-RAM database, competing for the available RAM. This would be a big F Fail.
  • UUID's are a big mistake. A UUID is an attempt to associate an integer, a unique number, a Universally Unique ID, with each Atom. This seems appealing because one imagines that it's easier to work with UUID's than with Atoms. Not so. First, one needs to have two lookup tables: UUID->Atom and Atom->UUID. Both of these eat (precious) RAM. Given that most Atoms are tiny -- hundreds of bytes, the size of these lookup tables is significant. They are bottlenecks for threading: they need to be updated in a thread-safe fashion, which means one big giant centralized lock that everyone is scrambling to get. Whoops. Finally, there is the issue of making UUID's actually unique: if Deanna on Earth and Elon on Mars both create some Atom, how are they going to assure that the UUID they pick is actually unique? The Birthday paradox destroys UUID's.
  • Access to the AtomSpace via C++ and scheme (guile) is completely thread-safe. It has been used in production systems, pumping through tens of billions of Atoms in dozens of threads, with run-times extending into weeks, without crashing. The multi-threading bottleneck is CPU-hardware dependent: The AtomSpace makes heavy use of the C++ std::shared_ptr<> template class, which provides atomic reference-counted pointers. The atomic locks are implemented in CPU cache-line hardware, and some CPU's have more than others. Although the number of threads you can create is unlimited, the number that will actually run at the same time is limited by the number of such hardware locks in the CPU. Caveat Emptor. The AtomSpace benchmark tool can measure this number. The AMD Opteron 12-core 6344 (Abu Dhabi) from 2014 seems to have four of these, and thus scales (with difficulty) up to around three++ threads. The 4-core AMD Ryzen 5 3400G with 8 hypercores (a mid-range desktop CPU) scales to eight threads! Yay! Hurrah! But we are not always so lucky: the AMD Ryzen 9 3900X 12-Core Processor (with 24 threads) barely gets up to a 5x speedup. Running 24 threads gives only a 7x speedup. Booo-hiss. That's losing. AMD removed something important in the later Ryzens that the earlier Ryzens had plenty of.
  • The python interface is probably thread-safe, but always serializes to a single thread (because this is how python is designed). No one has used python in production.
  • A GNU R interface does *NOT* exist, but looks like it could be a major win! This could vastly improve usability for the biology and genomics projects (where most of the code is in R). See opencog/matrix for details.
  • AtomSpace contents may be viewed with several different visualization tools, including the AtomSpace visualizer. They are all broken, and don't work. There is a subtle and important reason for this: visualizing large, complex, tangled networks is hard. Visualizer authors develop and test with a few hundred Atoms, and choke when given 20 million of them. Visualization requires tight integration with the query engine, so that only the relevant Atoms are displayed, and only the relevant connections are followed. But it's actually worse than this: Atoms themselves are too low-level. For example, a single Link Grammar disjunct can be written as A+ & B+ & C-, which is a single "thing" with three connectors on it. Written out as Atoms, its ten Atoms (3 ABC + 3++- + 3 groupings + 1 wrapper) Visualizing this as ten Atoms is a mistake. Visualizers need to understand what is being visualized.
  • The AtomSpace benchmark tool measures the performance of a dozen basic atom and AtomSpace operations. A diary in the benchmark directory maintains a historical log of measurements. There are lots of different components and subsystems that can be measured, and measured in lots of different ways.

Reference counting vs. Garbage collection

Atoms, as objects in C++, are accessed by means of Handles; a Handle is a smart pointer (a C++ std::shared_ptr<Atom>) to an Atom. Knowing the Handle is enough to retrieve the corresponding Atom. When all of the Handles to an atom are gone, then the Atom is deleted. The AtomSpace is a big bucket of Handles, making sure they don't get deleted while no one else is holding them.

Smart pointers implement memory management by reference counting. They are "opposite" (in the category-theoretic sense) to garbage collection. There was a version of the AtomSpace that attempted to use BoehmGC (BDWGC) for garbage collection. This worked poorly, primarily because C++ intrinsics (for example, std::vector, std::map, etc.) use so many pointers and structures internally that very long reference chains show up, often more than ten hops long; garbage collection has trouble chasing such extremely long chains. Basically, a memleak. Apparently, C++ as currently implemented is not really compatible with GC as currently implemented.

The Query Engine

A generic query engine has been implemented. Queries may be specified as hypergraphs themselves, using SatisfactionLink, BindLink and GetLink. Procedures may be triggered using ExecutionOutputLink. Low-level access to the query engine is possible by coding in C++, scheme or python.

This is a major, central component of item in the AtomSpace. The query engine is large, complex and very very general. It offers a lot of very sophisticated search features, stuff that is not available in SQL or other *QL (... graphQL, WebQL,.. etc.) systems.

Almost everyone who first encounters the AtomSpace query engine underestimates it's capabilities, features and performance. It's sometimes dismissed as "oh its just a pattern matcher". It's a whole lot more than that. Sometimes criticism is from the opposite direction: "its too complex, all we need is something simpler". Such users invariably find that their freshly-minted "simpler system" is dirt-slow and doesn't really do the one thing that was important.

The query engine scales to billions of Atoms with minor (negligable?) performance impact, and can pattern match extremely complex graphs containing many, many clauses. It's got a zillion bells & whistles of fancy things it can do (see the github page for a listing)

Conclusion: the AtomSpace query engine is a big deal. Don't misunderstand just how big a deal it is. This is major.

Frames, Layers, Multiple AtomSpaces

Multiple atom spaces within a single address space can be used, either independently, or layered, one on top of another. Layering is typically useful for creating a temporary AtomSpace that contains all the Atoms in the main AtomSpace, but will be discarded after use. Layering is also useful when the main AtomSpace is extremely large (e.g. containing biological or genomic data) and needs to be shared in a read-only fashion between multiple users. A non-shared read-write AtomSpace can be layered on top of the read-only AtomSpace; this allows Values to be edited and altered, without corrupting the data in the base AtomSpace.

Deep stacks and DAGs of AtomSpaces can be used to store inference frames (Kripke frames) and context-specific information. Each frame is a changeset, a delta of everything that has changed between the current AtomSpace, and those below. When an AtomSpace inherits from multiple contributors, the contents of that AtomSpace is the set-union of the contributing spaces. In this way, a DAG (directed acyclic graph) of AtomSpaces can be created, with both converging and diverging branches. This is useful for deep inference chains, and for life-long learning, where a chain of snapshots needs to be maintained and controlled.

For multiple network-connected machines, take a look at the CogStorageNode and the ProxyNode.

StorageNodes, Persistence

The StorageNode subsystem provides a way to store Atoms to disk, to SQL, and to pass them around on the network. The ProxyNode provides a way for routing Atoms on the network (or between StorageNodes). This includes bulk transformations and processing, as the Atoms are being moved around.

As a general rule, storage to disk is pretty fast; network I/O is slower, and very noticeably slow, as compared to direct in-RAM access to Atoms.

The goal of having multiple servers with multiple AtomSpaces is scalability and performance: to reap the miracle of distributed processing.

The actual, achievable scalability depends both on the scalability of the back-end, and locality of the algorithm. The usual laws of scaling apply: "embarrassingly parallel" algorithms scale easily, strongly connected networks do not. It appears that natural language data follows a (square-root) Zipf distribution; the resulting networks are scale-free or have "simple" power-law scalaing. This suggests that algorithms processing natural language data should scale reasonably well, as long as they don't modify Values attached to the highly connected Atoms.

Working with AtomSpaces

There are dozens of detailed examples to be found in the /examples/atomspace directory. Below is a brief glimpse at the API for creating and working with AtomSpaces themselves (and not Atoms).

Scheme

If you are using the scheme bindings:

  • ,apropos atomspace -- list all atomspace-related functions available to the scheme programmer. These are:
  • cog-atomspace -- get the current atomspace used by scheme
  • cog-set-atomspace! ATOMSPACE -- set the current atomspace
  • cog-new-atomspace -- create a new atomspace
  • cog-atomspace? OBJ -- return true if OBJ is an atomspace, else return false.
  • cog-atomspace-env ATOMSPACE -- get the parent of the ATOMSPACE
  • cog-atomspace-stack -- A stack of atomspaces
  • cog-push-atomspace -- Push the current atomspace onto the stack
  • cog-pop-atomspace -- Pop the stack

You can get additional documentation for each of the above by typing ,describe func-name. So, for example ,describe cog-new-atomspace prints the documentation for the cog-new-atomspace function.

You can share an atomspace with python by using this:

  • python-call-with-as FUNC ATOMSPACE -- Call the python function FUNC, passing the ATOMSPACE as an argument.

Say ,describe python-call-with-as to show the documentation for this function.

The above is provided by the (opencog python) module. For example:

   $ guile
   guile> (use-modules (opencog) (opencog python))
   guile> (python-eval "
   from opencog.atomspace import AtomSpace, TruthValue
   from opencog.atomspace import types

   def foo(asp):
         TV = TruthValue(0.42, 0.69)
         asp.add_node(types.ConceptNode, 'Apple', TV)
   ")
   guile> (python-call-with-as "foo" (cog-atomspace))
   guile> (cog-node 'ConceptNode "Apple")

The last line will confirm that the actual desired concept 'Apple' was created, and that it has the correct TV, as specified in the python snippet.

Python

If you are in python, you have these functions available:

  • AtomSpace() -- create a new atomspace
  • scheme_eval_as(sexpr) -- evaluate scheme sexpr, returning an atomspace.
  • set_type_ctor_atomspace(ATOMSPACE) -- set the atomspace that the python type constructor uses.

An example usage of some of the above:

  from opencog.atomspace import AtomSpace, TruthValue
  from opencog.atomspace import types
  from opencog.scheme_wrapper import scheme_eval_h, scheme_eval_as
 
  # Get the atomspace that guile is using
  asp = scheme_eval_as('(cog-atomspace)') 

  # Add a node to the atomspace
  TV = TruthValue(0.42, 0.69)
  a1 = asp.add_node(types.ConceptNode, 'Apple', TV)

  # Get the same atom, from guile's point of view:
  a2 = scheme_eval_h(asp, '(cog-node \'ConceptNode "Apple")')

  # Print true if they are the same atom.  They should be!
  print "Are they equal?", a1 == a2

The Python APIs are barely maintained; many of the latest and greatest features and functions are missing. There aren't many Python users, there hasn't been a push to keep it up-to-date.