GSOC 2008 RelEx Web Crawler



THIS PAGE IS OBSOLETE

This is a Google Summer of Code 2008 proposal; any and all feedback is greatly appreciated!

About RelEx Web Crawler

One of the most important capabilities of future artificial intelligences will be the ability to use the internet efficiently as a source of knowledge for natural language conversation. This project aims to create a program which uses the internet to automatically build a database of RelEx output data, which can then serve as a knowledge base for an intelligence to analyze and draw upon.

To do so, I would like to modify either the Grub Distributed Web Crawler (as suggested on the ideas page) or CMU's WebSphinx to crawl a website and automatically extract and analyze the usable content. This data would then be stored as one file per crawled URL, in RelXML or OpenCogXML format, which could then be reprocessed, analyzed, or queried on demand.

After receiving feedback about my initial proposal and then doing some research about hypergraphs and their implementation in HypergraphDB, it seems logical that the RelEx output of each page should be automatically added into a single HGDB HyperGraph, where each XML element becomes a new Atom, unless such an Atom already exists, in which case only the appropriate connecting HGLink Atoms would be created. Once this HyperGraph is created, HyperGraphDB's built-in HGQuery operations can be used to find subsets of the hypergraph, easily extracting relationships about the entire span of crawled web pages.

This program is not particularly technically demanding, which means it should be in a very stable and usable form by the end of the project. Hopefully, this project will be a great tool for researchers looking to refine their rankings and framing rules, as well as for developers using the OpenCog platform to create intelligent agents which access online knowledge-bases.

Purpose

Our modification of Grub will be quite different from its parent in that it will not be used to crawl the whole internet, but rather a specific subset of it. It will also generate a very different set of output data, in an OpenCog format rather than GrubXML. This OpenCog-ready database could then be searched and used by other functions of an intelligent application. Consider the following example.

Example 1: Wikipedia Librarian

A Wikipedian decides he wants to use the OpenCog framework to create a Wikipedia Librarian agent (à la Snow Crash). His first step must be to point the RelEx Web Crawler at Wikipedia (or perhaps an even more specific subset, say, Wikipedia:English:British History). The crawler will retrieve all of those Wikipedia articles and process their content into an OpenCog XML format. This is then Atomized and added to a HyperGraphDB hypergraph. This database will later be queried (either preprocessed or on-the-fly, depending on the implementation) by the Librarian to answer different questions. For example, when asked "When was the Battle of Hastings?", the Librarian will use this database to return "The battle took place on 14 October 1066," an answer generated by querying the hypergraph for the subset which contains the Atomic structures connecting {day, month, year, event, Battle of Hastings}. This is not particularly intelligent, as all of that information is contained on a single page.

However, given a more difficult question such as "What is another event that occurred in the same year as the Battle of Hastings?", a query for a subset including {{year, Battle of Hastings}, {event}} could be turned into an answer: "The 1066 Granada Massacre." This displays greater intelligence, as that information can only be gleaned by processing multiple web pages. Although this example may seem rather trivial, it is easy to see how an intelligence queried in this way could make extremely complex associations and abstractions simply by using HyperGraphDB's preexisting functions, which is why combining a RelEx Web Crawler with HyperGraphDB functionality is highly relevant to OpenCog's stated goals.
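
To make the querying step concrete, here is a minimal sketch, in Java, of how the Librarian might walk the crawled hypergraph using HyperGraphDB's condition API (hg.eq, hg.incident). The database path, the use of plain String atoms, and the link layout are assumptions made purely for illustration; the actual atom design is discussed under Implementation below.

    import java.util.List;

    import org.hypergraphdb.HGEnvironment;
    import org.hypergraphdb.HGHandle;
    import org.hypergraphdb.HGLink;
    import org.hypergraphdb.HyperGraph;
    import org.hypergraphdb.HGQuery.hg;

    public class LibrarianQuerySketch {
        public static void main(String[] args) {
            // Open the database produced by the crawler (the path is hypothetical).
            HyperGraph graph = HGEnvironment.get("/tmp/relex-crawl-db");
            try {
                // Locate the atom standing for the event. In this sketch atoms
                // are plain Strings; the real design would use typed RelEx atoms.
                HGHandle hastings = hg.findOne(graph, hg.eq("Battle of Hastings"));

                // Every link incident to that atom ties it to other atoms pulled
                // from the crawled pages (a year, a place, participants, ...).
                List<HGHandle> links = hg.findAll(graph, hg.incident(hastings));
                for (HGHandle linkHandle : links) {
                    HGLink link = graph.get(linkHandle);
                    for (int i = 0; i < link.getArity(); i++) {
                        Object atom = graph.get(link.getTargetAt(i));
                        System.out.println("connected atom: " + atom);
                    }
                }
                // Answering "another event in the same year" would repeat this
                // walk from the year atom found here, keeping only event atoms.
            } finally {
                graph.close();
            }
        }
    }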

Although Wikipedia is the most obvious example, these types of intelligent agents could be built around any large online knowledge base, according to a client's needs (for instance, one trained on the MSDN Library could be very helpful for software developers).

Critique

This proposal seems unrealistic for several reasons, mostly because it combines elements that are much too easy with elements that are much too hard, and treats them equally. So:

  • Web crawling and parsing. There are dozens of web crawlers out there. The RelEx parser works fine. Combining the two requires a few afternoons of shell-script hacking, a few afternoons more if you want to get fancy. But overall, this is easy.
  • Question answering (example 1). Simply combining a search/query system with parsed output will not result in a question-answering system. At a bare minimum, one must identify the structure of a question and match it to the structure of possible answers; it gets harder from there. Automatic question answering has been a topic of academic study for 50 years, producing countless PhD theses, many interesting insights, and very little in the way of usable systems (not counting Google). Creating such a thing, in the space of a month or two, on an unfamiliar code base, by an inexperienced coder, is very unlikely. (See CogBot for the current question-answering status in OpenCog.)
  • Converting RelEx output so that it can be pushed into a HyperGraphDB is a reasonable project that could be accomplished in one summer. However, the utility of this is not clear. It could be a stepping stone to other, more interesting systems, but the OpenCog project, as a whole, is not ready to take steps in those directions; simply put, right now the OpenCog project has no use for RelEx output stuffed into a HyperGraphDB.

Thus, the proposal, while well-meaning and enthusiastic, is not realistic. It doesn't help that the proposal is misnamed: the world really doesn't need another web crawler, yet that is not really what this proposal is about.

Example 2: Statistical dataset

A more short-term application of building these crawled datasets will be statistical analysis. Since RelEx can generate numerous different parses, a large dataset of parses drawn from specific domains (such as literary sources, medical literature, technical journals, etc.) could be analyzed to produce better statistical parse rankings and framing rules for those situations.
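
As a rough illustration of the kind of analysis this makes possible, the sketch below tallies how often each element name occurs across a directory of per-URL RelEx XML files, using only the JDK's DOM parser. The directory name and the decision to count raw element names (rather than particular relation or frame elements) are assumptions made just for the sketch.

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class ParseStatsSketch {
        public static void main(String[] args) throws Exception {
            // Directory of per-URL RelEx XML files written by the crawler
            // (the path and one-file-per-URL layout are assumptions).
            File dir = new File("relex-output");
            Map<String, Integer> tagCounts = new HashMap<String, Integer>();

            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            for (File xml : dir.listFiles()) {
                Document doc = builder.parse(xml);
                // Tally every element name; a real analysis would count
                // specific relation and frame elements instead.
                NodeList all = doc.getElementsByTagName("*");
                for (int i = 0; i < all.getLength(); i++) {
                    String tag = all.item(i).getNodeName();
                    Integer n = tagCounts.get(tag);
                    tagCounts.put(tag, n == null ? 1 : n + 1);
                }
            }
            for (Map.Entry<String, Integer> e : tagCounts.entrySet())
                System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }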

Implementation

Were the program distributed, two parts would have to be implemented: the client and the server maintaining the database. However, for the reasons given below, only the client is necessary for this implementation. As a result, database management duties shift to the client, but this is probably the logical thing to do anyway.

Client

Unfortunately, the current implementation of Grub is cripplingly meager. Although the revived project claims to be open source, the new code is not yet available, and the older codebase is undesirable, as it is written in C++ and dates back to 2001.

Instead, I propose building the crawler on WebSphinx. This sacrifices Grub's distributed features, but that won't be a problem, since the RelEx Web Crawler as currently designed is suited to web regions on the order of thousands of pages, not billions. Perhaps, in the very long term, a project may wish to use the 'entire' internet as a knowledge database, which would require distributed crawling, but hopefully by then JGrub will be fully developed (and there is no reason why this WebSphinx-based implementation couldn't also be extended to a distributed platform).
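
Until the WebSphinx API details are pinned down, the JDK-only sketch below shows the shape of the crawl loop this design calls for: a breadth-first traversal that stays within a single URL prefix, fetches each page, and hands its content off for RelEx processing. The seed URL, the page limit, and the naive absolute-href regex are placeholders for what WebSphinx would provide out of the box.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CrawlSketch {
        // Only follow links under this prefix (an example scope, not a requirement).
        private static final String SCOPE = "http://en.wikipedia.org/wiki/";
        // Naive: only absolute, double-quoted links; relative links are ignored.
        private static final Pattern HREF = Pattern.compile("href=\"(http[^\"#]+)\"");

        public static void main(String[] args) throws Exception {
            Queue<String> frontier = new LinkedList<String>();
            Set<String> seen = new HashSet<String>();
            frontier.add(SCOPE + "Battle_of_Hastings");

            while (!frontier.isEmpty() && seen.size() < 1000) {
                String url = frontier.poll();
                if (!seen.add(url))
                    continue;                   // already crawled
                String html = fetch(url);
                // Here the page text would be stripped of markup, run through
                // RelEx, and written out as one RelXML/OpenCog XML file per URL.
                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    String link = m.group(1);
                    if (link.startsWith(SCOPE)) // stay inside the chosen region
                        frontier.add(link);
                }
            }
        }

        private static String fetch(String url) throws Exception {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (String line; (line = in.readLine()) != null; )
                sb.append(line).append('\n');
            in.close();
            return sb.toString();
        }
    }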

Hypergraph Management

After an XML file is generated for a page, it will be Atomized and inserted into a HyperGraphDB hypergraph. For each XML element, the hypergraph is checked to see whether an Atom already exists for that element; if not, one is created. Then, HGLink Atoms are created for the relationships that the XML element has to other Atoms in the graph. This is the area in which I will need the most mentoring: although I can infer the natural relationships between elements, I don't yet have any experience designing hypergraphs, and I will need somebody to talk to about how these relationships should arise and how to store them in the hypergraph in the most usable way possible.
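
Below is a minimal sketch of that insert step, assuming (in the absence of a settled atom design) that each extracted value becomes a plain String atom and that relationships are recorded as labelled HGValueLinks; hg.assertAtom provides the create-unless-it-already-exists behaviour described above.

    import org.hypergraphdb.HGEnvironment;
    import org.hypergraphdb.HGHandle;
    import org.hypergraphdb.HGValueLink;
    import org.hypergraphdb.HyperGraph;
    import org.hypergraphdb.HGQuery.hg;

    public class AtomizeSketch {
        public static void main(String[] args) {
            // Open (or create) the crawl database; the path is hypothetical.
            HyperGraph graph = HGEnvironment.get("/tmp/relex-crawl-db");
            try {
                // Pretend these two values were pulled out of one page's RelEx
                // XML; real code would walk the document element by element.
                // hg.assertAtom only adds a value if an equal atom is not
                // already stored, returning its handle either way.
                HGHandle battle = hg.assertAtom(graph, "Battle of Hastings");
                HGHandle year = hg.assertAtom(graph, "1066");

                // Record the relationship as a labelled link between the two
                // atoms. A fuller version would first check whether an
                // identical link already exists before adding another one.
                graph.add(new HGValueLink("occurred-in-year", battle, year));
            } finally {
                graph.close();
            }
        }
    }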

After the crawl is completed, the program will have created a directory full of OpenCog/RelXML files, one for each URL crawled, and a portable HyperGraphDB file, which can be opened and processed by other applications for intelligent purposes.

Schedule

  • April 14 - May 26: Get better acquainted with the community and the software. During this period I'll familiarize myself with the code base of both RelEx and WebSphinx, and determine exactly how to go about implementing RelEx Web Crawler.
  • May 26: Programming begins.
  • Mid-Late June: By this point I should be able to create a crawler, crawl a selection of pages, strip their content from their HTML and begin to process it with RelEx.
  • Mid July: By this point, the crawler should be able to effectively crawl thousands of pages, strip and process their content, and store the output XML by URL in the desired format. The crawler should be able to create an empty hypergraph.
  • Early August: The crawler should be able to take the processed XML of each web page and feed it into the hypergraph, creating atoms for each new element and the HGLinks which that element has to other atoms.
  • August 11: All features that will be present in the final version should be in place by this point.
  • August 6 - 18: Testing and bug fixing.

Summary

Summary articles are here:

The code is also here, on Launchpad, but the articles above will give you a better understanding of how to use it.

About Me

My name is Rich Jones and I am a rising junior at Boston University. I am working on a joint BA/MA in Cognitive and Neural Systems, which I hope to receive in 2010. I have an academic background in programming from high school and college, and have done some personal projects, such as an application which sorted music into RIAA and non-RIAA categories for safer filesharing, and a Facebook application to connect people of the same genetic haplogroup. I have also done some web and database coding professionally for CVM Engineering. Lately, my pet project has been a pseudonymous and encrypted variant of the BitTorrent protocol.

I am also a huge advocate of digital rights and digital freedom, which I write about on my website, TheNewFreedom. I also founded and run the Boston University Free Culture chapter.

My complete resume is available here.

Why OpenCog

I'm fascinated with consciousness and cognition, and plan on spending my life playing with digital and augmented cognition. I'm also a massive proponent of openness in research and culture, so this institute is exactly where I would love to devote my energy and skills.

I've become really excited about this project just by writing the proposal! I look forward to working with you.