Pattern miner

Introduction

The Pattern Miner is a hypergraph pattern miner: it mines frequent and interesting patterns from the Atomspace.

What is a pattern

A pattern is a connected set of links and nodes, some of which are variables, that occurs repeatedly in the Atomspace. For example, the pattern "Americans drink Coke":

(EvaluationLink
   (PredicateNode "drink") 
   (ListLink 
      (VariableNode "$var_1") 
      (ConceptNode "Coke")))
(InheritanceLink
   (VariableNode "$var_1") 
   (ConceptNode "American"))

If many instances that fit this pattern can be found in the Atomspace, then it is a frequent pattern. Such instances look like this:

(EvaluationLink
   (PredicateNode "drink")
   (ListLink
      (ConceptNode "Ben")
      (ConceptNode "Coke")))
(InheritanceLink
   (ConceptNode "Ben")
   (ConceptNode "American"))

(EvaluationLink
   (PredicateNode "drink")
   (ListLink
      (ConceptNode "Andrew")
      (ConceptNode "Coke")))
(InheritanceLink
   (ConceptNode "Andrew")
   (ConceptNode "American"))

... and so on
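The idea of counting instances of a pattern can be sketched in a few lines of Python. This is only an illustration, not the miner's implementation: atoms are modelled as plain nested tuples, the helper names (`unify`, `find_instances`, `drinks`, `is_a`) are hypothetical, and the atom about "Carol" is made up to show a non-matching case.

```python
def is_node(atom):
    # Nodes are (type, name); links are (type, child, child, ...).
    return len(atom) == 2 and isinstance(atom[1], str)


def unify(pattern, atom, bindings):
    """Return bindings extended so that `pattern` fits `atom`, or None."""
    if is_node(pattern) and pattern[0] == "VariableNode":
        name = pattern[1]
        if name in bindings:
            return bindings if bindings[name] == atom else None
        return {**bindings, name: atom}
    if is_node(pattern) or is_node(atom):
        return bindings if pattern == atom else None
    if pattern[0] != atom[0] or len(pattern) != len(atom):
        return None
    for p_child, a_child in zip(pattern[1:], atom[1:]):
        bindings = unify(p_child, a_child, bindings)
        if bindings is None:
            return None
    return bindings


def find_instances(pattern_links, atomspace, bindings=None):
    """Return every variable binding under which all pattern links occur."""
    bindings = {} if bindings is None else bindings
    if not pattern_links:
        return [bindings]
    first, rest = pattern_links[0], pattern_links[1:]
    results = []
    for atom in atomspace:
        extended = unify(first, atom, bindings)
        if extended is not None:
            results.extend(find_instances(rest, atomspace, extended))
    return results


def concept(name):
    return ("ConceptNode", name)


def drinks(who, what):
    return ("EvaluationLink", ("PredicateNode", "drink"),
            ("ListLink", concept(who), concept(what)))


def is_a(who, what):
    return ("InheritanceLink", concept(who), concept(what))


# A toy Atomspace: Carol drinks Coke but is not known to be American.
atomspace = [
    drinks("Ben", "Coke"), is_a("Ben", "American"),
    drinks("Andrew", "Coke"), is_a("Andrew", "American"),
    drinks("Carol", "Coke"),
]

var = ("VariableNode", "$var_1")
pattern = [
    ("EvaluationLink", ("PredicateNode", "drink"),
     ("ListLink", var, concept("Coke"))),
    ("InheritanceLink", var, concept("American")),
]

instances = find_instances(pattern, atomspace)
print(len(instances))  # frequency of the pattern in the toy Atomspace
```

Here the frequency is the number of distinct bindings of $var_1 for which both links are present, so Carol does not count.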

Pattern Gram

  • A 1-gram pattern contains 1 link.
  • A 2-gram pattern contains 2 links.
  • An n-gram pattern contains n links.
  • The gram count thus indicates the size of a pattern.

Pattern growing

First, all the 1-gram patterns are extracted from the Atomspace. For every link, each combination of its nodes is considered (which nodes become variables and which stay constant), so several patterns can usually be extracted from a single link. For example:

[Figure: Extract frist gram patterns.jpg]
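The enumeration of 1-gram patterns from a single link can be sketched as follows. This is an illustrative sketch under stated assumptions, not the miner's code: atoms are nested tuples, every node (including the first one) may become a variable, at least one node must become a variable, and the function names are hypothetical.

```python
from itertools import combinations


def is_node(atom):
    # Nodes are (type, name); links are (type, child, child, ...).
    return len(atom) == 2 and isinstance(atom[1], str)


def collect_nodes(atom):
    """Return the distinct nodes inside an atom, in first-occurrence order."""
    if is_node(atom):
        return [atom]
    nodes = []
    for child in atom[1:]:
        for node in collect_nodes(child):
            if node not in nodes:
                nodes.append(node)
    return nodes


def substitute(atom, mapping):
    """Replace every node found in `mapping` by its variable."""
    if is_node(atom):
        return mapping.get(atom, atom)
    return (atom[0],) + tuple(substitute(child, mapping) for child in atom[1:])


def one_gram_patterns(link):
    """Enumerate the patterns obtained by turning some nodes into variables."""
    nodes = collect_nodes(link)
    patterns = []
    for size in range(1, len(nodes) + 1):
        for chosen in combinations(nodes, size):
            mapping = {
                node: ("VariableNode", f"$var_{index}")
                for index, node in enumerate(chosen, start=1)
            }
            patterns.append(substitute(link, mapping))
    return patterns


link = ("InheritanceLink", ("ConceptNode", "Ben"), ("ConceptNode", "American"))
for pattern in one_gram_patterns(link):
    print(pattern)
```

A link with k distinct nodes yields 2^k - 1 candidate patterns under these assumptions, which is why a single link usually produces several patterns.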

Then a pattern growing algorithm extends each n-gram pattern to (n+1)-gram patterns by adding other subhypergraphs that are connected to it.

[Figure: Pattern growing.png]
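One growing step can be sketched like this. It is a simplification and not the miner's algorithm: it works on concrete sets of links rather than on patterns with variables, it treats two links as connected whenever they share any node (an assumption), and the names `nodes_of` and `grow` are hypothetical.

```python
def is_node(atom):
    # Nodes are (type, name); links are (type, child, child, ...).
    return len(atom) == 2 and isinstance(atom[1], str)


def nodes_of(atom):
    """Return the set of nodes that appear anywhere inside an atom."""
    if is_node(atom):
        return {atom}
    found = set()
    for child in atom[1:]:
        found |= nodes_of(child)
    return found


def grow(link_set, atomspace):
    """Extend an n-link set to all (n+1)-link sets that stay connected."""
    covered = set()
    for link in link_set:
        covered |= nodes_of(link)
    grown = set()
    for candidate in atomspace:
        if candidate in link_set:
            continue
        if nodes_of(candidate) & covered:  # shares at least one node
            grown.add(frozenset(link_set | {candidate}))
    return grown


def is_a(who, what):
    return ("InheritanceLink", ("ConceptNode", who), ("ConceptNode", what))


atomspace = [
    is_a("Ben", "human"),
    is_a("Ben", "man"),
    is_a("Ben", "soda drinker"),
    is_a("Alice", "woman"),
]

one_gram = frozenset({is_a("Ben", "man")})
two_grams = grow(one_gram, atomspace)
three_grams = set()
for two_gram in two_grams:
    three_grams |= grow(two_gram, atomspace)

print(len(two_grams), len(three_grams))
```

Using frozensets means that the same 3-link set reached along two different growth paths is stored only once.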

Interesting Pattern Mining

Most frequent patterns are very boring. We use an information-theoretic method to find the interesting ones. The details of our interestingness measure are given on the pages Interesting Pattern Mining and Measuring Surprisingness.

Tutorial: running the Pattern Miner in OpenCog

Where to find the source code

The pattern mining core algorithm code is in:

<OPENCOG_REPO>/opencog/opencog/learning/PatternMiner/

Both a non-distributed Pattern Miner (running on a single machine) and a distributed Pattern Miner are implemented. The non-distributed agent can be found under the above directory:

TestPatternMinerAgent.h
TestPatternMinerAgent.cc

The distributed server and client agents can be found:

DistributedPatternMinerServer.h
DistributedPatternMinerServer.cc
DistributedPatternMinerClient.h
DistributedPatternMinerClient.cc

Install dependent libraries

  • Install boost

You may need to install the full Boost distribution, even if part of Boost is already installed: both cpprest and the pattern miner depend on many different Boost sub-libraries.

sudo apt-get install libboost-all-dev
  • Install cpprest

See the installation instructions at https://github.com/Microsoft/cpprestsdk/wiki/How-to-build-for-Linux. Remember to run make install after make.

Steps to run a non-distributed pattern miner test

  • Test corpus file:

Go to <OPENCOG_REPO>/opencog/learning/PatternMiner/ and make sure the file ugly_male_soda-drinker_corpus.scm is in this folder.

  • Compile OpenCog.
  • Start the Cogserver (see Starting_cogserver to learn how). If the Cogserver starts successfully, you should see the following output:
username@xxxxx:~/opencog/build$ ./opencog/cogserver/server/cogserver -c ../lib/opencog_patternminer.conf 
...
...
Listening on port 17001

Connect to the Cogserver with rlwrap telnet. If the connection succeeds, you should see the following output:

username@xxxxx:~/opencog/build$ rlwrap telnet localhost 17001
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
opencog> 
  • Load the example knowledge base into the AtomSpace:
opencog> scm
guile> (load "../opencog/learning/PatternMiner/ugly_male_soda-drinker_corpus.scm")
  • Run pattern miner:

Press Ctrl + D or type "." to go back to the opencog shell, then type:

opencog> loadmodule opencog/learning/PatternMiner/libPatternMinerAgent.so

If it is loaded successfully, you should be able to see the following output:

done
opencog> 
  • Then go to the Cogserver terminal; you should see pattern mining progress messages being printed:
Debug: PatternMining start! Max gram = x, mode = Depth_First
Start thread 0 from 0 to xxxxxxxxx

If the test corpus is very small, it will finish immediately. Otherwise, wait until it finishes. When it is finished, you will see the following messages:

100% completed in Thread 0.
Finished mining 1~4 gram patterns.

processedLinkNum = 60
Debug: PatternMiner:  done frequent pattern mining for 1 to 4gram patterns!
gram = 1: 25 patterns found! 
Debug: PatternMiner: writing  (gram = 1) frequent patterns to file FrequentPatterns_1gram.scm

gram = 2: 9 patterns found! 
Debug: PatternMiner: writing  (gram = 2) frequent patterns to file FrequentPatterns_2gram.scm

gram = 3: 6 patterns found! 
Debug: PatternMiner: writing  (gram = 3) frequent patterns to file FrequentPatterns_3gram.scm

gram = 4: 1 patterns found! 
Debug: PatternMiner: writing  (gram = 4) frequent patterns to file FrequentPatterns_4gram.scm


Calculating interestingness for 2 gram patterns by evaluating surprisingness
100% completed.Debug: PatternMiner:  done (gram = 2) interestingness evaluation!9 patterns found! Outputting to file ... 
Debug: PatternMiner: writing (gram = 2) interesting patterns to file SurprisingnessI_2gram.scm

Debug: PatternMiner: writing (gram = 2) interesting patterns to file SurprisingnessII_2gram.scm

Debug: PatternMiner: writing  (gram = 2) pattern statics to csv file PatternStatics_2gram.csv
surprisingness_II_threshold for 2 gram = 0.9
Debug: PatternMiner: writing (gram = 2) final top patterns to file FinalTopPatterns_2gram.scm


Calculating interestingness for 3 gram patterns by evaluating surprisingness
100% completed.Debug: PatternMiner:  done (gram = 3) interestingness evaluation!6 patterns found! Outputting to file ... 
Debug: PatternMiner: writing (gram = 3) interesting patterns to file SurprisingnessI_3gram.scm

Debug: PatternMiner: writing (gram = 3) interesting patterns to file SurprisingnessII_3gram.scm

Debug: PatternMiner: writing  (gram = 3) pattern statics to csv file PatternStatics_3gram.csv
surprisingness_II_threshold for 3 gram = 0.7
Debug: PatternMiner: writing (gram = 3) final top patterns to file FinalTopPatterns_3gram.scm


Calculating interestingness for 4 gram patterns by evaluating surprisingness
100% completed.Debug: PatternMiner:  done (gram = 4) interestingness evaluation!1 patterns found! Outputting to file ... 
Debug: PatternMiner: writing (gram = 4) interesting patterns to file SurprisingnessI_4gram.scm
Pattern Mining Finish one round! Total time: 0 seconds. 
1 threads used. 
Corpus size: 60 links in total. 
Pattern Miner application quited!
  • Check the results: go to <OPENCOG_REPO>/opencog/build/, where some result files should have been generated:
FrequentPatterns_xgram.scm
SurprisingnessI_xgram.scm
SurprisingnessII_xgram.scm
FinalTopPatterns_xgram.scm

Results Samples

You can open the corpus file ugly_male_soda-drinker_corpus.scm. It is a tiny made-up corpus containing 10 men and 10 women; some are ugly, and some drink soda. Let's look at the results, taking FrequentPatterns_3gram.scm as an example. One pattern in this result file is:

Pattern: Frequency = 10
(InheritanceLink )
  (VariableNode $var_1)
  (ConceptNode human)

(InheritanceLink )
  (VariableNode $var_1)
  (ConceptNode man)

(InheritanceLink )
  (VariableNode $var_1)
  (ConceptNode soda drinker)

Only patterns with high enough frequency and high enough Surprisingness I and II are picked as the final patterns. Open the FinalTopPatterns_3gram.scm file and you will see that only one pattern is considered interesting in this test. It means "Men who drink soda are usually ugly":

Pattern: Frequency = 10, SurprisingnessI = 11, SurprisingnessII = 0.85
(InheritanceLink )
  (VariableNode $var_1)
  (ConceptNode man)

(InheritanceLink )
  (VariableNode $var_1)
  (ConceptNode soda drinker)

(InheritanceLink )
  (VariableNode $var_1)
  (ConceptNode ugly)
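The selection of final patterns can be sketched as a small post-processing script over the result files. This is a sketch only: it assumes the header lines look exactly like the "Pattern: ..." lines quoted above, the function names and the `min_frequency` value are hypothetical, and the 0.7 threshold is simply the surprisingness_II_threshold printed for 3-gram patterns in the log above.

```python
import re

# Matches header lines such as
# "Pattern: Frequency = 10, SurprisingnessI = 11, SurprisingnessII = 0.85".
HEADER = re.compile(r"^Pattern:\s*(.+)$")


def parse_headers(text):
    """Collect the numeric fields of every 'Pattern: key = value, ...' line."""
    records = []
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match is None:
            continue
        record = {}
        for field in match.group(1).split(","):
            key, _, value = field.partition("=")
            if key.strip() and value.strip():
                record[key.strip()] = float(value)
        records.append(record)
    return records


def select_top(records, min_frequency, min_surprisingness_ii):
    """Keep records that reach both thresholds; missing fields count as 0."""
    return [
        record for record in records
        if record.get("Frequency", 0) >= min_frequency
        and record.get("SurprisingnessII", 0) >= min_surprisingness_ii
    ]


sample = """\
Pattern: Frequency = 10
(InheritanceLink )
  (VariableNode $var_1)
  (ConceptNode human)

Pattern: Frequency = 10, SurprisingnessI = 11, SurprisingnessII = 0.85
(InheritanceLink )
  (VariableNode $var_1)
  (ConceptNode man)
"""

records = parse_headers(sample)
top = select_top(records, min_frequency=5, min_surprisingness_ii=0.7)
print(top)
```

To try it on a real run, read one of the generated files (for example FinalTopPatterns_3gram.scm) into a string and pass it to `parse_headers` instead of `sample`.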

Steps to run a distributed pattern miner test

  • Prepare machines

You should have at least two machines: one with a lot of memory (such as 32 GB) should be the central server, and any small machine (such as 2 GB of memory) can run a client. Of course, if you just want to run a small test, you can still use the same corpus ugly_male_soda-drinker_corpus.scm, so that you don't need a big server machine. All your machines should be on the same LAN, so that they can communicate with each other.

  • Compile OpenCog on all your machines.

If cpprest is not installed, running cmake for OpenCog will output "cpprest not found. Pattern Miner will not be built." If cpprest is installed, cmake will output "cpprest is found."

  • Test corpus file:

The test database is from http://wiki.dbpedia.org/, which extracts the infobox information from Wikipedia pages and contains around half a million facts. Go to https://github.com/opencog/test-datasets/tree/master/pattern%20miner/opencog/learning/PatternMiner/, download the file dbpeida_data.zip and unzip it. It contains 4 files:

dpedia.scm  - the whole test database
dpedia_part_1.scm - the first part of dpedia.scm 
dpedia_part_2.scm - the second part of dpedia.scm 
dpedia_part_3.scm - the third part of dpedia.scm 

Copy dpedia_part_1.scm to /opencog/learning/PatternMiner/ on your first client machine, dpedia_part_2.scm to your second client machine, and dpedia_part_3.scm to your third client machine. If you only have one client machine, just copy dpedia.scm to it. A mining client does not use a lot of memory, as long as it can load the corpus. You don't need to put any corpus file on the server machine.

  • Config file in the server

Open the /opencog/build/lib/opencog_patternminer.conf file.

    • Comment out the line "learning/PatternMiner/ugly_male_soda-drinker_corpus.scm"; you don't need to load any corpus on the server.
    • Find "Pattern_Max_Gram" and set it to 3, because 2-gram patterns are too shallow, while 4-gram patterns are too many to mine and you will probably run out of memory.
Pattern_Max_Gram = 3
    • Find "PMCentralServerIP" and set it to your server's IP address; the loopback address "127.0.0.1" won't work. For example:
PMCentralServerIP = "192.163.0.110"
  • Config file in your clients

Open the /opencog/build/lib/opencog_patternminer.conf file on each of your client machines.

    • Comment out the line "learning/PatternMiner/ugly_male_soda-drinker_corpus.scm", unless you still use this corpus as the test corpus.
    • Uncomment whichever of the following lines names the corpus you want to run on this client:
#                        learning/PatternMiner/dpedia.scm
#                        learning/PatternMiner/dpedia_part_1.scm
#                        learning/PatternMiner/dpedia_part_2.scm
#                        learning/PatternMiner/dpedia_part_3.scm
    • Find "Pattern_Max_Gram" and set it to 3, because 2-gram patterns are too shallow, while 4-gram patterns are too many to mine and you will probably run out of memory. This must match the setting in your server config file.
Pattern_Max_Gram = 3
    • Find "PMCentralServerIP" and set it to the same IP address as you have set in your server config file.
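Since the server and client settings must agree, a quick consistency check before starting can save a failed run. The sketch below is not part of OpenCog: it assumes the config file uses "KEY = VALUE" lines with "#" comments, as in the fragments shown above, and the function names are hypothetical.

```python
def read_settings(conf_text):
    """Parse 'KEY = VALUE' lines, ignoring '#' comments and other lines."""
    settings = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def mismatched_keys(server_text, client_text,
                    keys=("Pattern_Max_Gram", "PMCentralServerIP")):
    """Return the keys whose values differ between server and client."""
    server = read_settings(server_text)
    client = read_settings(client_text)
    return [key for key in keys if server.get(key) != client.get(key)]


server_conf = '''
Pattern_Max_Gram = 3
PMCentralServerIP = "192.163.0.110"
'''

# A client that was accidentally left at a different gram setting.
client_conf = '''
#                        learning/PatternMiner/dpedia.scm
Pattern_Max_Gram = 4
PMCentralServerIP = "192.163.0.110"
'''

print(mismatched_keys(server_conf, client_conf))
```

In practice the two strings would come from reading opencog_patternminer.conf on the server and on each client; an empty list means the two settings agree.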
  • Start the pattern miner central server

On your server machine: as with the non-distributed miner, first start the Cogserver and connect to it with rlwrap telnet. When it outputs:

username@xxxxx:~/opencog/build$ rlwrap telnet localhost 17001
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
opencog> 

Run pattern miner central server agent:

opencog> loadmodule opencog/learning/PatternMiner/libDistributedPatternMinerServer.so

If it is loaded successfully, you should be able to see the following output:

done
opencog> 

It should also output this message in the Cogserver terminal:

Pattern Miner central server started! x threads using to parse patterns.
  • Start a pattern miner client

On one of your client machines: as with the non-distributed miner, first start the Cogserver and connect to it with rlwrap telnet. When it outputs:

username@xxxxx:~/opencog/build$ rlwrap telnet localhost 17001
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
opencog> 

Run pattern miner client agent:

opencog> loadmodule opencog/learning/PatternMiner/libDistributedPatternMinerClient.so

If it is loaded successfully, you should be able to see the following output:

done
opencog> 

It should also output a message like this in the Cogserver terminal:

Current client UID = bc4c1a62-9f54-4859-bcb8-94f19fd19332
Registering to the central server: 192.168.0.3

Then, if it succeeds in connecting to the server, it will output:

Registered to the central server successfully! 
Start pattern mining work! Max gram = 3, mode = Depth_First
Start thread 0 from xxxx  to xxxxx 

At the same time, the server terminal should output:

A new worker connected! ClientID = 
ClientUID=bc4c1a62-9f54-4859-bcb8-94f19fd19332
xxxx received patterns parsed...

Start all your other clients in the same way.

  • Finish Mining

When a client finishes mining, it will output:

100% completed in Thread x.
Finished mining 1~x gram patterns.

processedLinkNum = xxxxxx 
Totally xxxxxxx patterns found!
Current pattern mining worker finished working! Total time: xxxxxx seconds. 

It then reports to the central server that this client has stopped. If the central server receives the report, the client outputs the following message and quits:

Report to the central server this worker stopped successfully! 
Client quited!

At the same time the server will output:

Got request to ReportWorkerStop: ClientID = 
ClientUID=bc4c1a62-9f54-4859-bcb8-94f19fd19332

When all the clients have reported that they finished working, the server will output:

All connected clients have finished and all the received patterns have been parsed by the server.
Enter 'y' or 'yes' to start evaluating pattern interestingness.
Enter any other words to keep waiting for more clients to connect

If you still have new clients that need to connect to the server, enter anything else. If you enter "y" or "yes", the server will start the interestingness evaluation and then output the result files.

  • Continue mining after network problems

If a network problem occurs, the mining should continue automatically once the network recovers, as long as you do not shut down the server and client processes.

Use cases

Here's a list of pattern miner use cases. This is handy to drive the pattern miner development, its API, etc. Ideally each of these should be turned into examples put under [1], as well as unit tests.

  • Mining of inference histories, for inference control
  • Mining of dialogue histories, for learning dialogue patterns (or more generally, verbal/nonverbal interaction patterns)
  • Mining of sets of genomic datasets or medical patient records, to find surprisingly common combinations of features
  • Mining of surprising combinations of visual features in the output of a relatively "disentangled" deep NN (such as the pyramid-of-InfoGANs that Ralf, Selameab, Tesfa, Yenat and I are working on)
  • Mining of surprising combinations of semantic relationships, in the R2L output of a large number of simple sentences read into Atomspace
  • Mining of surprising combinations of syntactic relationships, in an Atomspace containing a set of syntactic relationships corresponding to each word in the dictionary of a given language (to be done iteratively within the language learning algorithm Linas is implementing)
  • Mining of surprising (link-parser link combination, Lojban-Atomese-output combination) pairs, in a corpus of (link parses, Lojban-Atomese outputs) obtained from a parallel (English, Lojban) corpus