Pattern Miner Scheme Functions

Getting Started

(See the related pages for more details: Scheme, Pattern miner.)

Step 1. Start the guile REPL:

~ $ guile

Output:

guile>

Step 2. Load the Pattern Miner scm module:

guile> (use-modules (opencog) (opencog patternminer))

Output:

Debug: PatternMiner init finished!

If the corpus you are going to load contains particular Link types, you also need to load the modules that define them. For example, to load an NLP corpus, also run:

guile> (use-modules (opencog nlp) (opencog nlp chatbot) (opencog nlp relex2logic) (opencog atom-types))

Step 3. Load the corpus into the AtomSpace:

Before running any test with the Pattern Miner, you need to load your corpus (in scm format) into the AtomSpace. Some test corpora are available here if you need them: https://github.com/opencog/test-datasets/tree/master/pattern%20miner

guile> (load "your path/yourCorpusFileName.scm")

Example:

guile> (load "../opencog/learning/PatternMiner/testdatasmall.scm")

If the corpus is loaded successfully, the last Link in the corpus is printed in the guile shell, for example:

(EvaluationLink (stv 1 1)
   (PredicateNode "religion")
   (ListLink (stv 1 1)
      (ConceptNode "Bill_Clinton")
      (ConceptNode "Baptists")
   )
)

guile>

Interfaces for Settings

All the default settings can be found and changed in the config file: /usr/local/etc/opencog_patternminer.conf. Most of them can also be modified via the scm interface in the guile shell (see below).

pm-get-current-settings

Print all the current settings of Pattern Miner. No parameter.

guile> (pm-get-current-settings)

Output:

Current all settings:
max_gram: 3
enable_Interesting_Pattern: true
Frequency_threshold: 2
Ignore_Link_Types: ListLink
use_keyword_black_list: false
use_keyword_white_list: false
keyword_black_list: this that these those it he him her she
keyword_white_list: United_States Saudi_Arabia
keyword_white_list_logic:OR

Interfaces to query a particular setting (no parameters):

pm-get-pattern-max-gram
pm-get-enable-interesting-pattern
pm-get-frequency-threshold
pm-get-ignore-link-types
pm-get-use-keyword-black-list
pm-get-use-keyword-white-list
pm-get-keyword-black-list
pm-get-keyword-white-list
pm-get-keyword-white-list-logic
pm-get-enable-filter-node-types-should-not-be-vars
pm-get-node-types-should-not-be-vars
pm-get-enable-filter-links-of-same-type-not-share-second-outgoing
pm-get-same-link-types-not-share-second-outgoing

Interfaces to change a particular setting or filter:

pm-set-pattern-max-gram // parameter: int, usually smaller than 6, because most real-life patterns are not that deep
pm-set-enable-interesting-pattern // parameter: bool, whether to evaluate interestingness after mining
pm-set-frequency-threshold // parameter: int, the threshold for output results only; all patterns are still mined, even those with only 1 occurrence

// ignore-links are Link types that will be skipped during mining
pm-add-ignore-link-type // parameter: string, the Link type name to add, e.g.: AndLink
pm-remove-ignore-link-type // parameter: string, the Link type name to remove, e.g.: OrLink

// keyword-black-list: any Link that contains a keyword in this list will be skipped during mining
pm-set-use-keyword-black-list // parameter: bool, whether to use the keyword-black-list
pm-add-keyword-to-black-list // parameter: string, the keyword to add, e.g.: "Physics"
pm-add-keywords-to-black-list // parameter: string, the keywords to add, e.g.: "Physics, Chemistry, Biology"
pm-remove-keyword-from-black-list // parameter: string, the keyword to remove, e.g.: "Physics"
pm-clear-keyword-black-list // no parameter, empties the keyword-black-list

// keyword-white-list is used by 2 functions:
// (1). selecting the subset of the loaded corpus that contains the keywords in this list (see pm-select-whitelist-subset-from-atomspace)
// (2). querying all the patterns that contain keywords in this list (see pm-apply-whitelist-keyword-filter-after-mining)
pm-set-use-keyword-white-list // parameter: bool, whether to use the keyword-white-list
pm-set-keyword-white-list-logic // parameter: string "AND" or "OR", whether a match must contain all the keywords in this list (AND) or only one of them (OR)
pm-add-keyword-to-white-list // parameter: string, the keyword to add, e.g.: "Physics"
pm-add-keywords-to-white-list // parameter: string, the keywords to add, e.g.: "Physics, Chemistry, Biology"
pm-remove-keyword-from-white-list // parameter: string, the keyword to remove, e.g.: "Physics"
pm-clear-keyword-white-list // no parameter, empties the keyword-white-list

// When enable-filter-node-types-should-not-be-vars is true,
// the Node types in node-types-should-not-be-vars will not become VariableNodes in patterns.
// You can also set this in the config file: node_types_should_not_be_vars
pm-set-enable-filter-node-types-should-not-be-vars // parameter: bool, whether to enable this filter
pm-add-node-type-to-node-types-should-not-be-vars // parameter: string, the Node type to add to this filter
pm-remove-node-type-from-node-types-should-not-be-vars // parameter: string, the Node type to remove from this filter
pm-clear-node-types-should-not-be-vars // no parameter, empties the node-types-should-not-be-vars list

// When enable-filter-links-of-same-type-not-share-second-outgoing is true,
// 2 or more Links of the same type in the same-link-types-not-share-second-outgoing list will not share their second outgoing.
// You can also set this in the config file: same_link_types_not_share_second_outgoing
pm-set-enable-filter-links-of-same-type-not-share-second-outgoing // parameter: bool, whether to enable this filter
pm-add-link-type-to-same-link-types-not-share-second-outgoing // parameter: string, the Link type to add to this filter
pm-remove-link-type-from-same-link-types-not-share-second-outgoing // parameter: string, the Link type to remove from this filter
pm-clear-same-link-types-not-share-second-outgoing // no parameter, empties the same-link-types-not-share-second-outgoing list
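
As a minimal sketch, the setters listed above might be combined like this before a mining run. The argument forms (plain integers, #t/#f for booleans, quoted strings for type names and keywords) are assumed from the parameter descriptions above and from the (pm-reset-patternminer #t) call on this page; the chosen values are illustrative only.

```scheme
;; Assumes the modules from Step 2 are loaded and a corpus is in the AtomSpace.
(pm-set-pattern-max-gram 3)             ; int: mine 1- to 3-gram patterns
(pm-set-frequency-threshold 2)          ; int: threshold for output results
(pm-set-enable-interesting-pattern #t)  ; bool: evaluate interestingness after mining

(pm-add-ignore-link-type "AndLink")     ; string: a Link type to skip during mining

(pm-set-use-keyword-black-list #t)      ; bool: turn the keyword-black-list on
(pm-add-keywords-to-black-list "Physics, Chemistry, Biology")

(pm-get-current-settings)               ; print the settings to check the result
```

Calling pm-get-current-settings at the end is a simple way to confirm that each change was accepted before starting a long run.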

Interfaces for Running Pattern Miner

pm-run-patternminer

guile> (pm-run-patternminer)

Load the corpus into the AtomSpace before running this function. Mining can take a long time if the corpus is large (larger than 3M) or pattern-max-gram is larger than 3, so make sure you have enough RAM before running it. When the Pattern Miner starts to run, the cogserver terminal tab will output:

Debug: PatternMining start! Max gram = 3, mode = Depth_First
Using 1 threads. 
Corpus size: 6564 links in total. 

Start thread 0: will process Link number from 0 to (excluded) 6564
2.5482% completed in Thread 0

When the mining is finished, all the patterns with a frequency higher than the threshold will be output to files and in the terminal it will output:

100% completed in Thread 0.d 0...
Finished mining 1~3 gram patterns.

processedLinkNum = 6564
PatternMiner:  mining finished!
PatternMiner:  done frequent pattern mining for 1 to 3gram patterns!
gram = 1: 7918 patterns found! 
Debug: PatternMiner: writing  (gram = 1) frequent patterns to file FrequentPatterns_1gram.scm

gram = 2: 19846 patterns found! 
Debug: PatternMiner: writing  (gram = 2) frequent patterns to file FrequentPatterns_2gram.scm

gram = 3: 140519 patterns found! 
Debug: PatternMiner: writing  (gram = 3) frequent patterns to file FrequentPatterns_3gram.scm

Then, if enable-interesting-pattern is set to true, the miner starts to evaluate the interestingness of all the patterns with a frequency higher than the threshold. Currently, the surprisingness measure (see Measuring Surprisingness) is used to evaluate interestingness. The surprisingness measure needs to evaluate the subpatterns and superpatterns of a pattern, so only 2 ~ (max_gram - 1) gram patterns can be evaluated. It will output:

Calculating interestingness for 2 gram patterns by evaluating surprisingness
12.5658% completed.

When it finishes, the results will be output to files and in the terminal it will output:

100% completed.PatternMiner:  done (gram = 2) interestingness evaluation!19846 patterns found! Outputting to file ... 
Debug: PatternMiner: writing (gram = 2) interesting patterns to file SurprisingnessI_2gram.scm

Debug: PatternMiner: writing (gram = 2) interesting patterns to file SurprisingnessII_2gram.scm
surprisingness_II_threshold for 2 gram = 0.5
Debug: PatternMiner: writing (gram = 2) final top patterns to file FinalTopPatterns_2gram.scm

It then continues with the next gram. When everything is finished, it will output:

Pattern Mining Finished! Total time: xxxx seconds.

pm-reset-patternminer

guile> (pm-reset-patternminer #t)

Reset the Pattern Miner. This is usually called after mining to clean up and get ready for the next mining run. It discards all the pattern mining results. Note that if the Pattern Miner is run again, the result files will be overwritten, so copy them somewhere else if you want to keep them. The parameter is a Boolean: true resets all the settings from the config file; false keeps the current settings.
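
Because result files are overwritten by the next run, it may help to copy them aside before resetting. The following is a sketch only: backup-result-file and the /tmp/pm-results directory are hypothetical names, the file names are taken from the log output shown earlier, and the files are assumed to be in guile's current working directory.

```scheme
;; Hypothetical backup location (an assumption, not a Pattern Miner setting).
(define backup-dir "/tmp/pm-results")
(if (not (file-exists? backup-dir))
    (mkdir backup-dir))

;; Hypothetical helper (not part of the Pattern Miner API): copy a result
;; file into dir if the file exists; do nothing otherwise.
(define (backup-result-file name dir)
  (if (file-exists? name)
      (copy-file name (string-append dir "/" name))))

(for-each (lambda (name) (backup-result-file name backup-dir))
          '("FrequentPatterns_1gram.scm"
            "FrequentPatterns_2gram.scm"
            "FrequentPatterns_3gram.scm"))

;; Then reset while keeping the current settings, adjust one, and mine again.
;; These calls need the modules from Step 2, so they are left commented out.
;; (pm-reset-patternminer #f)
;; (pm-set-frequency-threshold 5)
;; (pm-run-patternminer)
```

Passing #f to pm-reset-patternminer keeps any settings changed in the current session, so only the threshold differs between the two runs.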

pm-apply-whitelist-keyword-filter-after-mining

guile> (pm-apply-whitelist-keyword-filter-after-mining)

Call this after mining to query all the patterns that contain the words in the keyword whitelist. The keyword_white_list_logic can be AND or OR. AND: a pattern needs to contain all the words in the whitelist to be selected; OR: a pattern only needs to contain one of the words in the whitelist. keyword_white_list_logic and keyword_white_list can be set via the settings interfaces above. The results will be output to files. The output:

PatternMiner:  applying keyword white list (OR) filter:United_States Saudi_Arabia
gram = 1: 15 patterns found after filtering! 
Debug: PatternMiner: writing  (gram = 1) frequent patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-FrequentPatterns_1gram.scm

gram = 2: 14 patterns found after filtering! 
Debug: PatternMiner: writing  (gram = 2) frequent patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-FrequentPatterns_2gram.scm

gram = 3: 7 patterns found after filtering! 
Debug: PatternMiner: writing  (gram = 3) frequent patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-FrequentPatterns_3gram.scm


Debug: PatternMiner: writing (gram = 2) interesting patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-SurprisingnessI_2gram.scm

Debug: PatternMiner: writing (gram = 2) interesting patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-SurprisingnessII_2gram.scm

Debug: PatternMiner: writing (gram = 3) interesting patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-SurprisingnessI_3gram.scm

applyWhiteListKeywordfilterAfterMining finished!
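
The AND/OR whitelist logic described above can be sketched in plain Guile. This is an illustration of the selection rule only, not the Pattern Miner's implementation: matches-white-list? is a hypothetical helper, and it treats a pattern as a text string, which is a simplifying assumption.

```scheme
(use-modules (srfi srfi-1))   ; any, every

;; Hypothetical helper: does the pattern text satisfy the whitelist?
;; logic "AND" requires every keyword; anything else is treated as "OR".
(define (matches-white-list? pattern-text keywords logic)
  (let ((has? (lambda (kw)
                (if (string-contains pattern-text kw) #t #f))))
    (if (string=? logic "AND")
        (every has? keywords)
        (any has? keywords))))

;; Pattern text built from the example Link shown in Step 3.
(define pattern-text
  "(EvaluationLink (PredicateNode \"religion\") (ListLink (ConceptNode \"Bill_Clinton\") (ConceptNode \"Baptists\")))")

(display (matches-white-list? pattern-text '("Bill_Clinton" "Physics") "OR"))
(newline)   ; #t: one keyword is present
(display (matches-white-list? pattern-text '("Bill_Clinton" "Physics") "AND"))
(newline)   ; #f: "Physics" is missing
```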

pm-select-subset-from-atomspace

guile> (pm-select-subset-from-atomspace "United_States,Saudi_Arabia" 3)

After loading a corpus into the AtomSpace, call this function to select a subset of the current AtomSpace: all the Links that contain any of the keywords in the first parameter, plus the neighboring Links connected to them within distance N (N is specified by the second parameter). Note that the keywords in the first parameter should be separated by ",". The resulting subset corpus will be output to a file. The output:

Selecting a subset from loaded corpus in Atomspace for the following keywords within 3 distance:
United_States
Saudi_Arabia

Done! Subset size: 1564 Links in total. The subset has been written to file:  SubSet-United_States-Saudi_Arabia.scm
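
The idea of keyword Links plus their N-distance neighbors can be sketched on toy data in plain Guile. This is an assumption-laden illustration, not the miner's actual algorithm: each toy "link" is just a list of node names, two links count as connected when they share a node, and select-subset, shares-node? and contains-keyword? are hypothetical helpers.

```scheme
(use-modules (srfi srfi-1))   ; any, filter

;; Two toy links are neighbors if they have at least one node in common.
(define (shares-node? a b)
  (any (lambda (n) (member n b)) a))

(define (contains-keyword? link keywords)
  (any (lambda (kw) (member kw link)) keywords))

;; Start from the links containing a keyword, then grow the selection
;; by one neighbor step, `distance` times.
(define (select-subset links keywords distance)
  (let loop ((selected (filter (lambda (l) (contains-keyword? l keywords))
                               links))
             (d distance))
    (if (<= d 0)
        selected
        (loop (filter (lambda (l)
                        (any (lambda (s) (shares-node? l s)) selected))
                      links)
              (- d 1)))))

;; Made-up toy corpus: a chain of three links and one unrelated link.
(define toy-links
  '(("United_States" "a") ("a" "b") ("b" "c") ("x" "y")))

(display (select-subset toy-links '("United_States") 1))
(newline)   ; the keyword link and its direct neighbor
```

With a distance of 2 the third link of the chain is also selected, while the unrelated link is never reached.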

pm-select-whitelist-subset-from-atomspace

The same as pm-select-subset-from-atomspace, but it uses the words in the keyword_white_list instead of a keyword parameter.

guile> (pm-select-whitelist-subset-from-atomspace 2)
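
A possible sequence for preparing the whitelist before this call is sketched below. It uses only functions listed on this page; the string argument form is assumed from the parameter descriptions, and reading the single integer as the neighbor distance is inferred from pm-select-subset-from-atomspace rather than stated explicitly.

```scheme
;; Assumes the modules from Step 2 are loaded and a corpus is in the AtomSpace.
(pm-clear-keyword-white-list)                   ; start from an empty whitelist
(pm-add-keyword-to-white-list "United_States")
(pm-add-keyword-to-white-list "Saudi_Arabia")
(pm-get-keyword-white-list)                     ; check the list contents
(pm-select-whitelist-subset-from-atomspace 2)   ; 2: assumed neighbor distance
```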

Next Steps

To improve the API for the Pattern Miner, we are collecting some information on Pattern Miner Prospective Examples here...