Pattern Miner Scheme Functions
- 1 Getting Started
- 2 Interfaces for Settings
- 3 Interfaces for Running Pattern Miner
- 4 Next Steps
Step 1. Start guile REPL:
~ $ guile
Step 2. Load the Pattern Miner scm module:
guile> (use-modules (opencog) (opencog patternminer))
Debug: PatternMiner init finished!
If you are going to load a corpus that contains particular Link types, you also need to load the modules that define them. For example, to load an NLP corpus, also run:
guile> (use-modules (opencog nlp) (opencog nlp chatbot) (opencog nlp relex2logic) (opencog atom-types))
Step 3. Load corpus into AtomSpace
Before running any test with Pattern Miner, you need to load your corpus (in scm format) into the AtomSpace. Some test corpora are available here if you need them: https://github.com/opencog/test-datasets/tree/master/pattern%20miner
guile> (load "your path/yourCorpusFileName.scm")
guile> (load "../opencog/learning/PatternMiner/testdatasmall.scm")
If it is loaded successfully, the last Link in this corpus will be output in the guile shell, for example:
(EvaluationLink (stv 1 1)
  (PredicateNode "religion")
  (ListLink (stv 1 1)
    (ConceptNode "Bill_Clinton")
    (ConceptNode "Baptists")
  )
)
guile>
Interfaces for Settings
All the default settings can be found and set in the config file: /usr/local/etc/opencog_patternminer.conf. Most of them can also be modified via the scm interface in the guile shell (see below).
Print all the current settings of Pattern Miner. No parameter.
Current all settings:
max_gram: 3
enable_Interesting_Pattern: true
Frequency_threshold: 2
Ignore_Link_Types: ListLink
use_keyword_black_list: false
use_keyword_white_list: false
keyword_black_list: this that these those it he him her she
keyword_white_list: United_States Saudi_Arabia
keyword_white_list_logic:OR
A bunch of interfaces to query a particular setting (no parameters):
pm-get-pattern-max-gram
pm-get-enable-interesting-pattern
pm-get-frequency-threshold
pm-get-ignore-link-types
pm-get-use-keyword-black-list
pm-get-use-keyword-white-list
pm-get-keyword-black-list
pm-get-keyword-white-list
pm-get-keyword-white-list-logic
pm-get-enable-filter-node-types-should-not-be-vars
pm-get-node-types-should-not-be-vars
pm-get-enable-filter-links-of-same-type-not-share-second-outgoing
pm-get-same-link-types-not-share-second-outgoing
A bunch of interfaces to set a particular setting and filter
pm-set-pattern-max-gram // parameter: int, usually smaller than 6, because most real-life patterns are not that deep
pm-set-enable-interesting-pattern // parameter: bool, whether to evaluate interestingness after mining
pm-set-frequency-threshold // parameter: int, the threshold for output results. All patterns are mined, even patterns that have only 1 occurrence.

// ignore-links are Link types that will be skipped during mining
pm-add-ignore-link-type // parameter: string, the Link type name to add, e.g.: AndLink
pm-remove-ignore-link-type // parameter: string, the Link type name to remove, e.g.: OrLink

// keyword-black-list: any Link that contains a keyword in this list will be skipped during mining
pm-set-use-keyword-black-list // parameter: bool, whether to use the keyword-black-list
pm-add-keyword-to-black-list // parameter: string, the keyword to add, e.g.: "Physics"
pm-add-keywords-to-black-list // parameter: string, the keywords to add, e.g.: "Physics, Chemistry, Biology"
pm-remove-keyword-from-black-list // parameter: string, the keyword to remove, e.g.: "Physics"
pm-clear-keyword-black-list // no parameter, this will empty the keyword-black-list

// keyword-white-list is used in 2 functions:
// (1). select a subset from the loaded corpus which contains the keywords in this list (see pm-select-whitelist-subset-from-atomspace)
// (2). query all the patterns that contain keywords in this list (see pm-apply-whitelist-keyword-filter-after-mining)
pm-set-use-keyword-white-list // parameter: bool, whether to use the keyword-white-list
pm-set-keyword-white-list-logic // parameter: string "AND" or "OR", determines whether a match must contain all the keywords in this list, or only one of them
pm-add-keyword-to-white-list // parameter: string, the keyword to add, e.g.: "Physics"
pm-add-keywords-to-white-list // parameter: string, the keywords to add, e.g.: "Physics, Chemistry, Biology"
pm-remove-keyword-from-white-list // parameter: string, the keyword to remove, e.g.: "Physics"
pm-clear-keyword-white-list // no parameter, this will empty the keyword-white-list

// When enable-filter-node-types-should-not-be-vars is true,
// the Node Types in node-types-should-not-be-vars will not become VariableNodes in patterns.
// You can also configure it in the config file: node_types_should_not_be_vars
pm-set-enable-filter-node-types-should-not-be-vars // parameter: bool, whether to enable this filter
pm-add-node-type-to-node-types-should-not-be-vars // parameter: string, a Node Type to be added to this filter
pm-remove-node-type-from-node-types-should-not-be-vars // parameter: string, the Node Type to be removed from this filter
pm-clear-node-types-should-not-be-vars // no parameter, this will empty the node-types-should-not-be-vars list

// When enable-filter-links-of-same-type-not-share-second-outgoing is true,
// 2 or more Links of the same Type in the same-link-types-not-share-second-outgoing list will not share their second outgoings.
// You can also configure it in the config file: same_link_types_not_share_second_outgoing
pm-set-enable-filter-links-of-same-type-not-share-second-outgoing // parameter: bool, whether to enable this filter
pm-add-link-type-to-same-link-types-not-share-second-outgoing // parameter: string, a Link Type to be added to this filter
pm-remove-link-type-from-same-link-types-not-share-second-outgoing // parameter: string, the Link Type to be removed from this filter
pm-clear-same-link-types-not-share-second-outgoing // no parameter, this will empty the same-link-types-not-share-second-outgoing list
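As a rough illustration of how the setters and getters listed above might be combined in a guile session, here is a minimal sketch. It assumes each function takes its single argument directly in the form described in its comment (int, bool as #t/#f, or string) and that the modules from Step 2 are already loaded; what each call prints or returns is not specified on this page, so none is shown.

```scheme
;; Sketch only: assumes (use-modules (opencog) (opencog patternminer))
;; has already been run, as in Step 2.

;; Mine patterns of up to 3 grams, output only those seen at least twice.
(pm-set-pattern-max-gram 3)
(pm-set-frequency-threshold 2)

;; Evaluate interestingness after mining.
(pm-set-enable-interesting-pattern #t)

;; Skip ListLink during mining.
(pm-add-ignore-link-type "ListLink")

;; Turn on the keyword black list and add several keywords at once.
(pm-set-use-keyword-black-list #t)
(pm-add-keywords-to-black-list "Physics, Chemistry, Biology")

;; Query individual settings (getters take no parameters).
(pm-get-pattern-max-gram)
(pm-get-keyword-black-list)
```

Settings changed this way apply to the current session; the defaults still come from the config file.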
Interfaces for Running Pattern Miner
Load the corpus into the AtomSpace before running this function. Mining can take a long time if the corpus is large (larger than 3M) or pattern-max-gram is larger than 3, so make sure you have enough RAM before running it. When the Pattern Miner starts to run, it will output the following in the cogserver terminal tab:
Debug: PatternMining start!
Max gram = 3, mode = Depth_First
Using 1 threads.
Corpus size: 6564 links in total.
Start thread 0: will process Link number from 0 to (excluded) 6564
2.5482% completed in Thread 0
When the mining is finished, all the patterns with a frequency higher than the threshold will be output to files and in the terminal it will output:
100% completed in Thread 0.d 0...
Finished mining 1~3 gram patterns.
processedLinkNum = 6564
PatternMiner: mining finished!
PatternMiner: done frequent pattern mining for 1 to 3gram patterns!
gram = 1: 7918 patterns found!
Debug: PatternMiner: writing (gram = 1) frequent patterns to file FrequentPatterns_1gram.scm
gram = 2: 19846 patterns found!
Debug: PatternMiner: writing (gram = 2) frequent patterns to file FrequentPatterns_2gram.scm
gram = 3: 140519 patterns found!
Debug: PatternMiner: writing (gram = 3) frequent patterns to file FrequentPatterns_3gram.scm
Then, if enable-interesting-pattern is set to true, it will start to evaluate the interestingness of all the patterns with a frequency higher than the threshold. Currently, a surprisingness measure (see Measuring Surprisingness) is used to evaluate interestingness. The surprisingness measure needs to evaluate the subpatterns and superpatterns of a pattern, so only 2 to (max_gram - 1) gram patterns can be evaluated. It will output:
Calculating interestingness for 2 gram patterns by evaluating surprisingness
12.5658% completed.
When it finishes, the results will be output to files and in the terminal it will output:
100% completed.
PatternMiner: done (gram = 2) interestingness evaluation!
19846 patterns found! Outputting to file ...
Debug: PatternMiner: writing (gram = 2) interesting patterns to file SurprisingnessI_2gram.scm
Debug: PatternMiner: writing (gram = 2) interesting patterns to file SurprisingnessII_2gram.scm
surprisingness_II_threshold for 2 gram = 0.5
Debug: PatternMiner: writing (gram = 2) final top patterns to file FinalTopPatterns_2gram.scm
And then it will continue on calculating the next gram. When everything is finished, it will output:
Pattern Mining Finished! Total time: xxxx seconds.
guile> (pm-reset-patternminer #t)
Reset the pattern miner. This is usually called after mining to clean up and get ready for the next mining run. It will discard all the pattern mining results. Note that if Pattern Miner is run again, the result files will be overwritten, so copy them somewhere else if you want to keep them. The parameter is a Boolean: true resets all the settings from the config file; false keeps the current settings.
This function (referenced above as pm-apply-whitelist-keyword-filter-after-mining) is called after mining to query all the patterns that contain the words in the keyword whitelist. The keyword_white_list_logic can be AND or OR. AND: a pattern needs to contain all the words in the whitelist to be selected; OR: a pattern only needs to contain one of the words in the whitelist. Both keyword_white_list_logic and keyword_white_list can be set (see the settings interfaces above). The results will be output to files. The output:
PatternMiner: applying keyword white list (OR) filter:United_States Saudi_Arabia
gram = 1: 15 patterns found after filtering!
Debug: PatternMiner: writing (gram = 1) frequent patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-FrequentPatterns_1gram.scm
gram = 2: 14 patterns found after filtering!
Debug: PatternMiner: writing (gram = 2) frequent patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-FrequentPatterns_2gram.scm
gram = 3: 7 patterns found after filtering!
Debug: PatternMiner: writing (gram = 3) frequent patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-FrequentPatterns_3gram.scm
Debug: PatternMiner: writing (gram = 2) interesting patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-SurprisingnessI_2gram.scm
Debug: PatternMiner: writing (gram = 2) interesting patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-SurprisingnessII_2gram.scm
Debug: PatternMiner: writing (gram = 3) interesting patterns to file WhiteKeyWord-OR-United_States-Saudi_Arabia-SurprisingnessI_3gram.scm
applyWhiteListKeywordfilterAfterMining finished!
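To make the difference between the AND and OR whitelist logic concrete, the following is a small stand-alone Guile sketch. It is not the Pattern Miner's implementation: the helpers contains-keyword? and passes-white-list? are hypothetical names, and a pattern is simplified to a plain string so that the example runs without OpenCog installed.

```scheme
(use-modules (srfi srfi-1))   ; provides `every` and `any`

;; Hypothetical helper: #t if the pattern text mentions the keyword.
(define (contains-keyword? pattern-text keyword)
  (if (string-contains pattern-text keyword) #t #f))

;; Hypothetical helper mirroring keyword_white_list_logic.
;; logic is the string "AND" or "OR".
(define (passes-white-list? pattern-text keywords logic)
  (let ((hit? (lambda (kw) (contains-keyword? pattern-text kw))))
    (if (string=? logic "AND")
        (every hit? keywords)   ; all keywords must appear
        (any hit? keywords))))  ; one keyword is enough

(define white-list '("United_States" "Saudi_Arabia"))

;; A simplified pattern that mentions only one of the two keywords.
(define pattern "(ConceptNode \"United_States\") (VariableNode \"$var_1\")")

(display (passes-white-list? pattern white-list "OR"))  (newline) ; prints #t
(display (passes-white-list? pattern white-list "AND")) (newline) ; prints #f
```

With the OR logic this pattern would be kept, while with AND it would be dropped because Saudi_Arabia does not appear in it.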
guile> (pm-select-subset-from-atomspace "United_States,Saudi_Arabia" 3)
After loading a corpus into the AtomSpace, call this function to select a subset of the current AtomSpace. The subset includes all the Links that contain any of the keywords in the first parameter, plus the neighboring Links connected to them within distance N (N is given by the second parameter). Note that the keywords in the first parameter should be separated by ",". The resulting subset corpus will be output to a file. The output:
Selecting a subset from loaded corpus in Atomspace for the following keywords within 3 distance:
United_States Saudi_Arabia
Done! Subset size: 1564 Links in total.
The subset has been written to file: SubSet-United_States-Saudi_Arabia.scm
The same as pm-select-subset-from-atomspace, but it uses the words in the keyword_white_list instead of taking the keywords as a parameter. The single parameter is the distance N.
guile> (pm-select-whitelist-subset-from-atomspace 2)
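A possible way to prepare the whitelist before making this call is sketched below. It uses only functions listed in the settings section and assumes the keyword string uses the same comma-separated form as the examples there. Whether use_keyword_white_list must also be enabled for this function is not stated on this page, so the sketch leaves that setting untouched.

```scheme
;; Sketch only: assumes the Pattern Miner module and a corpus are already loaded.

;; Start from an empty whitelist, then add the keywords of interest.
(pm-clear-keyword-white-list)
(pm-add-keywords-to-white-list "United_States, Saudi_Arabia")

;; Select all Links containing a whitelisted keyword,
;; plus their connected neighbors within distance 2.
(pm-select-whitelist-subset-from-atomspace 2)
```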
To improve the API for the Pattern Miner, we are collecting some information on Pattern Miner Prospective Examples here...