Done as part of CS-570: Data Mining, instructed by Dr. Carl Yang.
- The Data
- Further Preprocessing
- Partitioning
- Mining Frequent Patterns for Each Topic
- Mining Maximal / Closed Patterns
- Re-rank by Purity of Patterns
- Mapping Term ID Numbers to Terms from the Vocabulary
- Source Code and Method Details
- Additional Information
The data consists of paper titles from computer science conferences in 5 domains: Data Mining (DM), Machine
Learning (ML), Database (DB), Information Retrieval (IR), and Theory (TH). The raw data is available in
paper_raw.txt. Each line contains two columns, PaperID and Title of a paper, separated by
Tab(‘\t’). Each PaperID is unique in the whole data set. However, since this data is a subset of the paper titles
in a larger corpus, PaperIDs do not start from 0 and are not consecutive. The raw data looks like this:
| PaperID | Title |
|---|---|
| 7600 | The Automatic Acquisition of Proof Methods |
| 85825 | Frequent pattern discovery with memory constraint |
The raw data has been pre-processed by removing stop words, converting the words to lower case, and lemmatizing. The
processed data is available in paper.txt. In this file, each line is a PaperID followed by a list of
terms. The format is PaperID Tab(‘\t’) term1 Space(‘ ’) term2 Space(‘ ’) .... paper.txt looks like this:
| PaperID | [term1] [term2] [term3] [term4] ... [termN] |
|---|---|
| 7600 | automatic acquisition proof method |
| 85825 | frequent pattern discovery memory constraint |
We will be working on the pre-processed data in paper.txt.
This step prepares input for LDA (Latent Dirichlet Allocation). Two files will be generated based on paper.txt.
First, generate a vocabulary from paper.txt. The vocabulary file will be named vocab.txt. Each line in this file is a unique term extracted from paper.txt; each term should appear exactly once. Here is an example of the first five lines of vocab.txt. Note that the order of the terms may differ.
automatic
acquisition
proof
method
philosophical
...
A sample can be found in vocab.txt.
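As a sketch of this step, the vocabulary can be built by assigning ids to terms in first-seen order (so your vocab.txt order may differ from the sample; the function names here are illustrative, not part of the assignment):

```python
def build_vocab(lines):
    """Map each unique term to an integer id, in order of first appearance.

    `lines` is an iterable of paper.txt lines: "PaperID\tterm1 term2 ...".
    """
    vocab = {}
    for line in lines:
        _paper_id, _, terms = line.rstrip("\n").partition("\t")
        for term in terms.split():
            vocab.setdefault(term, len(vocab))
    return vocab

def write_vocab(vocab, vocab_path="vocab.txt"):
    # dicts preserve insertion order (Python 3.7+), so ids match line numbers
    with open(vocab_path, "w", encoding="utf-8") as out:
        out.write("\n".join(vocab) + "\n")
```

Usage: `write_vocab(build_vocab(open("paper.txt", encoding="utf-8")))`.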
In this step, each title is represented as a sparse vector of word counts. Each line in
paper.txt is transformed into the following format ([M] (space) [t1]:[c1] (space) [t2]:[c2] (space) ...):
[M] [term 1]:[count] [term 2]:[count] ... [term N]:[count]
where [M] is the number of unique terms in the title, and the [count] associated with each term is the number of times
that term appears in the title. For example, suppose paper.txt contains the title “automatic acquisition proof method”. The corresponding line in the output file should be “4 0:1 1:1 2:1 3:1”. Note that
[term_i] is an integer indexing a term in vocab.txt; indices start from 0. The output file is named
title.txt.
A sample is available in title.txt.
Note: if you use any other LDA package than the one mentioned in the next step, please make sure the format of title.txt matches what your LDA package requires.
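A single title can be converted to this sparse format as follows (a minimal sketch; the function name is illustrative):

```python
from collections import Counter

def title_line(terms, vocab):
    """Render one title as "[M] [id]:[count] ...".

    `terms` is the list of terms on one paper.txt line;
    `vocab` maps term -> integer id (as built in the previous step).
    """
    counts = Counter(vocab[t] for t in terms)
    parts = [f"{tid}:{c}" for tid, c in sorted(counts.items())]
    return f"{len(counts)} " + " ".join(parts)
```

For the example title above, `title_line("automatic acquisition proof method".split(), vocab)` yields `"4 0:1 1:1 2:1 3:1"` when those four terms have ids 0–3.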
Recall that paper titles were collected from conferences of 5 domains in computer science. If we run frequent pattern mining algorithms directly on paper titles, the patterns would be independent of topics. Thus, if we want to mine frequent patterns in each domain, titles and terms should be partitioned into 5 domains. Note that the domain knowledge is not known to you. Instead, we apply topic modeling to uncover the hidden topics behind titles and terms automatically. Specifically, we apply LDA to assign a topic to each term.
- Download the LDA package here. Unzip it, and you will see a list of source files.
- Open a terminal and go to the directory of the source code. Type make. An executable lda will be generated.
- In the lda-c-dist directory, there is a settings.txt file. You could use the settings provided there, but feel free to tune the parameters if you know how inference works for LDA:
  var max iter 20
  var convergence 1e-6
  em max iter 100
  em convergence 1e-4
  alpha estimate
- Run LDA with the following command:
  <dir>/lda-c-dist/lda est 0.001 5 <dir>/lda-c-dist/settings.txt <dir>/title.txt random <dir>/result
  where <dir> is your own working directory, 0.001 is the prior for topic proportions (it is just a parameter, and you don't have to change it if you are unsure), 5 represents 5 topics, and title.txt is the file you generated in Step 2. The output will be put into the result directory (a sample can be found in result/).
- Check your output. In the result directory, open the file word-assignments.dat. Each line is of the form ([M] (space) [t1]:[k1] (space) [t2]:[k2] (space) ...):
  [M] [term 1]:[topic] [term 2]:[topic] ... [term N]:[topic]
  where [M] is the number of unique terms in the title, and the [topic] associated with each term is the topic assigned to it. Topic numbers start from 0. One line could be:
  004 0000:02 0003:02 0001:02 0002:02
  which means that all terms in this title are assigned to the 3rd topic (topic 02). Note that you are not confined to this package; another option is Topic Modeling | Mallet. A sample is available in result/word-assignments.dat.
You are asked to re-organize the terms by the 5 topics. For the i-th topic, you should create a file named topic-i.txt.
Separate the terms on each line of word-assignments.dat by the topics assigned to them. For example, the lines in
word-assignments.dat can be viewed in the following form (note that here we replace the integer ids with the real terms
for better illustration):
004 automatic:02 acquisition:02 proof:02 method:02
005 classification:03 noun:02 phrase:03 concept:01 individual:03
Then your output files would be:
topic-0.txt
...
topic-1.txt
concept
...
topic-2.txt
automatic acquisition proof method
noun
...
topic-3.txt
classification phrase individual
...
In the real files, each term should be represented by its id in the vocabulary generated in Step 2
(e.g., 0001, 0056). topic-i.txt looks like this:
[term1] [term2] ... [termN]
[term1] [term2] ... [termN]
...
See samples: topic-0.txt, topic-1.txt, topic-2.txt, topic-3.txt, topic-4.txt.
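The partitioning above can be sketched as follows (function and file names are illustrative; term ids are kept exactly as they appear in word-assignments.dat):

```python
def split_line(line, n_topics=5):
    """Map topic k -> list of term ids assigned to topic k on this line."""
    per_topic = {k: [] for k in range(n_topics)}
    for pair in line.split()[1:]:        # skip the leading [M]
        term_id, topic = pair.split(":")
        per_topic[int(topic)].append(term_id)
    return per_topic

def partition(assignments_path="word-assignments.dat", n_topics=5):
    """Write topic-0.txt ... topic-(n_topics-1).txt from the LDA output."""
    buckets = {k: [] for k in range(n_topics)}
    with open(assignments_path, encoding="utf-8") as f:
        for line in f:
            for k, terms in split_line(line, n_topics).items():
                if terms:                # only emit a line if the title has terms in topic k
                    buckets[k].append(" ".join(terms))
    for k, rows in buckets.items():
        with open(f"topic-{k}.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(rows) + "\n")
```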
In this step, the Apriori frequent pattern mining algorithm is implemented. The algorithm is run on the 5
files corresponding to the 5 topics. The output is of the form ([s] (space) [t1] (space) [t2] (space) [t3] (space) ...), where
each term is represented by its term id:
#Support [frequent pattern]
#Support [frequent pattern]
...
The frequent patterns are sorted from high to low by #Support. The output files are to be put into a directory named
patterns. The i-th file is named pattern-i.txt.
The min_sup for a frequent pattern was chosen to be 1% (or 0.01), as it was the largest value that gave a decent
number of patterns per topic (∼ 70 patterns on average) while still producing patterns with at least 3 items (you may choose
your own).
Samples present in patterns/.
In this step, you need to implement an algorithm to mine maximal patterns and closed patterns. You could write the code
based on the output of Step 4, or implement a specific algorithm to mine maximal/closed patterns, such as CLOSET,
MaxMiner, etc.
The output should be of the same form as the output of Step 4. Max patterns go into a max directory; the i-th file is named max-i.txt. Closed patterns go into a closed directory; the i-th file is named closed-i.txt.
View samples for max patterns and closed patterns in max/ and closed/ respectively.
In this implementation, Purity is introduced as one of the ranking measures for phrases. A phrase is pure in topic
t if it is frequent only in documents (here, documents refer to titles) about topic t and not frequent in documents
about other topics. For example, ‘query processing’ is a purer phrase than ‘query’ in the Database topic.
We measure the purity of a pattern by comparing the probability of seeing the phrase in documents about the topic
with the probability of seeing it in documents about other topics. Purity essentially measures the distinctness of a pattern in one topic compared to that in any other topic.
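One common way to formalize this (a hedged reconstruction following the KERT-style purity definition; the exact formula used in this implementation is not reproduced here and may differ) is:

$$\mathrm{purity}(p, t) = \log \frac{f(t, p)}{f(t)} \;-\; \log \max_{t' \neq t} \frac{f(t, p) + f(t', p)}{f(t, t')}$$

where $f(t, p)$ is the number of titles in topic $t$ containing pattern $p$, $f(t)$ is the total number of titles in topic $t$, and $f(t, t')$ is the number of titles in the merged collection of topics $t$ and $t'$.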
The output should be of the form:
Purity [frequent pattern]
Purity [frequent pattern]
...
The frequent patterns are sorted from high to low by a measure that combines Support and Purity (you will need to decide how to combine them). Your output files should be put into a directory named purity. The i-th file is named purity-i.txt.
For this implementation, the patterns were re-ranked by a score that combines Support with Purity, with the purity values first squashed into a bounded range.
You can see the samples in purity/.
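One plausible combination, assuming a sigmoid squash of the (unbounded) purity value weighted by support (this is only an illustrative choice, not the scoring function actually used here), is:

```python
import math

def rank_score(support, purity):
    """Combine support and purity into one ranking score.

    The sigmoid squashes purity (any real number) into (0, 1), so the
    score stays on the scale of support. This is one plausible choice,
    not the only valid combination.
    """
    return support * (1.0 / (1.0 + math.exp(-purity)))
```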
This mapping step helps you to examine the quality of the mined patterns and see if they make sense in terms of forming phrases.
For every file in patterns / max / closed / purity, map the id numbers to terms based on your dictionary. Use the same format, with the only difference that each integer is replaced by a term; name each new file originalFileName.phrase and put it in the same directory as the original file. For example, pattern-0.txt will be mapped to pattern-0.txt.phrase and placed in the patterns directory.
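The mapping for one output line can be sketched as follows (function names are illustrative; the vocabulary is the one from Step 2, where the line number is the term id):

```python
def load_vocab(vocab_path="vocab.txt"):
    """Read vocab.txt into a list; index i is the term with id i."""
    with open(vocab_path, encoding="utf-8") as f:
        return [line.strip() for line in f]

def map_pattern_line(line, vocab):
    """Translate "#Support id1 id2 ..." into "#Support term1 term2 ...".

    int() handles zero-padded ids such as "0002" transparently.
    """
    support, *ids = line.split()
    return " ".join([support] + [vocab[int(i)] for i in ids])
```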
- hw1.py: The controller module - calls the required class methods to perform the steps in the homework. Steps taken are:
  - Pre-processing the papers
  - Partitioning the topics by terms generated by the lda program
  - Mining frequent, maximal, and closed patterns by topic
  - Re-ranking each set of frequent patterns by purity
- modules/constants.py: The module storing the various constants used.
- modules/utils.py: Module containing some utility functions.
- modules/preprocessor.py: The module that takes care of preprocessing the papers data.
- modules/partitioner.py: The module that takes care of partitioning the data generated by the preprocessor and the lda program.
- modules/apriori.py: The module implementing the Apriori algorithm.
- modules/pattern_miner.py: The module that mines the frequent, maximal, and closed patterns for each topic's transaction DB.
  - Mines frequent patterns using the Apriori algorithm
  - Mines maximal patterns
  - Mines closed patterns
- modules/purity.py: Module that re-ranks a set of frequent patterns based on a score generated from their purity and support.
The pseudocode for the Apriori algorithm would be:
set k = 1
get the frequent 1-itemsets from the transaction DB
repeat while the frequent k-itemsets are not empty:
    generate (k+1)-length candidate itemsets from the frequent k-itemsets
    prune candidates that have an infrequent k-subset
    count the remaining candidates against the DB and keep the frequent (k+1)-itemsets
    set k = k + 1
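The pseudocode above can be sketched in Python (a minimal, unoptimized version; transactions are lists of term ids, and min_sup is an absolute count here, so the 1% relative threshold translates to `min_sup = 0.01 * len(transactions)`):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return a dict mapping every frequent itemset (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]
    # frequent 1-itemsets from the transaction DB
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    freq = {s: c for s, c in counts.items() if c >= min_sup}
    all_freq = dict(freq)
    k = 1
    while freq:
        # generate (k+1)-length candidates by joining frequent k-itemsets
        keys = list(freq)
        candidates = set()
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                if len(union) == k + 1:
                    candidates.add(union)
        # prune candidates that have an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # count the survivors against the DB, keep the frequent (k+1)-itemsets
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        freq = {c: n for c, n in counts.items() if n >= min_sup}
        all_freq.update(freq)
        k += 1
    return all_freq
```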
For mining maximal patterns:
set k = 1
set max_patterns = frequent_patterns
repeat while k != max_k:
    for each itemset in the frequent k-itemsets:
        if the itemset is a subset of some frequent (k+1)-itemset:
            delete the itemset from max_patterns
    set k = k + 1
return max_patterns
For mining closed patterns:
set k = 1
set closed_patterns = frequent_patterns
repeat while k != max_k:
    for each itemset in the frequent k-itemsets:
        if the itemset is a subset of some frequent (k+1)-itemset with equal support:
            delete the itemset from closed_patterns
    set k = k + 1
return closed_patterns
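Given the full frequent-pattern dict (itemset -> support, e.g. the output of an Apriori run), the two post-processing steps above reduce to subset checks; a minimal sketch:

```python
def maximal_patterns(freq):
    # keep itemsets with no frequent proper superset
    return {p: s for p, s in freq.items()
            if not any(p < q for q in freq)}

def closed_patterns(freq):
    # keep itemsets with no frequent proper superset of equal support
    return {p: s for p, s in freq.items()
            if not any(p < q and freq[q] == s for q in freq)}
```

For frozensets, `p < q` tests whether p is a proper subset of q, which is exactly the condition in the pseudocode.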