Scripts to process the jst data into a more meaningful and readable format
-
Data must be first collected, processed and analysed by the following scripts listed in order:
Follow the instructions of the above mentioned packages to run them.
A sample config file is available.
Create a data directory for all input files and a results directory for the output.
The summarisation takes five inputs: four text files (documentThetha, documentPi, topicWords and topicSentences) and a folder of raw tweets.
The four text files are copies of the jst final results (final.thetha, final.pi, final.twords and final.topSentences), saved in txt format under the names listed above. The names are set only in the config file, but keeping the same names saves time because the config file then needs no changes.
Note: do not change the extension of the jst files themselves, as this ruins their formatting. Instead, copy the contents and paste them into a new file.
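The copy step can also be scripted. The following is only a minimal sketch, not part of the package: the function name `copy_jst_results`, the directory arguments and the `.txt` suffix on the target names are assumptions, and it presumes that a byte-for-byte copy into a new file is equivalent to copying and pasting the contents by hand. The jst files themselves are left untouched.

```python
import shutil
from pathlib import Path

# jst result file -> input name used by the summarisation.
# The ".txt" suffix is an assumption based on "saved in txt format".
JST_TO_INPUT = {
    "final.thetha": "documentThetha.txt",
    "final.pi": "documentPi.txt",
    "final.twords": "topicWords.txt",
    "final.topSentences": "topicSentences.txt",
}


def copy_jst_results(jst_dir, data_dir):
    """Copy the four jst result files into data_dir under their new names."""
    jst_dir, data_dir = Path(jst_dir), Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)  # create the data directory if needed
    copied = []
    for src_name, dst_name in JST_TO_INPUT.items():
        src = jst_dir / src_name
        if not src.is_file():
            raise FileNotFoundError(f"missing jst result file: {src}")
        dst = data_dir / dst_name
        shutil.copyfile(src, dst)  # new file with the same contents; original is not renamed
        copied.append(dst)
    return copied
```

If different names are used, the config file has to be updated to match, as described above.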
The raw-tweets folder is a subdirectory of the data folder and contains the raw texts generated by the raw-tweets.py script in pyMysql.
Open a console and navigate to the directory of the summarisation-scripts package. Once there, type:
summary_of_topics.py
Note: after each run of the script, copy the result files to a different directory or give them a relevant name. The code cannot currently generate useful names, so each run simply overwrites the previous result files.
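One way to avoid losing results is to copy the whole results directory into a time-stamped folder after every run. This is a sketch under assumptions rather than a feature of the package: `archive_results`, its arguments and the folder naming scheme are all hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_results(results_dir, archive_root, label=""):
    """Copy results_dir into a new time-stamped folder under archive_root."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    name = f"{stamp}-{label}" if label else stamp
    target = Path(archive_root) / name
    # copytree creates the target (and missing parents) and raises
    # FileExistsError if a folder with this name already exists.
    shutil.copytree(Path(results_dir), target)
    return target
```

Calling `archive_results("results", "archive", label="run1")` right after the script finishes would keep that run's files safe before the next run overwrites them. The directory names and the label here are placeholders.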
- json file to be used for visualisation of summaries in html
- text file of all topic summaries
- spreadsheet of all topic summaries, ordered by topic importance
- spreadsheet of all topics, ordered by importance
- csv file of all topics