Code and dataset for the paper "Mining fine-grained opinions on closed captions of YouTube videos with an attention-RNN". If you use this code or dataset, please cite the paper.
1. Clone this repository: `cd ~; git clone https://github.com/epochx/opinatt`
2. Install the required libraries:
   - Install Theano and TensorFlow: `pip install Theano tensorflow-gpu`
   - Download and install Senna: http://ronan.collobert.com/senna/
   - Download and install CoreNLP 3.6: http://stanfordnlp.github.io/CoreNLP/history.html
3. Go to the opinatt directory (`cd ~/opinatt`) and download the required data there:
   - Download the GoogleNews embeddings from https://code.google.com/archive/p/word2vec/
   - Download the WikiDeps embeddings from https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
   - Download and unzip the SemEval 2014 V2 Train data from http://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools
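Word embeddings such as WikiDeps are commonly distributed as plain-text word2vec files, one `word v1 v2 …` line per word (the GoogleNews vectors use the binary variant instead). As a rough sketch only — not code from this repository — a minimal loader for the text format might look like this, demonstrated on toy in-memory data:

```python
import io

def load_text_embeddings(fileobj):
    """Parse word2vec-style text embeddings: one 'word v1 v2 ...' line per word."""
    vectors = {}
    for line in fileobj:
        parts = line.rstrip().split()
        if len(parts) < 2:
            continue  # skip blank lines and the optional "count dim" header line
        word, values = parts[0], [float(v) for v in parts[1:]]
        vectors[word] = values
    return vectors

# Toy data standing in for a real embeddings file
toy = io.StringIO("screen 0.1 0.2 0.3\nbattery -0.4 0.0 0.9\n")
emb = load_text_embeddings(toy)
print(len(emb), len(emb["screen"]))  # -> 2 3
```

For the binary GoogleNews file, a library loader (e.g. gensim's `KeyedVectors.load_word2vec_format` with `binary=True`) is the usual route.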
4. Create the working environment:
   - Create the data folder structure: `prepare_data.sh path_where_to_create_data_folder`
   - Modify `opinatt/enlp/settings.py` accordingly
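The actual variable names in `opinatt/enlp/settings.py` depend on the repository version, so the sketch below is purely illustrative (every name in it is hypothetical): the kind of edit this step asks for is pointing path settings at the folders and tools set up above.

```python
# Illustrative settings sketch only -- the real names in
# opinatt/enlp/settings.py may differ. Point each path at the
# folders/tools prepared in the earlier steps.
import os

DATA_ROOT = os.path.expanduser("~/opinatt_data")    # hypothetical: folder created by prepare_data.sh
SENNA_PATH = os.path.expanduser("~/senna")          # hypothetical: where Senna was unpacked
CORENLP_PATH = os.path.expanduser("~/corenlp-3.6")  # hypothetical: where CoreNLP 3.6 lives
EMBEDDINGS_PATH = os.path.join(DATA_ROOT, "embeddings")
</imports>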
5. Preprocess the corpora by running `python process_corpus.py`
6. Generate the training data in JSON format:
   - For aspect extraction, run `python generate_json.py --json_path path/to/json/file --train LaptopsTrain --test LaptopsTest --embeddings SennaEmbeddings`
   - For aspect extraction and sentiment classification using collapsed tags, run `python generate_json.py --json_path path/to/json/file --train LaptopsTrain --embeddings SennaEmbeddings --sentiment`
   - For joint aspect extraction and sentiment classification, run `python generate_json.py --json_path path/to/json/file --train LaptopsTrain --embeddings SennaEmbeddings --joint`
   - When generating the training data for the Youtubean dataset, or when including sentiment labels for the SemEval corpora, make sure not to pass the `--test` flag, as there are no test sets for those settings; the script will generate the corresponding splits from the training portion in each case. Add the `--strict` flag to only use sentences with a single sentiment.
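The collapsed-tags setting above can be pictured with a small sketch. Note this is an illustration, not the repository's code, and the exact tag vocabulary used by `generate_json.py` is an assumption: a collapsed scheme fuses the aspect BIO tag and the sentiment into a single label per token, whereas the joint setting keeps aspect and sentiment as two separate prediction targets.

```python
def collapse_tags(bio_tags, sentiments):
    """Fuse aspect BIO tags with per-token sentiment into single labels,
    e.g. ('B', 'positive') -> 'B-POS'. Tag names here are illustrative."""
    short = {"positive": "POS", "negative": "NEG", "neutral": "NEU"}
    collapsed = []
    for tag, sent in zip(bio_tags, sentiments):
        collapsed.append("O" if tag == "O" else f"{tag}-{short[sent]}")
    return collapsed

tokens = ["The", "screen", "is", "great"]
bio = ["O", "B", "O", "O"]
sents = ["neutral", "positive", "neutral", "neutral"]
print(collapse_tags(bio, sents))  # -> ['O', 'B-POS', 'O', 'O']
```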
7. Train the models:
   - For the baseline: `python run_baseline.py --json_path path/to/json/file --results_path path/to/results/folder`
   - For the attention-RNN: `python run.py --json_path path/to/json/file --results_path path/to/results/folder`
For more details on how to use the provided scripts, use the `--help` option.
## Contact
Feel free to email emarrese@weblab.t.u-tokyo.ac.jp for any pertinent questions/bugs regarding the code.