To reproduce the results of the paper:
Download the Sentential RE data if you plan to train EqualProbR; otherwise skip this step:
Download only the following word analogy question-answering datasets. (For review, we have included these files as a tar.gz; just extract it and replace the essential_files and unprocessed_data folders in the home directory with the extracted folders of the same names.)
- sat, u2, u4, bats, google: download and place under the folder unprocessed_data/analogy_data.
- scan: after downloading/cloning the data from Tamara Czinczoll, place test.jsonl and valid.jsonl under the folder unprocessed_data/some_extra/scan/.
- distractor verbal analogy from Jones et al.: download and place under the folder unprocessed_data/analogy_data/osfstorage-archive/.
- google_easy and google_hard were generated from the Google analogy data by Mikolov et al.
- RS: after downloading/cloning the data from ushio-etal-2021, place test.jsonl and valid.jsonl under the folder unprocessed_data/some_extra/RS/.
- ekar data from chen-etal-2022 can be used with the Hugging Face library.
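The review-archive route mentioned above (extract the provided tar.gz and replace the two folders of the same names) can be sketched as follows. The archive name review_files.tar.gz is a stand-in, not the real file name; the sketch builds a synthetic archive first so the replacement steps can be checked end-to-end.

```shell
# Build a synthetic stand-in for the provided review archive
# (replace review_files.tar.gz with the actual tar.gz you received).
mkdir -p pkg/essential_files pkg/unprocessed_data
echo sample > pkg/essential_files/sample.txt
tar -czf review_files.tar.gz -C pkg essential_files unprocessed_data

# The actual replacement steps: remove the old folders, extract the archive.
rm -rf essential_files unprocessed_data
tar -xzf review_files.tar.gz
```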
Download the following checkpoints for some of the word embedding models:
- fastText: extract and place in the folder fasttext.
- word2vec: extract and place it in the folder loves-word2vec.
- GloVe: extract and place in the folder gloves-word2vec.
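A minimal sketch of the layout above: create the folders the code expects, then extract each download into its folder. The archive names in the comments are placeholders (the real download file names are not given here):

```shell
# Create the checkpoint folders named above.
mkdir -p fasttext loves-word2vec gloves-word2vec
# Then extract each archive into place; names below are placeholders:
# unzip fasttext-vectors.zip -d fasttext/            # placeholder name
# unzip word2vec-checkpoint.zip -d loves-word2vec/   # placeholder name
# unzip glove-vectors.zip -d gloves-word2vec/        # placeholder name
```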
pip install git+https://github.com/glassroom/heinsen_routing
pip install transformers
pip3 install torch torchvision
pip install tqdm
Preprocess (general):
- First run the following to create the multiple-choice analogy questions, easy and hard:
python preprocess_word_analogy.py
Preprocess Specific:
- The following data can be preprocessed with the command below:
- datanames=['conceptqa','wordanalogy','semeval_2012']+ ['BLESS','EVALution','CogALexV','ROOT09']
Tokenizer_Names=['bert-large-uncased','gpt1','roberta-large','opt','prophetnet','t5-large']
python main.py --task preprocess --data *dataName* --tokenizer_name tokenizerName
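One way to cover every dataset/tokenizer pair from the two lists above is to enumerate the preprocess commands first; this sketch writes them to commands.txt so they can be reviewed (and then run with, e.g., sh commands.txt):

```shell
# One preprocess command per dataset/tokenizer pair from the lists above.
: > commands.txt
for data in conceptqa wordanalogy semeval_2012 BLESS EVALution CogALexV ROOT09; do
  for tok in bert-large-uncased gpt1 roberta-large opt prophetnet t5-large; do
    echo "python main.py --task preprocess --data $data --tokenizer_name $tok" >> commands.txt
  done
done
```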
- To evaluate the word embedding models on AnalogyQA-Easy or -Hard as in Table 2, the efficient approach is to run the following commands in order:
- First preprocess (change roberta-large to the word embedding model of your choice):
python main.py --task preprocess --data wordanalogy --tokenizer_name roberta-large
- Now, to evaluate (with roberta-large), run the command below. Make sure the following configurations are correct:
ModelName=ModelName_possible_values[0] # Experiments.py lines 266-272
backend_trained=backendtrained_possible_values[-1] # Experiments.py lines 266-272
ModelName=ModelName_possible_values[-1] # Experiments.py line 266
wordanalogy_test_data=['wikidata_easy',] # set to the datasets you want results for; Experiments.py line 305
Command:
python main.py --task train --data wordanalogy --experiment wordanalogy --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- For fastText, and also other models (except trained models and GPT-4.0): first set the variable experiment_name='cat1_solve_analogies' and then run the following. Note: change the data and model in lines 169-172 of Additional_Experiments.py (default is fasttext).
python main.py --task train --data wordanalogy --experiment additional_exp --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- To collect responses from ChatGPT: first set the variable experiment_name='evaluate_gpt' in Experiments.py line 73 and the API key in gpt_util.py line 2019, then run the following:
python main.py --task train --data wordanalogy --experiment additional_exp --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Note 1: ChatGPT can be evaluated on ConceptQA (Easy/Hard), AnalogyQA (Easy/Hard), and common word analogy benchmarks; just select the data you want to evaluate on. The data and model can be changed in Additional_Experiments.py line 170.
- Note 2: To evaluate the responses, use the GPT_eval.ipynb notebook. Sometimes minor adjustments may be needed.
- Training each of the models: Mini-RelBERT, EqualProbR, SameConcept, SimilarOffset
- Mini-RelBERT
- First preprocess (change roberta-large to the word embedding model of your choice):
python main.py --task preprocess --data wikidata --tokenizer_name roberta-large
- Pre-train with sentential RE:
python main.py --task eval --data wikidata --experiment sentential_re_paper --model_to_train rc --tokenizer_name roberta-large
- Now preprocess (change roberta-large to the word embedding model of your choice):
python main.py --task preprocess --data semeval_2012 --tokenizer_name roberta-large
- To train:
python main.py --task train --data semeval_2012 --experiment semeval_2012 --model_to_train rc --tokenizer_name roberta-large
- To evaluate on word analogy (change the evaluation data in Train_Eval.py line 612; default is sat):
python main.py --task eval --data semeval_2012 --experiment semeval_2012 --model_to_train rc --tokenizer_name roberta-large
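The five Mini-RelBERT steps above can also be collected into a single script; this sketch only writes the script out (set -e makes a run stop at the first failing step):

```shell
# Write the Mini-RelBERT pipeline steps above into one script for later use.
cat > minirelbert_pipeline.sh <<'EOF'
set -e
python main.py --task preprocess --data wikidata --tokenizer_name roberta-large
python main.py --task eval --data wikidata --experiment sentential_re_paper --model_to_train rc --tokenizer_name roberta-large
python main.py --task preprocess --data semeval_2012 --tokenizer_name roberta-large
python main.py --task train --data semeval_2012 --experiment semeval_2012 --model_to_train rc --tokenizer_name roberta-large
python main.py --task eval --data semeval_2012 --experiment semeval_2012 --model_to_train rc --tokenizer_name roberta-large
EOF
```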
- SimilarOffset
- First preprocess (change roberta-large to the word embedding model of your choice):
python main.py --task preprocess --data wordanalogy --tokenizer_name roberta-large
- Train and evaluate at once:
ModelName=ModelName_possible_values[0] # Experiments.py lines 266-272
backend_trained=backendtrained_possible_values[0] # Experiments.py lines 266-272
python main.py --task train --data wordanalogy --experiment wordanalogy --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- To evaluate only (change the evaluation data in Train_Eval.py line 612; default is sat):
python main.py --task eval --data wordanalogy --experiment wordanalogy --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- EqualProbR and Lexical Relation Classification (Tables 5 and 6)
- For word analogy, EqualProbR is trained on lexical relation classification datasets (EVALution, and related entities from Wikidata). These instructions also work for the lexical relation classification in Table 5.
data=['BLESS','EVALution','CogALexV','ROOT09', 'wikidata']
- First preprocess (change roberta-large to the word embedding model of your choice):
python main.py --task preprocess --data dataname --tokenizer_name roberta-large
- To train and evaluate on the test set:
python main.py --task train --data dataname --experiment lexical_offset --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- To evaluate EqualProbR on word analogy, run the following; make sure to set:
ModelName=ModelName_possible_values[0] #Experimens.py line 266-272
backend_trained=backendtrained_possible_values[0] #Experimens.py line 266-272
wordanalogy_test_data=['wikidata_easy',] # set to the datasets you want results for; Experiments.py line 305
python main.py --task eval --data dataname --experiment wordanalogy --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Note: all other word analogy datasets are evaluated on the EqualProbR model trained on Wikidata, while EVALutionEasy/Hard are evaluated on the model trained on EVALution.
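To run the preprocess/train pair above for all five datasets, the commands can be enumerated first. This sketch writes them to lexrel_commands.txt for review before running:

```shell
# One preprocess + one train command per lexical-relation dataset listed above.
: > lexrel_commands.txt
for data in BLESS EVALution CogALexV ROOT09 wikidata; do
  echo "python main.py --task preprocess --data $data --tokenizer_name roberta-large" >> lexrel_commands.txt
  echo "python main.py --task train --data $data --experiment lexical_offset --model_to_train wordanalogy_re_model --tokenizer_name roberta-large" >> lexrel_commands.txt
done
```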
- MTCQA: set the specific experiment related to mtcqa in Multichoice_Model.py line 51, and then run the following:
python main.py --task train --data mtcqa --experiment mtcqa --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Pretraining with AnalogyQA-Easy vs. Hard
- Mini-RelBERT: only set args.hard=False (Experiments.py line ), and train as instructed for Mini-RelBERT.
- MTCQA: set it in Multichoice_Model.py line 51.
- Baseline: set the following configurations first:
ModelName=ModelName_possible_values[-1] # Experiments.py lines 266-272
backend_trained=backendtrained_possible_values[-1] # Experiments.py lines 266-272
m='baseline_train' # Experiments.py lines 304-314
wordanalogy_train_data=['wikidata_hard',] # Experiments.py lines 304-314
- EqualOffset: for the baseline model, set ModelName='sentence_route' (Experiments.py line 245) and set args.similarity_measure='offset' (Experiments.py line 270).
- Bayesian Analysis of word semantics and word relations
- First set experiment_name='cat2_basian_analysis' in Experiments.py line 73, then run the following:
python main.py --task train --data wordanalogy --experiment additional_exp --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Various Permutations of Word Analogy
- First set experiment_name='cat1_plot_permutations_dist' in Experiments.py line 73, then run the following:
python main.py --task train --data wordanalogy --experiment additional_exp --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Inter-agreement between models
- First set experiment_name='cat1_interagreement_between_models' in Experiments.py line 73, then run the following:
python main.py --task train --data wordanalogy --experiment additional_exp --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Lexical Relation Classification with Sentence Representation
- After preprocessing, to train and evaluate on the test set:
python main.py --task train --data dataname --experiment sentential_re_paper --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Clusters of words in the Appendix
- After setting experiment_name='cat2_pca_top_eignen_vecotr_concept_based' in Experiments.py line 74, run the following:
python main.py --task train --data google --experiment additional_exp --model_to_train wordanalogy_re_model fasttext
We used the code for the loss function from [RelBert](https://github.com/asahi417/relbert) for Mini-RelBERT. For plots, we used Gemini and ChatGPT to provide us with templates, and then modified them according to our needs.