bahramiramazan/offset_similarity

Offset Similarity in Semantically Similar Words

To reproduce the results for the paper:

Dataset

Download the Sentential RE data if you plan to train EqualProbR; otherwise skip this step.

Download the following word analogy question-answering datasets. (For review we have included these files as tar.gz archives; extract them and replace the essential_files and unprocessed_data folders in the home directory with the extracted folders of the same names.)
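The extract-and-replace step above can be sketched as follows. The archive and file names here are made up so the snippet runs anywhere; substitute the actual tar.gz files shipped with the submission.

```shell
# Build a dummy archive standing in for the shipped essential_files.tar.gz
# (hypothetical contents), purely to demonstrate the replace step.
mkdir -p demo/essential_files
echo "placeholder" > demo/essential_files/example.txt
tar -czf demo_archive.tar.gz -C demo essential_files

# Replace any existing folder with the extracted folder of the same name.
rm -rf essential_files
tar -xzf demo_archive.tar.gz
ls essential_files
```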

Download the following checkpoints for some of the word embedding models:

  • fastText: extract and place in the folder fasttext
  • word2vec: extract and place in the folder loves-word2vec
  • GloVe: extract and place in the folder gloves-word2vec

Setup

pip install git+https://github.com/glassroom/heinsen_routing
pip install transformers
pip3 install torch torchvision
pip install tqdm

Running

Preprocessing Analogy

Preprocess General:

  • First run the following to make the multiple-choice analogy questions, easy and hard:
python preprocess_word_analogy.py

Preprocess Specific:

  • The following datasets can be preprocessed with the command below
  • datanames=['conceptqa','wordanalogy','semeval_2012']+ ['BLESS','EVALution','CogALexV','ROOT09']

Tokenizer_Names=['bert-large-uncased','gpt1','roberta-large','opt','prophetnet','t5-large']

python main.py  --task preprocess  --data <dataName>  --tokenizer_name <tokenizerName>
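For completeness, the full grid of preprocessing commands implied by the two lists above can be generated with a short loop (a convenience sketch, not part of the repo):

```python
# Generate every (dataset, tokenizer) preprocessing command from the lists above.
datanames = ['conceptqa', 'wordanalogy', 'semeval_2012',
             'BLESS', 'EVALution', 'CogALexV', 'ROOT09']
tokenizer_names = ['bert-large-uncased', 'gpt1', 'roberta-large',
                   'opt', 'prophetnet', 't5-large']

commands = [
    f"python main.py --task preprocess --data {d} --tokenizer_name {t}"
    for d in datanames
    for t in tokenizer_names
]
print(len(commands))   # 7 datasets x 6 tokenizers = 42 commands
print(commands[0])
```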

Evaluation and Training

Unsupervised Vector Offset (without training)

  1. To evaluate the word embedding models on AnalogyQA-Easy or AnalogyQA-Hard as in Table 2, the most efficient approach is to run the following commands in order:
  • First preprocess (change roberta-large to the word embedding model of your choice):
python main.py  --task preprocess  --data wordanalogy --tokenizer_name roberta-large

  • Now, to evaluate (with roberta-large), run the following command. Make sure these configurations are correct:
 ModelName=ModelName_possible_values[0] # Experiments.py lines 266-272
 backend_trained=backendtrained_possible_values[-1] # Experiments.py lines 266-272
 ModelName=ModelName_possible_values[-1] # Experiments.py line 266
 wordanalogy_test_data=['wikidata_easy',] # set to the datasets you want results for; Experiments.py line 305

Command:

python main.py  --task train  --data wordanalogy  --experiment wordanalogy  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large
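The "vector offset without training" method being evaluated here can be sketched in a few lines: for an analogy a:b :: c:?, pick the vocabulary word whose vector is most similar (by cosine) to b - a + c. The toy embeddings below are made up for illustration; the real runs use the fastText/word2vec/GloVe checkpoints downloaded above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def solve_offset(emb, a, b, c):
    """Answer a:b :: c:? by the nearest word to b - a + c."""
    target = [eb - ea + ec for ea, eb, ec in zip(emb[a], emb[b], emb[c])]
    # Exclude the query words themselves, as is standard for offset solving.
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# Tiny hand-made "embeddings" (hypothetical values).
emb = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [0.2, 0.0, 1.0],
    "queen": [0.2, 1.0, 1.0],
    "apple": [0.0, 0.1, 0.0],
}
print(solve_offset(emb, "man", "woman", "king"))  # -> queen
```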


  2. For fastText, and also the other models (except the trained models and GPT-4.0), first set the variable experiment_name='cat1_solve_analogies', then run the following. Note: change the data and model in Additional_Experiments.py, lines 169-172 (default is fasttext).

python main.py  --task train  --data wordanalogy  --experiment additional_exp  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large

  3. To collect responses from ChatGPT: first set the variable experiment_name='evaluate_gpt' in Experiments.py line 73 and the API key in gpt_util.py line 2019, then run the following:

python main.py  --task train  --data wordanalogy  --experiment additional_exp  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large

  • Note 1: ChatGPT can be evaluated on ConceptQA (Easy/Hard), AnalogyQA (Easy/Hard), and the common word analogy benchmarks; just select the data you want to evaluate on. The data and model can be changed in Additional_Experiments.py line 170.

  • Note 2: To evaluate the responses, use the GPT_eval.ipynb notebook. Sometimes minor adjustments may be needed.



Trained Models

  • Training each of the models: Mini-RelBERT, EqualProbR, SameConcept, SimilarOffset

  1. Mini-RelBERT
  • First preprocess (change roberta-large to the word embedding model of your choice):
python main.py  --task preprocess  --data wikidata --tokenizer_name roberta-large

  • Pre-train with sentential RE:
python main.py  --task eval  --data wikidata  --experiment sentential_re_paper  --model_to_train  rc  --tokenizer_name roberta-large

  • Now preprocess (change roberta-large to the word embedding model of your choice):
python main.py  --task preprocess  --data semeval_2012 --tokenizer_name roberta-large

  • To train:
python main.py  --task train  --data semeval_2012  --experiment semeval_2012  --model_to_train  rc  --tokenizer_name roberta-large

  • To evaluate on word analogy (change the evaluation data in Train_Eval.py line 612; default is sat):
python main.py  --task eval  --data semeval_2012  --experiment semeval_2012  --model_to_train  rc  --tokenizer_name roberta-large


  2. SimilarOffset
  • First preprocess (change roberta-large to the word embedding model of your choice):
python main.py  --task preprocess  --data wordanalogy --tokenizer_name roberta-large

  • Train and evaluate at once:
ModelName=ModelName_possible_values[0] # Experiments.py lines 266-272
backend_trained=backendtrained_possible_values[0] # Experiments.py lines 266-272
python main.py  --task train  --data wordanalogy  --experiment wordanalogy  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large

  • To evaluate only (change the evaluation data in Train_Eval.py line 612; default is sat):
python main.py  --task eval  --data wordanalogy  --experiment wordanalogy  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large


  3. EqualProbR and Lexical Relation Classification (Table 5, Table 6)
  • For EqualProbR on word analogy, we train on lexical relation classification datasets (EVALution, and related entities from Wikidata). These instructions also work for the lexical relation classification in Table 5.

data=['BLESS','EVALution','CogALexV','ROOT09', 'wikidata']

  • First preprocess (change roberta-large to the word embedding model of your choice):
python main.py  --task preprocess  --data dataname --tokenizer_name roberta-large

  • To train and evaluate on the test set:
python main.py  --task train  --data dataname  --experiment lexical_offset  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large

  • To evaluate EqualProbR on word analogy, run the following; make sure to set:
 ModelName=ModelName_possible_values[0] # Experiments.py lines 266-272
 backend_trained=backendtrained_possible_values[0] # Experiments.py lines 266-272
 wordanalogy_test_data=['wikidata_easy',] # set to the datasets you want results for; Experiments.py line 305
python main.py  --task eval  --data dataname  --experiment wordanalogy  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large

  • Note: all other word analogy datasets are evaluated with the EqualProbR model trained on wikidata, while EVALution-Easy/Hard use the model trained on EVALution.

MTCQA model

Set the specific experiment related to MTCQA in Multichoice_Model.py line 51, then run the following:

python main.py  --task train  --data mtcqa  --experiment mtcqa  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large

Plots and Additional Experiments from the Appendix:

  1. Pretraining with AnalogyQA-Easy vs. Hard
  • Mini-RelBERT: only set args.hard=False (in Experiments.py), and train as instructed for Mini-RelBERT
  • MTCQA: set it in Multichoice_Model.py line 51
  • Baseline: set the following configurations first:
 ModelName=ModelName_possible_values[-1] # Experiments.py lines 266-272
 backend_trained=backendtrained_possible_values[-1] # Experiments.py lines 266-272
 m='baseline_train' # Experiments.py lines 304-314
 wordanalogy_train_data=['wikidata_hard',] # Experiments.py lines 304-314
  • EqualOffset: for the baseline model, set ModelName='sentence_route' (Experiments.py line 245) and args.similarity_measure='offset' (Experiments.py line 270).
  2. Bayesian Analysis of Word Semantics and Word Relations
  • First set experiment_name='cat2_basian_analysis' in Experiments.py line 73, then run the following:
python main.py  --task train  --data wordanalogy  --experiment additional_exp  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large


  3. Various Permutations of Word Analogy
  • First set experiment_name='cat1_plot_permutations_dist' in Experiments.py line 73, then run the following:
python main.py  --task train  --data wordanalogy  --experiment additional_exp  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large


  4. Inter-agreement Between Models
  • First set experiment_name='cat1_interagreement_between_models' in Experiments.py line 73, then run the following:

python main.py  --task train  --data wordanalogy  --experiment additional_exp  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large
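A minimal sketch of what an inter-agreement measure between two models can look like: the fraction of questions on which both models pick the same answer. The answer lists below are made up; the repo's 'cat1_interagreement_between_models' experiment computes its own statistic.

```python
def agreement(answers_a, answers_b):
    """Fraction of items on which two models give the same answer."""
    assert len(answers_a) == len(answers_b)
    return sum(a == b for a, b in zip(answers_a, answers_b)) / len(answers_a)

# Hypothetical per-question answers from two models.
model_a = ["queen", "paris", "cats", "ran"]
model_b = ["queen", "rome",  "cats", "ran"]
print(agreement(model_a, model_b))  # -> 0.75
```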

  5. Lexical Relation Classification with Sentence Representation
  • After preprocessing, to train and evaluate on the test set:
python main.py  --task train  --data dataname  --experiment sentential_re_paper  --model_to_train  wordanalogy_re_model  --tokenizer_name roberta-large

  6. Clusters of Words in the Appendix
  • After setting experiment_name='cat2_pca_top_eignen_vecotr_concept_based' in Experiments.py line 74, run the following:
python main.py  --task train  --data google  --experiment additional_exp  --model_to_train  wordanalogy_re_model  --tokenizer_name fasttext
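The kind of projection behind such word-cluster plots can be sketched as below: center the embedding matrix and project it onto its top principal components via SVD. The 6x4 random matrix stands in for real word vectors; this is an illustration of the general technique, not the repo's exact experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))      # 6 toy "word vectors", 4 dimensions each

Xc = X - X.mean(axis=0)          # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

proj = Xc @ Vt[:2].T             # coordinates on the top-2 principal components
print(proj.shape)                # -> (6, 2), ready for a 2D scatter plot
```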

Final Note:

  • We used the code for the loss function from [RelBert](https://github.com/asahi417/relbert) for Mini-RelBERT.
  • For plots, we used Gemini and ChatGPT to provide templates, which we then modified to our needs.
