To reproduce the results of the paper:
Download the Sentential RE data if you plan to train EqualProbR; otherwise skip this step:
Download only the following word analogy question-answering datasets. (For review, we have included these files as a tar.gz; just extract it and replace the essential_files and unprocessed_data folders in the home directory with the extracted folders of the same names.)
- sat, u2, u4, bats, google: download and place under the folder unprocessed_data/analogy_data.
- scan: after downloading/cloning the data from Tamara Czinczoll, place test.jsonl and valid.jsonl under the folder unprocessed_data/some_extra/scan/.
- distractor verbal analogy from Jones et al.: download and place under the folder unprocessed_data/analogy_data/osfstorage-archive/.
- google_easy and google_hard were generated from the Google analogy data by Mikolov et al.
- RS: after downloading/cloning the data from ushio-etal-2021, place test.jsonl and valid.jsonl under the folder unprocessed_data/some_extra/RS/.
- ekar data from chen-etal-2022 can be used with the Hugging Face library.
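The review-archive route mentioned above (extract the provided tar.gz and replace the two folders of the same names) can be sketched as follows. The archive name review_files.tar.gz is a stand-in, not the real file name; the sketch builds a synthetic archive first so the replacement steps can be checked end-to-end.

```shell
# Build a synthetic stand-in for the provided review archive
# (replace review_files.tar.gz with the actual tar.gz you received).
mkdir -p pkg/essential_files pkg/unprocessed_data
echo sample > pkg/essential_files/sample.txt
tar -czf review_files.tar.gz -C pkg essential_files unprocessed_data

# The actual replacement steps: remove the old folders, extract the archive.
rm -rf essential_files unprocessed_data
tar -xzf review_files.tar.gz
```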
Download the following checkpoints for some of the word embedding models:
- fastText: extract and place in the folder fasttext.
- word2vec: extract and place it in the folder loves-word2vec.
- GloVe: extract and place in the folder gloves-word2vec.
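A minimal sketch of the layout above: create the folders the code expects, then extract each download into its folder. The archive names in the comments are placeholders (the real download file names are not given here):

```shell
# Create the checkpoint folders named above.
mkdir -p fasttext loves-word2vec gloves-word2vec
# Then extract each archive into place; names below are placeholders:
# unzip fasttext-vectors.zip -d fasttext/            # placeholder name
# unzip word2vec-checkpoint.zip -d loves-word2vec/   # placeholder name
# unzip glove-vectors.zip -d gloves-word2vec/        # placeholder name
```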
pip install git+https://github.com/glassroom/heinsen_routing
pip install transformers
pip3 install torch torchvision
pip install tqdm
Preprocess (general):
- First run the following to create the multiple-choice analogy questions, easy and hard:
python preprocess_word_analogy.py
Preprocess Specific:
- The following data can be preprocessed with the command below:
- datanames=['conceptqa','wordanalogy','semeval_2012']+ ['BLESS','EVALution','CogALexV','ROOT09']
Tokenizer_Names=['bert-large-uncased','gpt1','roberta-large','opt','prophetnet','t5-large']
python main.py --task preprocess --data *dataName* --tokenizer_name tokenizerName
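One way to cover every dataset/tokenizer pair from the two lists above is to enumerate the preprocess commands first; this sketch writes them to commands.txt so they can be reviewed (and then run with, e.g., sh commands.txt):

```shell
# One preprocess command per dataset/tokenizer pair from the lists above.
: > commands.txt
for data in conceptqa wordanalogy semeval_2012 BLESS EVALution CogALexV ROOT09; do
  for tok in bert-large-uncased gpt1 roberta-large opt prophetnet t5-large; do
    echo "python main.py --task preprocess --data $data --tokenizer_name $tok" >> commands.txt
  done
done
```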
- To evaluate the word embedding models on AnalogyQA-Easy or -Hard as in Table 2, the efficient approach is to run the following commands in order:
- First preprocess (change roberta-large to the word embedding model of your choice):
python main.py --task preprocess --data wordanalogy --tokenizer_name roberta-large
- Now, to evaluate (with roberta-large), run the command below. Make sure the following configurations are correct:
ModelName=ModelName_possible_values[0] # Experiments.py lines 266-272
backend_trained=backendtrained_possible_values[-1] # Experiments.py lines 266-272
ModelName=ModelName_possible_values[-1] # Experiments.py line 266
wordanalogy_test_data=['wikidata_easy',] # set to the datasets you want results for; Experiments.py line 305
Command:
python main.py --task train --data wordanalogy --experiment wordanalogy --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- For fastText, and also other models (except trained models and GPT-4.0): first set the variable experiment_name='cat1_solve_analogies' and then run the following. Note: change the data and model in lines 169-172 of Additional_Experiments.py (default is fasttext).
python main.py --task train --data wordanalogy --experiment additional_exp --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- To collect responses from ChatGPT: first set the variable experiment_name='evaluate_gpt' in Experiments.py line 73 and the API key in gpt_util.py line 2019, then run the following:
python main.py --task train --data wordanalogy --experiment additional_exp --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Note 1: ChatGPT can be evaluated on ConceptQA (Easy/Hard), AnalogyQA (Easy/Hard), and common word analogy benchmarks; just select the data you want to evaluate on. The data and model can be changed in Additional_Experiments.py line 170.
- Note 2: To evaluate the responses, use the GPT_eval.ipynb notebook. Sometimes minor adjustments may be needed.
- Training each of the models: Mini-RelBERT, EqualProbR, SameConcept, SimilarOffset
- Mini-RelBERT
- First preprocess (change roberta-large to the word embedding model of your choice):
python main.py --task preprocess --data wikidata --tokenizer_name roberta-large
- Pre-train with sentential RE:
python main.py --task eval --data wikidata --experiment sentential_re_paper --model_to_train rc --tokenizer_name roberta-large
- Now preprocess (change roberta-large to the word embedding model of your choice):
python main.py --task preprocess --data semeval_2012 --tokenizer_name roberta-large
- To train:
python main.py --task train --data semeval_2012 --experiment semeval_2012 --model_to_train rc --tokenizer_name roberta-large
- To evaluate on word analogy (change the evaluation data in Train_Eval.py line 612; default is sat):
python main.py --task eval --data semeval_2012 --experiment semeval_2012 --model_to_train rc --tokenizer_name roberta-large
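The five Mini-RelBERT steps above can also be collected into a single script; this sketch only writes the script out (set -e makes a run stop at the first failing step):

```shell
# Write the Mini-RelBERT pipeline steps above into one script for later use.
cat > minirelbert_pipeline.sh <<'EOF'
set -e
python main.py --task preprocess --data wikidata --tokenizer_name roberta-large
python main.py --task eval --data wikidata --experiment sentential_re_paper --model_to_train rc --tokenizer_name roberta-large
python main.py --task preprocess --data semeval_2012 --tokenizer_name roberta-large
python main.py --task train --data semeval_2012 --experiment semeval_2012 --model_to_train rc --tokenizer_name roberta-large
python main.py --task eval --data semeval_2012 --experiment semeval_2012 --model_to_train rc --tokenizer_name roberta-large
EOF
```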
- SimilarOffset
- First preprocess (change roberta-large to the word embedding model of your choice):
python main.py --task preprocess --data wordanalogy --tokenizer_name roberta-large
- Train and evaluate at once:
ModelName=ModelName_possible_values[0] # Experiments.py lines 266-272
backend_trained=backendtrained_possible_values[0] # Experiments.py lines 266-272
python main.py --task train --data wordanalogy --experiment wordanalogy --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- To evaluate only (change the evaluation data in Train_Eval.py line 612; default is sat):
python main.py --task eval --data wordanalogy --experiment wordanalogy --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- EqualProbR and Lexical Relation Classification (Tables 5 and 6)
- For word analogy, EqualProbR is trained on lexical relation classification datasets (EVALution, and related entities from Wikidata). These instructions also work for the lexical relation classification in Table 5.
data=['BLESS','EVALution','CogALexV','ROOT09', 'wikidata']
- First preprocess (change roberta-large to the word embedding model of your choice):
python main.py --task preprocess --data dataname --tokenizer_name roberta-large
- To train and evaluate on the test set:
python main.py --task train --data dataname --experiment lexical_offset --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- To evaluate EqualProbR on word analogy, run the following; make sure to set:
ModelName=ModelName_possible_values[0] #Experimens.py line 266-272
backend_trained=backendtrained_possible_values[0] #Experimens.py line 266-272
wordanalogy_test_data=['wikidata_easy',] # set to the datasets you want results for; Experiments.py line 305
python main.py --task eval --data dataname --experiment wordanalogy --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Note: all other word analogy datasets are evaluated on the EqualProbR model trained on Wikidata, while EVALutionEasy/Hard are evaluated on the model trained on EVALution.
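To run the preprocess/train pair above for all five datasets, the commands can be enumerated first. This sketch writes them to lexrel_commands.txt for review before running:

```shell
# One preprocess + one train command per lexical-relation dataset listed above.
: > lexrel_commands.txt
for data in BLESS EVALution CogALexV ROOT09 wikidata; do
  echo "python main.py --task preprocess --data $data --tokenizer_name roberta-large" >> lexrel_commands.txt
  echo "python main.py --task train --data $data --experiment lexical_offset --model_to_train wordanalogy_re_model --tokenizer_name roberta-large" >> lexrel_commands.txt
done
```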
- MTCQA: set the specific experiment related to mtcqa in Multichoice_Model.py line 51, and then run the following:
python main.py --task train --data mtcqa --experiment mtcqa --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Pretraining with AnalogyQA-Easy vs. Hard
- Mini-RelBERT: only set args.hard=False (Experiments.py line ), and train as instructed for Mini-RelBERT.
- MTCQA: set it in Multichoice_Model.py line 51.
- Baseline: set the following configurations first:
ModelName=ModelName_possible_values[-1] # Experiments.py lines 266-272
backend_trained=backendtrained_possible_values[-1] # Experiments.py lines 266-272
m='baseline_train' # Experiments.py lines 304-314
wordanalogy_train_data=['wikidata_hard',] # Experiments.py lines 304-314
- EqualOffset: for the baseline model, set ModelName='sentence_route' (Experiments.py line 245) and set args.similarity_measure='offset' (Experiments.py line 270).
- Bayesian Analysis of word semantics and word relations
- First set experiment_name='cat2_basian_analysis' in Experiments.py line 73, then run the following:
python main.py --task train --data wordanalogy --experiment additional_exp --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Various Permutations of Word Analogy
- First set experiment_name='cat1_plot_permutations_dist' in Experiments.py line 73, then run the following:
python main.py --task train --data wordanalogy --experiment additional_exp --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Inter-agreement between models
- First set experiment_name='cat1_interagreement_between_models' in Experiments.py line 73, then run the following:
python main.py --task train --data wordanalogy --experiment additional_exp --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Lexical Relation Classification with Sentence Representation
- After preprocessing, to train and evaluate on the test set:
python main.py --task train --data dataname --experiment sentential_re_paper --model_to_train wordanalogy_re_model --tokenizer_name roberta-large
- Clusters of words in the Appendix
- After setting experiment_name='cat2_pca_top_eignen_vecotr_concept_based' in Experiments.py line 74, run the following:
python main.py --task train --data google --experiment additional_exp --model_to_train wordanalogy_re_model fasttext
We used the code for the loss function from [RelBert](https://github.com/asahi417/relbert) for Mini-RelBERT. For plots, we used Gemini and ChatGPT to provide us with templates, and then modified them according to our needs.