The implementation of Cobweb/4L used in language tasks. This README.md file was last updated on May 14, 2024 by Xin Lian.
The associated ACS-24 paper: to be updated
The process of how Cobweb/4L learns an additional new instance after learning 6 instances. The process is the same as the one in Cobweb (but here in particular, the information-theoretic category utility is used to evaluate the operation at each traversed concept node).

- Install the `cobweb` Python package here.
- Install the spaCy package and download the tokenizer `en_core_web_sm`.
- To compare with Word2Vec (CBOW), make sure you have installed PyTorch as well.
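For instance, the spaCy and PyTorch setup can be done as follows (the `cobweb` package itself should be installed following the instructions at the link above):

```
pip install spacy
python -m spacy download en_core_web_sm
pip install torch
```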
To see how the data here (Sherlock Holmes stories from the Microsoft Research Sentence Completion Challenge) is used, see here.
Run the script `./cobweb-tr.py`:

```
python3 cobweb-tr.py
```
There are some predefined global variables for configuration (a sketch illustrating `window` follows the list):
- `window`: The window size of an instance (so the number of context words included in an instance is 2 times `window`).
- `least_frequency`: The filter-out frequency threshold. Words that appear fewer than `least_frequency` times will be filtered out when generating tokens from the raw story data.
- `seed`: Random seed, used in shuffling.
- `keep_stop_word`: Whether to keep the stop words (as identified by the tokenizer) in the story tokens.
- `stop_word_as_anchor`: Whether to keep the generated instances whose anchor words are stop words.
- `dummy`: If `True`, train on only 2 of the Holmes stories, in the folder `./data-tr-dummy` instead of `./data-tr`.
- `preview`: Whether to preview the generated tokens and instances.
- `start_instance_id`: The index of the first considered instance. If none is specified, just keep it `None`.
- `end_instance_id`: The index of the last considered instance. If none is specified, just keep it `None`.
- `train_portion`: The portion of the considered instances included in the training set (from 0 to 1).
- `have_test_portion`: If `True`, introduce a test set containing the last `(1 - train_portion)` portion of the instances.
- `n_tr_splits`: The number of training splits in the training set.
- `split_size`: The size of each training split.
Note that `n_tr_splits` takes priority over `split_size`: `split_size` is considered only when `n_tr_splits = None`.
- `start_split`: The index of the split that training starts with.
- `load_model`: If `True`, initialize the model with the pretrained model file specified by `load_model_file`; otherwise start from scratch.
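As a rough illustration of how `window` shapes an instance, here is a minimal sketch (not the repository's actual code) that pairs each anchor word with up to 2 × `window` surrounding context words:

```python
# Minimal sketch (not the repository's actual code) of window-based
# instance generation: each instance pairs an anchor word with up to
# 2 * window surrounding context words.
def build_instances(tokens, window=2):
    instances = []
    for i, anchor in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        instances.append({"anchor": anchor, "context": left + right})
    return instances

tokens = ["sherlock", "holmes", "sat", "in", "his", "armchair"]
for inst in build_instances(tokens, window=2):
    print(inst["anchor"], "->", inst["context"])
```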
Run the script `./cobweb-te.py` with the specified parameters:

```
python3 cobweb-te.py --nodes=1000 --split=0
```
where `nodes` specifies the number of expanded nodes used in prediction, and `split` specifies the index of the tested split (checkpoint/training split), which determines the loaded model file name and the output stats summary file name.
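For example, to evaluate several checkpoints in sequence (hypothetical split indices; adjust to your number of training splits):

```
for split in 0 1 2 3; do
    python3 cobweb-te.py --nodes=1000 --split=$split
done
```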
There are some predefined global variables for configuration (a sketch of how `used_cores` and `default_answer` might be applied follows the list):
- `window`: The window size of an instance (generated from the testing sentences).
- `seed`: The random seed used for shuffling.
- `used_cores`: The number of CPU cores used in parallel prediction.
- `keep_stop_word`: Whether to keep the stop words (as identified by the tokenizer) in the testing data.
- `stop_word_as_anchor`: Whether to keep the generated instances whose anchor words are stop words in the testing data.
- `default_answer`: The default trivial answer for a prediction when Cobweb cannot reach an answer. Select one of: `'a'`, `'b'`, `'c'`, `'d'`, `'e'`.
- `load_model_file`: The model file path name for prediction.
- `output_file`: The output stats summary CSV file path name.
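A minimal sketch of how `used_cores` and `default_answer` might interact (assumed structure with a placeholder predictor; the script's internals may differ):

```python
# Illustrative only: parallel prediction over test instances with a
# fallback to the default trivial answer. A real predictor would query
# the trained Cobweb tree; here it is a placeholder returning None.
from multiprocessing import Pool

used_cores = 4
default_answer = 'a'

def predict(instance):
    return None  # placeholder: a real predictor returns 'a'..'e' or None

if __name__ == "__main__":
    instances = [{"id": i} for i in range(10)]
    with Pool(used_cores) as pool:
        answers = pool.map(predict, instances)
    # Fall back to the default trivial answer when no answer is reached.
    answers = [a if a is not None else default_answer for a in answers]
    print(answers)
```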
Run

```
python3 cbow-tr.py
```
Many of the predefined global variables are the same as the ones in the Cobweb implementation. Some additional ones (see the CBOW sketch after this list):
- `max_vocab_size`: The maximum vocabulary size of the CBOW model (in its vector space).
- `embedding_dim`: The length of the vector representation of each word.
- `batch_size`: The batch size.
- `epochs`: The number of epochs trained.
- `use_cuda`: If `True`, set the device to CUDA if available; otherwise use the CPU.
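For reference, a minimal CBOW sketch in PyTorch (illustrative only, not the repository's model) shows where `max_vocab_size` and `embedding_dim` enter:

```python
import torch
import torch.nn as nn

# Minimal CBOW sketch: average the context-word embeddings and predict
# the anchor word with a linear layer over the vocabulary.
class CBOW(nn.Module):
    def __init__(self, max_vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(max_vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, max_vocab_size)

    def forward(self, context_ids):            # shape: (batch, 2 * window)
        embedded = self.embeddings(context_ids).mean(dim=1)
        return self.linear(embedded)            # logits over the vocabulary

model = CBOW(max_vocab_size=10000, embedding_dim=100)
logits = model(torch.randint(0, 10000, (8, 10)))  # a batch of 8 contexts
```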
Run

```
python3 cbow-te.py
```

with the global variables specified in the script.
