roncewind/WordEmbedding

WordEmbedding

Place to tinker with transformers to create embeddings. In particular, I'd like to see if we can group names of domesticated animals in different languages together.

For example,

Take the groups, "dog, chien, perra, σκύλος, イヌ" vs "rabbit, lapin, conejo, κουνέλι, ウサギ"

We'd like the similarity scores between names within a group to be higher than the scores between names from different groups. Let's assume sim() is a similarity function, say cosine similarity, for this example:

sim(dog, perra) closer to 1 and sim(dog, rabbit) closer to -1
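
As a minimal sketch of what sim() could look like, here is cosine similarity in plain Python. The three-dimensional vectors are made up for illustration; they are not real embeddings from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings", for illustration only.
toy = {
    "dog": [0.9, 0.1, 0.0],
    "perra": [0.8, 0.2, 0.1],
    "rabbit": [-0.7, 0.6, 0.1],
}

same_group = cosine_similarity(toy["dog"], toy["perra"])
other_group = cosine_similarity(toy["dog"], toy["rabbit"])
print(f"sim(dog, perra)  = {same_group:.3f}")
print(f"sim(dog, rabbit) = {other_group:.3f}")
```

With these toy vectors the first score is close to 1 and the second is negative, which is the ordering we hope a trained model produces.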

Use a wikidata extract to create a list of types of pets

Note: the download is large and takes quite some time, so it's best to download from a dated directory

See https://dumps.wikimedia.org/wikidatawiki/entities/ for a list of what's available, then download with curl or a similar tool. Choose the .bz2 file; it's smaller.

For example:

curl --retry 9 -C - -L -R -O https://dumps.wikimedia.org/wikidatawiki/entities/20250707/wikidata-20250707-all.json.bz2

Run the extractor to build a data set (optional)

You can always build your own dataset. We're just using wikidata since it's fairly easy to get hold of and try out.

It seems to be faster if decompression is done outside of Python, though the Python program will decompress on its own if passed a compressed file (.gz, .bz2). It also accepts a piped-in file.

pbzip2 -d -c -m200 ~/data/wikidata-20250526-all.json.bz2 | python extract_wikidata_pets.py -i - -o data/20250526_pets_wikidata.csv 2> 20250526_err.out
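
For readers writing their own extractor, here is a rough sketch of how a script might accept a plain file, a .gz or .bz2 file, or piped input. The Wikidata JSON dump is one large array with one entity per line, so it can be parsed line by line. The helper names open_dump and iter_entities are hypothetical and are not taken from extract_wikidata_pets.py.

```python
import bz2
import gzip
import json
import sys


def open_dump(path):
    """Open a dump as text; "-" means read from stdin (hypothetical helper)."""
    if path == "-":
        return sys.stdin
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def iter_entities(lines):
    """Yield one entity dict per line, skipping the array brackets."""
    for line in lines:
        line = line.strip().rstrip(",")  # entity lines end with a comma
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


# Tiny in-memory stand-in for a dump, for illustration only.
sample_lines = ["[\n", '{"id": "Q144"},\n', '{"id": "Q1"}\n', "]\n"]
entity_ids = [entity["id"] for entity in iter_entities(sample_lines)]
print(entity_ids)
```

On a real dump, something like iter_entities(open_dump("-")) would consume the stream piped in by the pbzip2 command above.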

The extractor outputs a CSV file with the following columns:

id,canonical,language,name

  • id is the wikidata id, like Q144.
  • canonical is a canonical name that's easier for me to read than the id or, sometimes, the name column.
  • language is a language code as reported by wikidata.
  • name is the name in the specified language, or at least as reported by wikidata.
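
A minimal sketch of reading such a file back into per-id groups is shown below. It assumes the CSV starts with a header row matching the column names above. The sample rows are made up, and Q0 is a placeholder id rather than a real wikidata item.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; "Q0" is a placeholder id, not a real item.
sample_csv = """id,canonical,language,name
Q144,dog,en,dog
Q144,dog,fr,chien
Q144,dog,es,perra
Q0,rabbit,en,rabbit
Q0,rabbit,fr,lapin
"""


def load_groups(csv_file):
    """Map each id to its list of (language, name) pairs."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["id"]].append((row["language"], row["name"]))
    return dict(groups)


groups = load_groups(io.StringIO(sample_csv))
for entity_id, names in groups.items():
    print(entity_id, names)
```

For a real extract, pass an open file handle (opened with newline="" as the csv module recommends) instead of the io.StringIO object.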

extract_wikidata_pets.py is just a starter for filtering data from the wikidata extract. There are a couple of # TUNE HERE comments for the following:

  1. The INSTANCE_OF_WHITELIST list can be tuned if you want to extract something other than subclasses of what's in that list.
  2. The script does not use the "instance of" (P31) property in this case; it uses "subclass of" (P279) instead. It turns out that domesticated animals are modeled as "subclass of", not "instance of". So, if what you're looking to train on is an "instance of" something, you'll want to change this... or rewrite the entity selection criteria altogether.
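
To illustrate the P279 versus P31 choice, here is a sketch of a selection check against the standard wikidata entity JSON layout (claims, then property id, then mainsnak, datavalue, value, id). The function names, the whitelist contents, and the toy entity are hypothetical; the actual logic in extract_wikidata_pets.py may differ. Note that this only looks at direct "subclass of" statements, not transitive ones.

```python
def claim_targets(entity, prop):
    """Return the item ids that `entity` points to via property `prop`."""
    targets = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "somevalue" / "novalue" snaks
        value = snak.get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "id" in value:
            targets.append(value["id"])
    return targets


def is_selected(entity, whitelist, prop="P279"):
    """True if the entity has a direct `prop` statement into the whitelist."""
    return any(target in whitelist for target in claim_targets(entity, prop))


# Placeholder whitelist and toy entity, for illustration only.
whitelist = {"Q0"}
toy_entity = {
    "id": "Q144",
    "claims": {
        "P279": [
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"value": {"id": "Q0"}}}}
        ]
    },
}

print(is_selected(toy_entity, whitelist))              # uses "subclass of"
print(is_selected(toy_entity, whitelist, prop="P31"))  # uses "instance of"
```

Switching to "instance of" is then a matter of passing prop="P31", which is roughly the kind of change the second TUNE HERE item describes.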

Analyze the extracted data (optional)

Use analyze_wikidata_extract.py to get some information about your dataset.

For example:

python analyze_wikidata_extract.py --csv_path data/20250707_pets_wikidata.csv --min_large 50 --min_small 2
  • min_large denotes the number of examples a group must have to be considered a "large" group
  • min_small denotes the number of examples a group must have to be considered a "small" group

The script counts the number of large, small, and total groups. These counts are useful when training your model, since they help in defining the training, validation, and test data sets.
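
The counting can be sketched roughly as follows. This assumes a group is the set of rows sharing an id, that "large" means at least min_large rows, and that "small" means at least min_small rows but fewer than min_large. Those definitions are a guess at the intent, not a description of what analyze_wikidata_extract.py actually does.

```python
from collections import Counter


def summarize_groups(ids, min_large, min_small):
    """Count large, small, and total groups from a sequence of row ids."""
    sizes = Counter(ids)
    large = sum(1 for n in sizes.values() if n >= min_large)
    small = sum(1 for n in sizes.values() if min_small <= n < min_large)
    return {"large": large, "small": small, "total": len(sizes)}


# Toy ids: group "a" has 5 rows, "b" has 2 rows, "c" has 1 row.
toy_ids = ["a"] * 5 + ["b"] * 2 + ["c"]
summary = summarize_groups(toy_ids, min_large=4, min_small=2)
print(summary)
```

Groups that fall below min_small (like "c" here) count toward the total only, which hints at how many groups may be too sparse to split across training, validation, and test sets.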

Train a new transformer

Use train_sbert_contrastive.py to fine-tune a sentence transformer to group the names of pets across different languages.
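
Contrastive fine-tuning generally needs pairs of texts that should land close together. As a hedged sketch of that preparation step, the snippet below builds positive pairs from groups of names. The function name and toy groups are hypothetical, and this is not taken from train_sbert_contrastive.py.

```python
from itertools import combinations


def positive_pairs(groups):
    """Build every within-group pair of distinct names."""
    pairs = []
    for names in groups.values():
        unique = sorted(set(names))  # drop duplicates, keep order stable
        pairs.extend(combinations(unique, 2))
    return pairs


# Toy groups keyed by a readable label, for illustration only.
toy_groups = {
    "dog": ["dog", "chien", "perra"],
    "rabbit": ["rabbit", "lapin"],
}

pairs = positive_pairs(toy_groups)
for left, right in pairs:
    print(left, "<->", right)
```

Pairs like these could be fed to a pair-based loss. The sentence-transformers library's MultipleNegativesRankingLoss, for example, treats the other pairs in a batch as negatives, though whether the training script uses that loss is not stated here.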
