Place to tinker with transformers to create embeddings. In particular, I'd like to see if we can group names of domesticated animals in different languages together.
For example,
Take the groups, "dog, chien, perra, σκύλος, イヌ" vs "rabbit, lapin, conejo, κουνέλι, ウサギ"
We'd like names within a group to score as more similar to each other than to
names in the other group. Assume sim() is a similarity function, say cosine similarity, for this example:
sim(dog, perra) closer to 1 and sim(dog, rabbit) closer to -1
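As a rough illustration of what sim() could look like, here is a minimal cosine similarity sketch in plain Python. The three-dimensional vectors are made up for this example; real embeddings would come from the transformer model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings", for illustration only (not model output).
toy = {
    "dog": [0.9, 0.1, 0.0],
    "perra": [0.8, 0.2, 0.1],
    "rabbit": [-0.7, 0.3, -0.2],
}

sim_same = cosine_similarity(toy["dog"], toy["perra"])
sim_diff = cosine_similarity(toy["dog"], toy["rabbit"])
print(sim_same > sim_diff)
```

With these toy vectors, the same-group pair scores close to 1 and the cross-group pair scores well below 0, which is the pattern we hope a fine-tuned model would produce.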
Download Wikidata from: https://dumps.wikimedia.org/wikidatawiki/entities/
Note: the download is large and takes quite some time, so it's best to download from a dated directory.
Look here: https://dumps.wikimedia.org/wikidatawiki/entities/ for a list of what's available, then download with curl or a similar tool. Choose the .bz2 file; it's smaller.
For example:
curl --retry 9 -C - -L -R -O https://dumps.wikimedia.org/wikidatawiki/entities/20250707/wikidata-20250707-all.json.bz2
You can always build your own dataset; we're just using Wikidata since it's fairly easy to get ahold of and try out.
It seems faster if decompression is done outside of Python, though the Python program will decompress on its own if passed a compressed file (.gz, .bz2). It also accepts piped input.
pbzip2 -d -c -m200 ~/data/wikidata-20250526-all.json.bz2| python extract_wikidata_pets.py -i - -o data/20250526_pets_wikidata.csv 2> 20250526_err.out
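For reference, handling "-", .gz, and .bz2 inputs in one place might look something like the sketch below. The helper names open_input and iter_entities are hypothetical and not necessarily what extract_wikidata_pets.py uses. The sketch also assumes the standard Wikidata JSON dump layout: one big JSON array with a single entity per line, each line ending in a comma.

```python
import bz2
import gzip
import json
import sys


def open_input(path):
    """Return a text-mode file object for '-', .gz, .bz2, or plain files."""
    if path == "-":
        return sys.stdin  # data piped in, e.g. from pbzip2
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def iter_entities(lines):
    """Yield one parsed entity per dump line, skipping the array brackets."""
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue
        yield json.loads(line)
```

Reading line by line like this keeps memory use flat, which matters for a dump of this size.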
This outputs a CSV file with the following columns:
id,canonical,language,name
The id is the wikidata id, like Q144
The canonical is a canonical name that's easier for me to read than the id or sometimes the name column
The language column is a language code as reported by wikidata.
The name column is the name in the specified language or at least as reported by wikidata.
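A quick way to poke at the CSV is to group the names by id using only the standard library. This is a sketch; the sample rows are made up, and Q_EXAMPLE is a placeholder rather than a real Wikidata id.

```python
import csv
import io
from collections import defaultdict


def group_names_by_id(csv_file):
    """Map each id to a list of (language, name) tuples."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["id"]].append((row["language"], row["name"]))
    return dict(groups)


# Made-up sample rows in the column layout described above.
sample = io.StringIO(
    "id,canonical,language,name\n"
    "Q144,dog,en,dog\n"
    "Q144,dog,fr,chien\n"
    "Q_EXAMPLE,rabbit,en,rabbit\n"
)
groups = group_names_by_id(sample)
print(groups["Q144"])
```

For the real file, pass an open file handle instead of the StringIO object (open it with newline="" as the csv module recommends).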
extract_wikidata_pets.py is just a starter for filtering data from the Wikidata extract. There are a couple of # TUNE HERE comments for the following:
- The INSTANCE_OF_WHITELIST list can be tuned if you want to extract something other than subclasses of what's in that list.
- The script does not use the "instance of" (P31) property in this case; it uses "subclass of" (P279) instead. It turns out that domesticated animals are "subclass of", not "instance of". So, if what you're looking to train on is an "instance of" something, you'll want to change this... or rewrite the entity selection criteria altogether.
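To make the P31 vs. P279 choice concrete, a selection check against a parsed entity could look roughly like this. It is a sketch, not the script's actual code: claim_targets, is_selected, and the whitelist value Q_PLACEHOLDER are hypothetical. It relies on the Wikidata JSON layout, where each statement's target id sits under mainsnak, datavalue, value, id.

```python
SUBCLASS_OF = "P279"
INSTANCE_OF = "P31"


def claim_targets(entity, prop):
    """Return the entity ids that `entity` points to via property `prop`."""
    targets = []
    for statement in entity.get("claims", {}).get(prop, []):
        # "novalue"/"somevalue" snaks carry no datavalue, so guard each step.
        value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            targets.append(value["id"])
    return targets


def is_selected(entity, whitelist, prop=SUBCLASS_OF):
    """True if any `prop` target of the entity is in the whitelist."""
    return any(target in whitelist for target in claim_targets(entity, prop))


# Hand-built entity with a placeholder target id, for illustration only.
entity = {
    "id": "Q144",
    "claims": {
        "P279": [
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"value": {"id": "Q_PLACEHOLDER"}}}},
            {"mainsnak": {"snaktype": "novalue"}},
        ]
    },
}
print(is_selected(entity, {"Q_PLACEHOLDER"}))
```

Switching to "instance of" would then be a matter of passing prop=INSTANCE_OF, assuming the rest of the selection logic stays the same.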
Use analyze_wikidata_extract.py to get some information about your dataset.
For example:
python analyze_wikidata_extract.py --csv_path data/20250707_pets_wikidata.csv --min_large 50 --min_small 2
- min_large denotes the number of examples a group must have to be considered a "large" group
- min_small denotes the number of examples a group must have to be considered a "small" group
This counts the number of large, small, and total groups. This information is useful when training your model, since it helps in defining the training/validation/test data sets.
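The counting itself can be sketched in a few lines. The function name summarize_groups is hypothetical, and this version assumes a "small" group is one with at least min_small but fewer than min_large examples; analyze_wikidata_extract.py may draw the boundaries differently.

```python
from collections import Counter


def summarize_groups(group_ids, min_large=50, min_small=2):
    """Count large, small, and total groups from one id per CSV row."""
    sizes = Counter(group_ids)
    large = sum(1 for n in sizes.values() if n >= min_large)
    # Assumption: "small" means min_small <= size < min_large.
    small = sum(1 for n in sizes.values() if min_small <= n < min_large)
    return {"large": large, "small": small, "total": len(sizes)}


# Toy data: group "a" has 3 rows, "b" has 2, "c" has 1.
ids = ["a", "a", "a", "b", "b", "c"]
summary = summarize_groups(ids, min_large=3, min_small=2)
print(summary)
```

Groups below min_small (like "c" here) count toward the total only; a group with a single name cannot supply a same-group pair.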
Use train_sbert_contrastive.py to fine-tune a sentence transformer to group pet names in different languages.
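Contrastive fine-tuning generally needs pairs of texts that should end up close together. One possible way to derive such pairs from the grouped names is sketched below; this is an assumption about the data preparation, not necessarily what train_sbert_contrastive.py does, and positive_pairs is a hypothetical helper.

```python
from itertools import combinations


def positive_pairs(groups):
    """Build every unordered pair of distinct names within each group."""
    pairs = []
    for names in groups.values():
        # Deduplicate and sort so the output order is stable.
        pairs.extend(combinations(sorted(set(names)), 2))
    return pairs


# Toy groups keyed by made-up ids, for illustration only.
toy_groups = {
    "group_dog": ["dog", "chien", "perra"],
    "group_rabbit": ["rabbit"],
}
pairs = positive_pairs(toy_groups)
print(pairs)
```

A group with only one name yields no pairs, which is one reason the group size counts above matter when planning the training split.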