The purpose of this code is to provide consistent MARC record parsing for deduplication, in order to compare how humans, a machine learning deduplication algorithm, and an implementation of the GoldRush algorithm deduplicate MARC records.
The intention is that the output of the current MarcRecord methods be human-readable and serve as input to the machine learning deduplication algorithm, while the GoldRush methods build a string for literal matching.
The implementation of the GoldRush algorithm is based on the Colorado Alliance MARC record match key generation, as documented January 12, 2024.
This application will provide two layers of normalization.
The first layer of normalization consists of selecting a subset of MARC fields and subfields for human and machine learning algorithm comparison.
This includes showing fields in the vernacular script when available. Since not everyone is familiar with different scripts, these fields are presented with both the transliterated information and the vernacular script. The vernacular script is more likely to be accurately matched by the machine learning algorithm and by humans who are familiar with that script, while the transliterated form is more likely to be accurately matched by humans who are not.
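For illustration, here is a minimal sketch of how paired transliterated/vernacular fields can be read from MARC data, using pymarc (an assumption; the project's own parsing code may differ) and the standard 880 linkage convention, where subfield $6 ties a vernacular 880 field to the transliterated field it mirrors. The file name is illustrative.

```python
from pymarc import parse_xml_to_array

# Sketch: print the transliterated 245 title next to its linked
# vernacular 880 field.
for record in parse_xml_to_array("records.xml"):
    titles = record.get_fields("245")
    if not titles:
        continue
    print("transliterated:", titles[0].format_field())
    for field in record.get_fields("880"):
        # Subfield $6 holds a linkage like "245-01", pointing back to the
        # transliterated field this vernacular field corresponds to.
        linkage = field.get_subfields("6")
        if linkage and linkage[0].startswith("245"):
            print("vernacular:", field.format_field())
```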
The second layer of normalization will be built on the first, and will be an interpretation of the GoldRush algorithm intended for exact string matching.
To this end, string normalization in this layer is much stricter. Only vernacular versions of fields will be preserved.
- Some normalization strongly favors English-language texts (see the sketch after this list), e.g.
  - Removing English-language articles at the beginnings of titles
    - This also seems to duplicate the 245 second indicator for non-filing characters
  - Replacing '&' with 'and'
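As a hedged sketch of what this stricter, English-leaning normalization could look like (an illustration, not the project's actual implementation; the article list and the final punctuation-stripping step are assumptions):

```python
import re

# Illustrative assumption: the leading English-language articles to strip.
LEADING_ARTICLE = re.compile(r"^(a|an|the)\s+", re.IGNORECASE)

def strict_normalize(title: str) -> str:
    """Sketch of second-layer normalization for exact string matching."""
    text = title.strip()
    # Remove a leading English-language article
    # (overlapping with what the 245 second indicator already encodes).
    text = LEADING_ARTICLE.sub("", text)
    # Replace '&' with 'and'.
    text = text.replace("&", "and")
    # Assumed final step: lowercase and drop everything but letters and
    # digits, so match keys can be compared with plain string equality.
    return re.sub(r"[^a-z0-9]", "", text.lower())

# Two variant titles collapse to the same match key:
assert strict_normalize("The Cat & the Hat") == strict_normalize("cat and the hat!")
```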
- Set up the environment, as described below
- Call the python script with arguments for the file, pair of files, or directory you want to compare
For the main.py script, you can give either MarcXML or JSON files, and either one file (it will find duplicates within that file) or two files (it will find duplicates within and between the files). file1 is required; file2 and dir are not.
Compare two MarcXML files

```
python main.py --file1="tests/fixtures/alma_marc_records_short.xml" --file2="tests/fixtures/alma_marc_records.xml" --dir="experiments_files_and_output"
```

Find duplicates in a single JSON file

```
python main.py --file1="tests/fixtures/marc_records.json"
```

Find duplicates from files in a directory

```
python db_main.py --input_dir="db_input_files" --output_dir="db_experiments"

# short experiment
python db_main.py --input_dir="tests/fixtures/for_db"

# for comparison
python db_main.py --input_dir="/Users/kadelm/projects/dedup_for_comparison" --output_dir="for_comparison"
```

- If you do not already have settings and training data, it will open an interactive session in your terminal asking whether you, as a human, think two records are duplicates, in order to train the machine learning algorithm. Follow the instructions in your terminal.
- It will output a CSV of all the records you input, with three added columns:
  a. cluster_id - all records that it thinks are matches of each other will have the same cluster_id. If a record does not have a cluster_id, the machine learning algorithm does not think it has any duplicates.
  b. cluster_score - how confident the algorithm is that the record belongs to its cluster. The higher the number, the more likely the record is a true match.
  c. source_file - which file the record displayed is from.
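For illustration, a minimal sketch of inspecting that CSV with pandas (the library choice and the results.csv file name are assumptions):

```python
import pandas as pd

# "results.csv" is a hypothetical name for the CSV the script outputs;
# cluster_id, cluster_score, and source_file are the columns described above.
df = pd.read_csv("results.csv")

# Records without a cluster_id were not matched to any duplicate.
clustered = df.dropna(subset=["cluster_id"])

# Walk each proposed cluster, most confident members first.
for cluster_id, group in clustered.groupby("cluster_id"):
    print(f"cluster {cluster_id}:")
    print(group.sort_values("cluster_score", ascending=False))
```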
- Make a .venv

```
python3 -m venv .venv
```

- Activate the environment

```
. .venv/bin/activate
```

- Install dependencies

```
pip install -r requirements/[environment].txt
```

e.g.

```
pip install -r requirements/development.txt
```

OR

```
pip install -r requirements/common.txt
```

- Bring up the database using lando

```
lando start
```

- Create a .env file with the appropriate environment from the settings.toml file

```
cp .env.example .env
```

- Uncomment the line in the pyproject.toml setting the ENV_FOR_DYNACONF
Run the tests

```
pytest
```

- ruff - fast
  - Formatter - the --check flag does not make changes. Run without the --check flag for automatic fixing

```
ruff format . --check
```

  - Linter

```
ruff check .
```

- pylint - slower, does more in-depth checks
  - Currently excluding checks for documentation - remove these disables once this is remediated

```
pylint src tests main.py --disable=C0114,C0115,C0116
```