Digital Collection Metadata Assessment and Remediation Project - University of Michigan Library

February 2022 - Present. A pilot project remediating the metadata for the University of Michigan's "The United States and its Territories, 1870 - 1925: The Age of Imperialism" digital collection.

Further context on this project and this collection can be found in our report.

Authors

Jackson Huang (huangjq@umich.edu)
Curtis Hunt (huntcu@umich.edu)
Gregory McCollum (gregmcc@umich.edu)

Note

Updated information on how this project handled language metadata updates can be found here.

Workflow

Each step in our process has its own folder in this repository. The steps below give the order, operations, and output of the scripts in each folder.

Alma_Crosswalk contains the whole collection's metadata in a MARC (.mrc) file. The alma_to_csv.py script converts this data to a CSV, crosswalking it to the same headers as the DLXS version of the collection metadata and extracting or joining specific MARC fields so they are ready for comparison with the DLXS data in the next step. The crosswalked data is held in alma_full.csv.
In Diff, our diff.py script compares the two versions of the collection metadata held in alma_full.csv and dxls_full.csv. The script matches collection items on their MMS ID numbers and then loops through each item's fields, flagging any differences between the two versions after stripping away formatting differences and whitespace. Where a field's values differed, that field was written to the matches.csv document. Additionally, because we expected the Alma metadata to be more robust and up to date, we flagged in matches.csv any field where the DLXS value was longer, for further investigation.
We work with the matches.csv document further in Building_CheckLists. In building_checklists.py, we loop through each record in matches.csv, along with alma_full.csv and dxls_full.csv, and build a "check list" for each metadata attribute in the CheckLists subdirectory, populated with the Alma and DLXS values for each record flagged as having a longer DLXS value. These CSVs were uploaded to Google Sheets for manual review by the team. Additionally, a running, consolidated "best version" CSV was created from the field values that were not flagged as DLXS-longer, preferring the Alma values wherever a discrepancy existed.
Consolidating_Values contains an Edited_Sheets subdirectory holding the check lists from the previous step, now with our team's manually reviewed recommendations included. In our merging_best_values.py script, we loop through best_values.csv and add the suggested field values from these edited sheets. The consolidated version with complete field values is held in the new_full_best_values.csv document.
The Catalog_Linking directory contains scripts that build a CSV of information for each record, including a University of Michigan Library catalog search URL based on the record's title. Some of the main records for this collection in the U of M catalog lack a link to the digital collection item. We manually reviewed the catalog search results in Google Sheets to identify the records without digital collection links and to add the catalog identification numbers for the corresponding digital collection items.
In Subject_Sorting, we develop suggested tags for each record on the basis of place. In subject_test.py, we set up a brief series of regular expressions corresponding to each of the places represented. These regular expressions sought to capture historical and local variations in place names ("Puerto Rico", "Porto Rico"), demonyms ("Puerto Rican", "Puerto Ricans"), and usage of diacritical marks ("Hawaii", "Hawaiʻi"). We then loop through the full collection metadata and search for these expressions in each record's keywords, counting how often each term appears. The script then suggests a place tag based on which term appears (or appears most often) within a record's keyword string. For ties, or situations in which no regular expression matched, a second loop checks the record's title as well. The results of these tests can be found in subject_entries.csv, with full tag counts available in subjet_counts.json.
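
The comparison described in the Diff step can be sketched roughly as follows. This is an illustration only, not the code in diff.py: the `mms_id` column name, the normalization rules, and the function names are assumptions made for the example.

```python
import re


def normalize(value):
    """Collapse whitespace, trim trailing punctuation, and lowercase.

    These rules are assumed for illustration; the project's own
    formatting cleanup may differ.
    """
    collapsed = re.sub(r"\s+", " ", value or "").strip()
    return collapsed.rstrip(" .,;:/").lower()


def diff_records(alma_rows, dlxs_rows, key="mms_id"):
    """Return one entry per field whose value differs between matched records."""
    dlxs_by_id = {row[key]: row for row in dlxs_rows}
    differences = []
    for alma in alma_rows:
        dlxs = dlxs_by_id.get(alma[key])
        if dlxs is None:
            continue  # no DLXS counterpart to compare against
        for field, alma_value in alma.items():
            if field == key:
                continue
            alma_value = alma_value or ""
            dlxs_value = dlxs.get(field) or ""
            if normalize(alma_value) == normalize(dlxs_value):
                continue  # only a formatting difference, or identical
            differences.append({
                key: alma[key],
                "field": field,
                "alma": alma_value,
                "dlxs": dlxs_value,
                # Flag for manual review when the DLXS value is longer.
                "dlxs_longer": len(dlxs_value) > len(alma_value),
            })
    return differences
```

Rows in this shape can be read from the two CSV files with `csv.DictReader`, and the returned dictionaries written out with `csv.DictWriter`.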
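
For the Catalog_Linking step, building a title-based search URL might look like the sketch below. The base URL and the `query` parameter name are hypothetical placeholders, not the catalog's actual search endpoint, and `title_search_url` is a name invented for this example.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; substitute the real catalog search endpoint.
CATALOG_SEARCH_BASE = "https://catalog.example.edu/search"


def title_search_url(title, base=CATALOG_SEARCH_BASE):
    """Build a search URL for a record title, percent-encoding as needed."""
    query = urlencode({"query": title.strip()})
    return f"{base}?{query}"
```

Using `urlencode` rather than string concatenation keeps titles containing characters such as `&` or `?` from breaking the query string.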
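
The place-tagging logic in the Subject_Sorting step can be illustrated with a minimal sketch. The two patterns below cover only the examples named above and are assumptions, not the expressions used in subject_test.py; the function names are likewise invented for illustration.

```python
import re
from collections import Counter

# Illustrative patterns only: spelling variants, demonyms, and the okina.
PLACE_PATTERNS = {
    "Puerto Rico": re.compile(r"\b(?:Puerto|Porto) Ric(?:o|ans?)\b", re.IGNORECASE),
    "Hawaii": re.compile(r"\bHawai[ʻ'‘]?i(?:ans?)?\b", re.IGNORECASE),
}


def count_places(text):
    """Count pattern matches per place in a string."""
    counts = Counter()
    for place, pattern in PLACE_PATTERNS.items():
        hits = len(pattern.findall(text))
        if hits:
            counts[place] = hits
    return counts


def unique_top(counts):
    """Return the single most frequent place, or None if empty or tied."""
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def suggest_place(keywords, title):
    """Suggest a tag from the keywords, falling back to the title."""
    return unique_top(count_places(keywords)) or unique_top(count_places(title))
```

Returning `None` when both passes are inconclusive leaves those records for manual tagging rather than guessing.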

