Joshua Black
University of Canterbury
joshua.black@canterbury.ac.nz
black.joshuad@gmail.com
This repository contains a series of Jupyter notebooks which together present a method for creating corpora for digital humanities research from large datasets of METS/ALTO files of the sort generated in many newspaper digitisation projects.
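For readers unfamiliar with the format, the following is a minimal sketch of how plain text can be pulled out of a single ALTO file using only the Python standard library. It is an illustration, not the code used in the notebooks. It assumes the usual ALTO layout in which each `TextLine` element holds `String` elements whose `CONTENT` attribute carries one word. The function names `local_name` and `alto_text` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_text(xml_source):
    """Return the text of an ALTO document, one output line per TextLine."""
    root = ET.fromstring(xml_source)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each word is stored in the CONTENT attribute of a String child;
        # spacing elements (SP) and hyphenation elements (HYP) are skipped.
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


# Tiny hand-written sample (an assumption for illustration); real files also
# carry layout coordinates, styles, and further metadata.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Philosophy"/><SP/><String CONTENT="and"/></TextLine>
    <TextLine><String CONTENT="religion."/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_text(SAMPLE))
```

Matching on the local tag name rather than a fixed namespace is a design choice made here so that the sketch does not depend on which ALTO schema version a given dataset uses.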
The origin of this method is in a project to investigate philosophical discourse in early New Zealand newspapers. Here 'early' is taken to mean pre-1900. This definition was adopted on purely pragmatic grounds: in 2019 the National Library of New Zealand Te Puna Mātauranga o Aotearoa released a large dataset of English-language newspaper content up to the year 1899 (details here).
The method is summarised in the following image:

Each stage of the method has its own notebook in the Notebooks directory, except the model fit and application stages, which share a notebook. The notebook names should be self-explanatory.
Other directories:
Corpora: directory for candidate corpora.
Dataset: a directory to put the full national library dataset and the processed data.
Labels: contains labels for the first iteration of the process.
Classifiers: contains a classifier trained on the first iteration of labels.
TopicModels: contains an LDA model generated in the first iteration of the corpus exploration process as set out in the relevant notebook in the Worksheets directory.
Dictionaries: contains a gensim dictionary used during the first iteration of corpus exploration.
Pickles: Some pre-generated metadata for the early New Zealand newspaper dataset.
Presentation: contains a presentation on the project delivered to the DHA 2021 conference.
The file packages_list.txt lists every package in the Anaconda environment used in the most recent run of these notebooks. It can be used to recreate that environment and install all required packages (see here).
Processed data is available here (place it in the Dataset directory), and sample candidate corpora are available here (place them in the Corpora directory).
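Before running the notebooks, it may help to confirm that the downloads landed in the right place. The sketch below is a hypothetical helper, not part of the repository. It assumes it is run from the repository root and only checks that the Dataset and Corpora directories exist and are non-empty. It does not validate the files themselves.

```python
from pathlib import Path


def missing_or_empty(repo_root, names=("Dataset", "Corpora")):
    """Return the directory names under repo_root that are absent or empty."""
    root = Path(repo_root)
    problems = []
    for name in names:
        directory = root / name
        # A directory counts as a problem if it does not exist or has no entries.
        if not directory.is_dir() or not any(directory.iterdir()):
            problems.append(name)
    return problems


# Assumes the current working directory is the repository root.
print(missing_or_empty("."))
```

An empty list means both directories contain at least one entry.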
Sample results and pre-generated cooccurrence networks from an earlier version of this project are available here.