Philosophical Writing in Early New Zealand Newspapers (Corpus Creation Workflow)

Joshua Black
University of Canterbury
joshua.black@canterbury.ac.nz
black.joshuad@gmail.com

Overview

This repository contains a series of Jupyter notebooks which together present a method for creating corpora for digital humanities research from large datasets of METS/ALTO files of the sort generated in many newspaper digitisation projects.

The origin of this method is in a project to investigate philosophical discourse in early New Zealand newspapers. Here 'early' is taken to mean pre-1900. This definition was adopted on purely pragmatic grounds: in 2019 the National Library of New Zealand Te Puna Mātauranga o Aotearoa released a large dataset of English-language newspaper content up to the year 1899 (details here).
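As an illustration of the kind of input the method starts from, the following is a minimal sketch of pulling plain text out of a single ALTO page using only the Python standard library. It is not taken from the notebooks: the function names `local_name` and `alto_text` and the sample fragment are assumptions made for this example. It also assumes the usual ALTO layout in which each `TextLine` element contains `String` elements carrying the recognised word in a `CONTENT` attribute.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def alto_text(xml_string):
    """Return the OCR text of one ALTO page, one output line per TextLine."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Keep only the String children; spacing (SP) elements are skipped.
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


# A tiny hand-written ALTO fragment (hypothetical, not from the dataset).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Philosophy"/><SP/><String CONTENT="and"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Science"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_text(SAMPLE))
```

Matching on the local tag name rather than a fixed namespace keeps the sketch usable across ALTO schema versions. It ignores hyphenation markers and the METS file that ties pages and articles together, both of which a full pipeline would need to handle.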

The Method

The method is summarised in the following image (flow_diagram.png):

Each stage has a notebook in the Notebooks directory. The model fit and application stages share a single notebook.

File Structure

Each stage of the method described above has its own notebook in the Notebooks directory. The notebook names should be self-explanatory.

Other directories:

Corpora: directory for candidate corpora.
Dataset: directory for the full National Library dataset and the processed data.
Labels: contains labels for the first iteration of the process.
Classifiers: contains a classifier trained on the first iteration of labels.
TopicModels: contains an LDA model generated in the first iteration of the corpus exploration process, as set out in the relevant notebook in the Notebooks directory.
Dictionaries: contains a gensim dictionary used during the first iteration of corpus exploration.
Pickles: contains some pre-generated metadata for the early New Zealand newspaper dataset.
Presentation: contains a presentation on the project delivered to the DHA 2021 conference.
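The layout above can be checked before running the notebooks. The following is a minimal sketch, assuming the directory names listed above sit directly under the repository root. The helpers `missing_dirs` and `ensure_dirs` are hypothetical names introduced for this example and are not part of the repository.

```python
from pathlib import Path

# Directory names taken from the list above.
EXPECTED_DIRS = [
    "Notebooks", "Corpora", "Dataset", "Labels", "Classifiers",
    "TopicModels", "Dictionaries", "Pickles", "Presentation",
]


def missing_dirs(repo_root):
    """Return the expected directory names that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


def ensure_dirs(repo_root, names=("Corpora", "Dataset")):
    """Create the named directories under repo_root if they are absent."""
    for name in names:
        (Path(repo_root) / name).mkdir(parents=True, exist_ok=True)
```

Calling `ensure_dirs(".")` from the repository root would create empty Corpora and Dataset directories ready to receive downloaded data, and `missing_dirs(".")` then reports anything else that is absent.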

The file package_list.txt lists all packages in the Anaconda environment used in the most recent run of these notebooks. It can be used to install all required packages (see here).

Content Files

Processed data is available here (place in the Dataset directory). Sample candidate corpora are available here (place in the Corpora directory).
