“Just do register analysis”, they said. “It’ll be easy”, they said: Evaluating the impact of lexico-grammatical tagging systems on multidimensional analysis
This repository contains data, code and analyses for our submission to LRE.
`texts_raw/quotes_nonquotes_corpus` contains the combined corpus data from the 19C (19th-century novels) and DNov (Dickens's novels) corpora, also available here. In our case, however, there are two files per novel: one containing only dialogue, one containing only narrative text. These were extracted using the CLiC API.
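As an illustration of how such subsets can be requested from CLiC, here is a minimal Python sketch. The endpoint path, parameter names, and the example book identifier are assumptions on our part, not the exact calls used to build this corpus; consult the CLiC API documentation for the authoritative interface.

```python
# Hypothetical sketch of building CLiC API requests for the dialogue ("quote")
# and narrative ("nonquote") subsets of a novel. The base URL, endpoint, and
# parameter names are assumptions; check the CLiC API documentation.
from urllib.parse import urlencode

CLIC_BASE = "https://clic.bham.ac.uk/api"  # assumed base URL


def subset_url(book_id: str, subset: str) -> str:
    """Build a request URL for one subset ('quote' or 'nonquote') of one novel."""
    query = urlencode({"corpora": book_id, "subset": subset})
    return f"{CLIC_BASE}/subsets?{query}"


# One URL per output file: dialogue-only and narrative-only text for one novel
# (book id is illustrative only).
urls = [subset_url("dickens:OT", s) for s in ("quote", "nonquote")]
```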
`output_MAT` contains MAT's outputs for all individual parts of our corpus.
`output_pseudobiber/quotes_nonquotes_corpus` contains pseudobibeR's feature counts for Stanza and spaCy, as well as runtimes for spaCy. Stanza's CoNLL-U output and runtime can be found in `output_stanza`.
To replicate these outputs, the following scripts can be used:
- `annotate_stanza.py` runs Stanza to create CoNLL-U output
- `feature_counts_pseudobiber_stanza.R` creates feature counts based on this output
- `feature_counts_pseudobiber_spacy.R` runs different spaCy models and creates feature counts based on their outputs
Note that both R scripts use an edited version of pseudobibeR (`pseudobiber_changes.R`) in order to compute type-token ratios with a different window size. This makes them much slower (see `output_pseudobiber/quotes_nonquotes_corpus/spacy_timings.csv`). Runtimes reported in the paper are therefore based on the unedited package (see `output_pseudobiber/quotes_nonquotes_corpus/spacy_timings_ttr100.csv`).
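For readers unfamiliar with windowed type-token ratios, the following is an illustrative Python sketch of a moving-average TTR with a configurable window size (this is not pseudobibeR's code; the function name and the short-text fallback are our choices):

```python
def mattr(tokens, window=100):
    """Moving-average type-token ratio: the mean TTR over every sliding
    window of `window` tokens. Texts shorter than the window fall back
    to the plain TTR of the whole text."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

# A larger window means recomputing types over many more (and larger)
# sliding positions, which is why a changed window size affects runtime.
```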
`output_pybiber` contains pybiber's feature counts and runtimes.
To replicate these outputs, `feature_counts_pybiber.py` can be used. Alternatively, `feature_counts_pybiber_pipeline.py` can be used; it is, however, much slower.
`output_biberplus` contains BiberPlus's feature counts and runtimes.

To replicate these outputs, `feature_counts_biberplus.py` can be used.

`output_biberpy` contains Biberpy's feature counts and runtimes.
To replicate these results:
- first, run `biberpy/spacydir2json.py` from the command line: `python spacydir2json.py quotes_nonquotes_corpus en_core_web_sm > quotes_nonquotes.json` (adjust paths as needed: `quotes_nonquotes_corpus` needs to point to the folder containing raw text files, `quotes_nonquotes.json` will be the output file)
- then, you will need the Biberpy scripts (Biberpy is not a fully operational Python package)
- replace Biberpy's `biberpy.py` with our edited version (`biberpy/biberpy.py`) so that all features are normalised by token count (otherwise, Biberpy uses an approximation of clause count for several features)
- run Biberpy's `biber-dim.py` from the command line: `python biber-dim.py -f json -l en -i quotes_nonquotes.json -o output_biberpy_sm.tsv` (again, adjust paths as needed)
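The point of the `biberpy.py` edit is the normalisation basis: all features are divided by the token count rather than an approximated clause count. This can be sketched as follows (illustrative only, not Biberpy's actual code; the function name is ours):

```python
def rate_per_1000_tokens(count: int, n_tokens: int) -> float:
    """Normalise a raw feature count to a rate per 1,000 tokens,
    the normalisation our edited biberpy.py applies to all features."""
    if n_tokens == 0:
        raise ValueError("cannot normalise over an empty text")
    return 1000 * count / n_tokens


# Illustrative: 12 occurrences of a feature in a 4,000-token text
rate = rate_per_1000_tokens(12, 4000)  # 3.0 per 1,000 tokens
```

Normalising every feature by the same denominator keeps the counts directly comparable across features and across texts of different lengths, which is not guaranteed when some features are instead divided by an approximated clause count.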
`analysis.qmd` is a Quarto document containing our analyses. Easily readable HTML output is also available.