AndreasBlombach/Register-tagging-systems


“Just do register analysis”, they said. “It’ll be easy”, they said: Evaluating the impact of lexico-grammatical tagging systems on multidimensional analysis

This repository contains data, code and analyses for our submission to LRE.

Raw texts

texts_raw/quotes_nonquotes_corpus contains the combined corpus data from the 19C (19th-century novels) and DNov (Dickens's novels) corpora, which are also available via CLiC. In our case, however, there are two files for each novel: one containing only the dialogue, the other containing only the narrative text. These were extracted using the CLiC API.

Register tagging systems

MAT

output_MAT contains MAT's outputs for all individual parts of our corpus.

pseudobibeR

output_pseudobiber/quotes_nonquotes_corpus contains pseudobibeR's feature counts for Stanza and spaCy, as well as runtimes for spaCy. Stanza's CoNLL-U output and runtime can be found in output_stanza.

To replicate these outputs, the following scripts can be used:

  • annotate_stanza.py runs Stanza to create CoNLL-U output
  • feature_counts_pseudobiber_stanza.R creates feature counts based on this output
  • feature_counts_pseudobiber_spacy.R runs different spaCy models and creates feature counts based on their outputs
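Assuming the scripts are run from the repository root and take no further arguments (neither is stated above, so check the script headers before running), the replication sequence might look like this:

```shell
# Hypothetical invocation -- script names are taken from this repository,
# but the working directory and any arguments are assumptions.
python annotate_stanza.py                    # writes CoNLL-U output (-> output_stanza)
Rscript feature_counts_pseudobiber_stanza.R  # feature counts from the Stanza output
Rscript feature_counts_pseudobiber_spacy.R   # runs spaCy models and counts features
```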

Note that both R scripts use an edited version of pseudobibeR (pseudobiber_changes.R) so that type-token ratios can be computed with a different window size. This makes them much slower (see output_pseudobiber/quotes_nonquotes_corpus/spacy_timings.csv). The runtimes reported in the paper are therefore based on the unedited package (see output_pseudobiber/quotes_nonquotes_corpus/spacy_timings_ttr100.csv).

pybiber

output_pybiber contains pybiber's feature counts and runtimes.

To replicate these outputs, feature_counts_pybiber.py can be used. Alternatively, feature_counts_pybiber_pipeline.py can be used; it is, however, much slower.

BiberPlus

output_biberplus contains BiberPlus' feature counts and runtimes.

To replicate these outputs, feature_counts_biberplus.py can be used.

Biberpy

output_biberpy contains Biberpy's feature counts and runtimes.

To replicate these results:

  • First, run biberpy/spacydir2json.py from the command line: python spacydir2json.py quotes_nonquotes_corpus en_core_web_sm >quotes_nonquotes.json. Adjust paths as needed: quotes_nonquotes_corpus must point to the folder containing the raw text files; quotes_nonquotes.json will be the output file.
  • You will also need the Biberpy scripts themselves (Biberpy is not a fully operational Python package).
  • Replace Biberpy's biberpy.py with our edited version (biberpy/biberpy.py) so that all features are normalised by token count (otherwise, Biberpy uses an approximation of clause count for several features).
  • Finally, run Biberpy's biber-dim.py from the command line: python biber-dim.py -f json -l en -i quotes_nonquotes.json -o output_biberpy_sm.tsv (again, adjusting paths as needed).
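Putting these steps together, a command-line session might look like the sketch below. The two python commands are taken from the instructions above; the cp line and its target path are hypothetical placeholders for "swap in our edited biberpy.py", since the location of your Biberpy checkout is not specified here.

```shell
# 1. Convert the raw-text folder to JSON with spaCy annotations
python spacydir2json.py quotes_nonquotes_corpus en_core_web_sm >quotes_nonquotes.json

# 2./3. Obtain the Biberpy scripts, then replace biberpy.py with the edited
#       version so all features are normalised by token count
#       (path/to/biberpy is a hypothetical checkout location)
cp biberpy/biberpy.py path/to/biberpy/biberpy.py

# 4. Compute the feature counts from the JSON file
python biber-dim.py -f json -l en -i quotes_nonquotes.json -o output_biberpy_sm.tsv
```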

Analysis

analysis.qmd is a Quarto document containing our analyses. Easily readable HTML output is also available.
