This package implements part-of-speech tagging of Riksdagens Protokoll ([Parla-CLARIN](https://clarin-eric.github.io/parla-clarin/)) files.
- The workflow makes use of GNU make, git, pyenv and poetry.
- Latest version of welfare-state-analytics/pyriksprot.
- Latest version of welfare-state-analytics/pyriksprot_tagger.
- NLP models for Sparv and Stanza installed.
- A local copy of the riksdagen-corpus GitHub repository.
- Update the pyriksprot-tagger configuration file (.env).
- Update the riksdagen-corpus repository.
- Run the tag.sh script:
  PYTHONPATH=. nohup ./tag.sh --target-folder /path/to/output/data > tag-it.version.log &
  You can also execute a predefined make recipe:
  make tag-it
If you run tag.sh without parameters then the values found in .env will be used. You can also specify
parameters as command line options:
usage: ./tag.sh [--data-folder folder] [--source-pattern pattern] --target-folder folder --tag tag [--force] [--update] [--max-procs n]
Creates new database using source as template. Source defaults to production.
--data-folder source root folder
--source-pattern source folder pattern
--target-folder target folder
--tag source corpus tag
--force drop target if exists
--update update target if exists
--max-procs max number of parallel jobs
Note that tag.sh will raise an error if the tag checked out in the Git repository and the tag specified in .env (or as a parameter) do not match.
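The tag consistency check described above can also be run by hand before launching a long tagging job. The following is a minimal sketch and not the logic that tag.sh itself uses: it assumes the tag is stored as RIKSPROT_REPOSITORY_TAG in a dotenv-style file, and that `git describe --tags --exact-match` identifies the checked-out tag. All function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Optional


def read_env_value(env_path: str, key: str) -> Optional[str]:
    """Return the value of `key` from a dotenv-style file, or None if it is absent."""
    for raw in Path(env_path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"').strip("'")
    return None


def checked_out_tag(repo_path: str) -> Optional[str]:
    """Return the tag that HEAD points at exactly, or None if HEAD is not on a tag."""
    result = subprocess.run(
        ["git", "-C", repo_path, "describe", "--tags", "--exact-match"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() if result.returncode == 0 else None


def assert_tags_match(configured: Optional[str], checked_out: Optional[str]) -> None:
    """Raise ValueError when the configured tag differs from the checked-out tag."""
    if not configured or configured != checked_out:
        raise ValueError(
            f"tag mismatch: .env has {configured!r}, repository is at {checked_out!r}"
        )


# Example usage (paths are placeholders):
#   configured = read_env_value(".env", "RIKSPROT_REPOSITORY_TAG")
#   assert_tags_match(configured, checked_out_tag("/path/to/git/repository"))
```

Keeping the comparison separate from the two lookups makes it possible to check a tag passed on the command line in the same way as one read from .env.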
This workflow processes the corpus metadata and generates a SQLite relational database. This database is used by the Westac Notebooks when filtering and pivoting data by speaker, party, etc. Use welfare-state-analytics/pyriksprot to create or update the metadata:
- Update pyriksprot/.env and set the current tag.
- Run make metadata to create a metadata database for the current tag.
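Once the metadata database has been created, it can be inspected with Python's standard sqlite3 module, for example to see which tables and columns it contains. This is a minimal sketch: the database path is a placeholder and the function name is hypothetical, since neither the file location nor the schema is specified here.

```python
import sqlite3
from contextlib import closing


def describe_database(db_path: str) -> dict:
    """Map each table name in a SQLite file to its list of column names."""
    with closing(sqlite3.connect(db_path)) as con:
        tables = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # PRAGMA table_info yields one row per column; index 1 holds the column name.
        return {
            table: [col[1] for col in con.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }


# Example usage (path is a placeholder):
#   for table, columns in describe_database("/path/to/metadata.db").items():
#       print(table, columns)
```

Running this against the databases for two different tags gives a quick overview of tables or columns that have appeared or disappeared.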
Due to potentially breaking changes in the metadata, we need to find the differences between the new and old versions. If fields or coded values have been added or changed, or any other breaking change has been made, then the scripts that process the metadata most likely need to be updated. Data updates are made using both SQL scripts and Python scripts.
- Identify breaking changes.
  - Download previous and current metadata into two separate folders:
    metadata2db download v0.9.0 ./tmp/metadata/v0.9.0
    metadata2db download v0.10.0 ./tmp/metadata/v0.10.0
    💡 Alt: python pyriksprot/scripts/metadata2db.py download v0.10.0 ./tmp/metadata/v0.10.0
    💡 Use moshfeu.compare-folders to compare folders in vscode.
  - If you find structural differences then you need to file an issue and request that the system be updated to deal with the changes. Module pyriksprot.sql contains SQL scripts for the metadata schema and (some) updates. Furthermore, some schema changes need to be handled in the pyriksprot.module module (e.g. pyriksprot.module.config). Changes may of course also affect the penelope corpus pipeline.
- Create a metadata database using welfare-state-analytics/pyriksprot for the given tag:
  - Update pyriksprot/.env (e.g. tag).
  - Run the metadata recipe:
    make metadata
- Create a default speech corpus using welfare-state-analytics/pyriksprot_tagger for the given tag:
  - Run the extract-speeches-to-feather recipe:
    make extract-speeches-to-feather
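The comparison of the two downloaded metadata folders in the first step above can also be scripted. The sketch below is one possible approach using only the Python standard library and is not part of pyriksprot. It assumes the two folders from the download example and, for the header check, that the metadata files are comma-separated with a header row. The function names are hypothetical.

```python
import csv
import filecmp
from pathlib import Path


def compare_folders(old_dir: str, new_dir: str) -> dict:
    """Report files removed, added or changed between two folders (top level only)."""
    old_names = {p.name for p in Path(old_dir).iterdir() if p.is_file()}
    new_names = {p.name for p in Path(new_dir).iterdir() if p.is_file()}
    common = sorted(old_names & new_names)
    # shallow=False forces a byte-by-byte comparison instead of trusting size and mtime.
    _, mismatch, errors = filecmp.cmpfiles(old_dir, new_dir, common, shallow=False)
    return {
        "removed": sorted(old_names - new_names),
        "added": sorted(new_names - old_names),
        "changed": sorted(mismatch + errors),
    }


def header_changes(old_file: str, new_file: str) -> tuple:
    """Return (dropped_columns, new_columns); assumes CSV files with a header row."""

    def header(path: str) -> list:
        with open(path, newline="", encoding="utf-8") as fh:
            return next(csv.reader(fh), [])

    old, new = header(old_file), header(new_file)
    return [c for c in old if c not in new], [c for c in new if c not in old]


# Example usage with the folders from the download step:
#   report = compare_folders("./tmp/metadata/v0.9.0", "./tmp/metadata/v0.10.0")
#   for name in report["changed"]:
#       print(name, header_changes(f"./tmp/metadata/v0.9.0/{name}",
#                                  f"./tmp/metadata/v0.10.0/{name}"))
```

A changed file with identical headers usually means only the data changed, whereas dropped or new columns point to the kind of structural difference that calls for filing an issue.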
See the appendix below if you instead want to use snakemake for updating the repository and tagging.
The easiest way to install the tagger is to clone the GitHub repository:
cd /path/to/any/folder
git clone git@github.com:welfare-state-analytics/pyriksprot_tagger
cd pyriksprot_tagger
pyenv local 3.11.3
poetry shell
pip install torch
poetry install
You can also install the tagger in an isolated Python virtual environment. This method requires you to manually download certain scripts depending on your specific workflow.
Use stanza-models.sh script to download Stanza files. Note that the target folder specified in the script must be the same as the folder specified by the STANZA_DATADIR environment variable (in .env).
Optional: Use penelope/scripts/install-spacy-models.sh to install relevant SpaCy models.
Update or create dotenv (.env) in the pyriksprot_tagger folder with the following variable definitions:
| Environment variable | Description |
|---|---|
| RIKSPROT_DATA_FOLDER | Parent folder (location) of Riksdagens corpus data folder |
| RIKSPROT_REPOSITORY_URL | https://github.com/welfare-state-analytics/riksdagen-corpus.git |
| RIKSPROT_REPOSITORY_TAG | Target corpus version. Must be a valid GitHub tag |
| SPARV_DATADIR | Sparv data folder |
| STANZA_DATADIR | Stanza data folder |
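Before starting a run it can be useful to confirm that the variables in the table above are actually set. The following is a minimal sketch rather than part of the tagger: it treats all five variables as required, which is an assumption, and the function names are hypothetical.

```python
import os

# Variable names are taken from the table above; treating all as required is an assumption.
REQUIRED = (
    "RIKSPROT_DATA_FOLDER",
    "RIKSPROT_REPOSITORY_URL",
    "RIKSPROT_REPOSITORY_TAG",
    "SPARV_DATADIR",
    "STANZA_DATADIR",
)
FOLDER_KEYS = ("RIKSPROT_DATA_FOLDER", "SPARV_DATADIR", "STANZA_DATADIR")


def missing_settings(env) -> list:
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in REQUIRED if not (env.get(name) or "").strip()]


def missing_folders(env) -> list:
    """Return the folder variables whose value is set but is not an existing directory."""
    return [
        name
        for name in FOLDER_KEYS
        if (env.get(name) or "").strip() and not os.path.isdir(env[name])
    ]


# Example usage against the current process environment:
#   problems = missing_settings(os.environ) + missing_folders(os.environ)
#   if problems:
#       raise SystemExit(f"check .env, problems with: {problems}")
```

Both functions accept any mapping, so they work with `os.environ` as well as with a dictionary parsed from the .env file.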
RIKSPROT_DATA_FOLDER="/path/to/data/folder"
RIKSPROT_REPOSITORY_URL="https://github.com/welfare-state-analytics/riksdagen-corpus.git"
RIKSPROT_REPOSITORY_TAG="vx.y.z"
SPARV_DATADIR="/path/to/sparv_datadir"
STANZA_DATADIR="/path/to/stanza_datadir"
If the riksdagen-corpus repository folder already exists, then do an update:
cd /path/to/git/repository
git pull
If the repository folder doesn't exist:
cd /path/to/parent-folder
git clone git@github.com:welfare-state-analytics/riksdagen-corpus.git
You need to check out the specific tag that you want to process:
cd /path/to/git/repository
git checkout vx.y.z
Make sure to update the file timestamps to the latest commit timestamp!
cd /path/to/pyriksprot-tagger
./pyriksprot_tagger/scripts/update-timestamps
Verify the current Python version (pyenv is recommended for easy switching between versions).
Create a new Python virtual environment (sandbox):
cd /some/folder
mkdir riksprot_tagging
cd riksprot_tagging
python -m venv .venv
source .venv/bin/activate
Install the pipeline and run the setup script:
pip install pyriksprot_tagger
setup-pipeline
To tag protocols you first need to activate the installed environment, and then follow the steps on how to tag protocols using snakemake.
cd /some/folder/riksprot_tagging
source .venv/bin/activate
This is an alternative way of updating the corpus repository.
% cd /path/to/pyriksprot-tagger/folder
If you want to create a new clone of the repository:
% make full-clone-repository
If you want to update an existing repository:
% make full-pull-repository
If you want to save space and do a shallow clone:
% make shallow-update-repository
Update the timestamps of the files in the repository work folder to match the last commit timestamp. Important! This is required if you use Snakemake when tagging:
% make update-repository-timestamps
Annotate using default settings:
make annotate
Annotate a single year (and set the CPU count):
make annotate YEAR=1960 CPU_COUNT=1
Run in the background, either via make:
nohup make annotate PROCESSES_COUNT=4 >& run.log &
or by calling snakemake directly:
nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &
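The timestamp step in this appendix matters because Snakemake compares file modification times to decide which outputs are out of date, while a fresh git checkout stamps every file with the checkout time. The following sketch illustrates the idea of resetting each tracked file's modification time to that of its last commit. It is not the project's update-timestamps script, the function names are hypothetical, and it runs one `git log` call per file, so it will be slow on a large repository.

```python
import os
import subprocess
from typing import Optional


def last_commit_time(repo: str, rel_path: str) -> Optional[int]:
    """Return the committer timestamp (epoch seconds) of the last commit touching a file."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "-1", "--format=%ct", "--", rel_path],
        capture_output=True,
        text=True,
        check=False,
    ).stdout.strip()
    return int(out) if out else None


def set_mtime(path: str, epoch_seconds: int) -> None:
    """Set both access and modification time of `path` to `epoch_seconds`."""
    os.utime(path, (epoch_seconds, epoch_seconds))


def update_timestamps(repo: str) -> int:
    """Reset mtimes of all tracked files to their last commit time; return files updated."""
    listing = subprocess.run(
        ["git", "-C", repo, "ls-files", "-z"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    updated = 0
    for rel_path in filter(None, listing.split("\0")):
        full_path = os.path.join(repo, rel_path)
        stamp = last_commit_time(repo, rel_path)
        if stamp is not None and os.path.isfile(full_path):
            set_mtime(full_path, stamp)
            updated += 1
    return updated


# Example usage (path is a placeholder):
#   print(update_timestamps("/path/to/git/repository"), "files updated")
```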