Riksdagens Protokoll Part-Of-Speech Tagging

This package implements part-of-speech tagging of Riksdagens Protokoll (Parla-CLARIN)[https://clarin-eric.github.io/parla-clarin/] files.

Prerequisites

The workflow makes use of Gnu make, git, pyenv and poetry.
Latest version of welfare-state-analytics/pyriksprot.
Latest version of welfare-state-analytics/pyriksprot_tagger.
NLP models for Sparv and Stanza installed.
A local copy of riksdagen-corpus Github repository.

Tagging Workflow

Update pyriksprot-tagger configuration file (.env).
Update riksdagen-corpus repository.

Run the tag.sh script:

PYTHONPATH=. nohup ./tag.sh --target-folder /path/to/output/data > tag-it.version.log &

You can also execute a predefined make recepi:

make tag-it

If you run tag.sh without parameters then the values found in .env will be used. You can also specify parameters as command line options:

usage: ./tag.sh [--data-folder folder] [--source-pattern pattern] --target-folder folder --tag tag [--force] [--update] [--max-procs n]]
Creates new database using source as template. Source defaults to production.

   --data-folder             source root folder
   --source-pattern          source folder pattern
   --target-folder           target folder
   --tag                     source corpus tag
   --force                   drop target if exists
   --update                  update target if exists
   --max-procs               max number of parallel jobs

Note that tag.sh will raise an error if the checkout tag in the Git repository and tag specified in .env (or as a parameter) mismatch.

Metadata Workflow

This workflow processes the corpus metadata and generates an Sqlite relational database. This database is used by the Westac Notebooks when filtering and pivoting data based on speaker, party etc. Use welfare-state-analytics/pyriksprot to create or update the metadata:

Update pyriksprot/.env and set current tag.
Run the make metadata to create a metadata database for current tag:

Detailed workflow

Due to potentiallyy breaking changes in the metadata we need to find differences between the new and old version of the metadata. If new fields or coded values have been added or change, or any other breaking change has been made then most likely the scripts that processes the metadata needs to be updated. Data updates are made both using SQL scripts and Python scripts.

Identify breaking changes.
- Download previous and current metadata in two seperate folders:
```
metadata2db download v0.9.0 ./tmp/metadata/v0.9.0
metadata2db download v0.10.0 ./tmp/metadata/v0.10.0
```
💡 Alt: python pyriksprot/scripts/metadata2db.py download v0.10.0 ./tmp/metadata/v0.10.0

💡 Use moshfeu.compare-folders to compare folders in vscode.
- If you find structural differences than you need to file an issue and request the system to be updated to deal with the changes. Module pyriksprot.sql contains SQL scripts for metadata schema and (some) updates. Furthermore, some schema changes need to be handled in the pyriksprot.module module (e.g. pyriksprot.module.config). Changes may of course also affect the penelope corpus pipeline.
Create a metadata database using welfare-state-analytics/pyriksprot for given tag:
- Update pyriksprot/.env (e.g. tag)
- Run the metadata recipe:
```
make metadata
```

Speech Corpus Workflow

Create a default speech corpus using welfare-state-analytics/pyriksprot_tagger for given tag:
- Run te recipi extract-speeches-to-feather:
```
make extract-speeches-to-feather
```

See appendix below if you instead want to use snakemake for updating repository and tagging,

Install pyriksprot tagger

Easiest way is to clone the GitHub repository:

cd /path/to/any/folder
git clone git@github.com:welfare-state-analytics/pyriksprot_tagger
cd pyriksprot_tagger
pyenv local 3.11.3
poetry shell
pip install torch
poetry install

You can also install the tagger in an isolated Python virtual environment. This method requires you to manually download certain scripts depending on your specific workflow.

Install Sparv and Stanza models

Use stanza-models.sh script to download Stanza files. Note that the target folder specified in the script must be the same as the folder specified by the STANZA_DATADIR environment variable (in .env).

Optional: Use penelope/scripts/install-spacy-models.sh to install relevant SpaCy models.

Update configuration

Update or create dotenv (.env) in the pyriksprot_tagger folder with the following variable definitions:

Environment variable	Description
RIKSPROT_DATA_FOLDER	Parent folder (location) of Riksdagens corpus data folder
RIKSPROT_REPOSITORY_URL	https://github.com/welfare-state-analytics/riksdagen-corpus.git
RIKSPROT_REPOSITORY_TAG	Target corpus version. Must be a valid Github tag
SPARV_DATADIR	Sparv data folder
STANZA_DATADIR	Stanza data folder

RIKSPROT_DATA_FOLDER="/path/to/data/folder"
RIKSPROT_REPOSITORY_URL="https://github.com/welfare-state-analytics/riksdagen-corpus.git"
RIKSPROT_REPOSITORY_TAG="vx.y.z"
SPARV_DATADIR="/path/to/sparv_datadir"
STANZA_DATADIR="/path/to/stanza_datadir"

Appendix

Setup a local copy of riksdagen-corpus Github repository

If riksdagen-corpus repository folder already exists, then do an update:

cd /path/to/git/repository
git pull

If repository folder doesn't exist:

cd /path/to/parent-folder
git clone git@github.com:welfare-state-analytics/pyriksprot_tagger.git

You need to checkout the specific tag that you want to process:

cd /path/to/git/repository
git checkout vx.y.z

Make sure to update file timestamps to latest commit timestamp!

cd /path/to/pyriksprot-tagger
./pyriksprot_tagger/scripts/update-timestamps

Install pyriksprot-tagger from PyPI

Verify current Python version (pyenv is recommended for easy switch between versions).

Create a new Python virtual environment (sandbox):

cd /some/folder
mkdir riksprot_tagging
cd riksprot_tagging
python -m venv .venv
source .venv/bin/activate

Install the pipeline and run setup script.

pip install pyriksprot_tagger
setup-pipeline

To tag protocols you first need to activate the installed environment, and then follow steps above on how to tag protocols using snakemake.

cd /some/folder/pyriksprot
source .venv/bin/activate

Create or update the repository using snakemake (not recommended)

This is an alternative way of updating the corpus repository.

% cd /path/to/pyriksprot-tagger/folder

If you want to create a new clone of the repository:

% make full-clone-repository

If you want to update an existing repository:

% make full-pull-repository

If you want to save space and do a shallow clone

% make shallow-update-repository

Update timestamp of repository work folder files to match last commit timestamp. Important! This is required if you use Snakemake when tagging:

% make update-repository-timestamps

How to annotate protocols using snakemake (not recommended)

Annotate using default settings:

make annotate

Annotate a single year (and set cpu count).

make annotate YEAR=1960 CPU_COUNT=1

Call snakemake directly:

nohup make annotate PROCESSES_COUNT=4 >& run.log &

or

nohup poetry run snakemake --config -j4 --keep-going --keep-target-files &

Name		Name	Last commit message	Last commit date
Latest commit History 667 Commits
configs		configs
purgatory		purgatory
pyriksprot_tagger		pyriksprot_tagger
tests		tests
.editorconfig		.editorconfig
.env		.env
.gitattributes		.gitattributes
.gitignore		.gitignore
.pylintrc		.pylintrc
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
Makefile.dev		Makefile.dev
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
tag.sh		tag.sh
tag_test_data.sh		tag_test_data.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Riksdagens Protokoll Part-Of-Speech Tagging

Prerequisites

Tagging Workflow

Metadata Workflow

Detailed workflow

Speech Corpus Workflow

Install pyriksprot tagger

Install Sparv and Stanza models

Update configuration

Appendix

Setup a local copy of riksdagen-corpus Github repository

Install pyriksprot-tagger from PyPI

Create or update the repository using snakemake (not recommended)

How to annotate protocols using snakemake (not recommended)

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Riksdagens Protokoll Part-Of-Speech Tagging

Prerequisites

Tagging Workflow

Metadata Workflow

Detailed workflow

Speech Corpus Workflow

Install pyriksprot tagger

Install Sparv and Stanza models

Update configuration

Appendix

Setup a local copy of riksdagen-corpus Github repository

Install pyriksprot-tagger from PyPI

Create or update the repository using snakemake (not recommended)

How to annotate protocols using snakemake (not recommended)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages