Repository for the Co3 legislative summarization system, evaluated on the BillSum corpus.
- Clone the repository if it does not already exist:
```
git clone https://github.com/jessiclassy/co3.git
```
- Download Miniconda for your OS
- Create a new environment:
```
conda create -n co3-env
```
- Activate the environment to start developing! Yay!
```
conda activate co3-env
```
- Install all the required packages:
```
pip install -r requirements.txt
```
- Download the optimized spaCy English language model for evaluation:
```
python -m spacy download en_core_web_sm
```

To reproduce the semantic self-segmented data from BillSum documents as implemented in previous work, we execute metric learning and semantic self-segmentation using a submodule in the repository:
```
cd preprocess/se3/
git submodule init
git submodule update
condor_submit learning.cmd
```

Before Se3 chunking, run basic cleaning:
```
python preprocess/clean.py
```
This script does basic regular-expression cleaning of extra whitespace and redundant headers, and takes approximately 1-2 minutes to run.
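The cleaning step can be approximated with a few regular expressions. The sketch below illustrates the kind of normalization `preprocess/clean.py` performs; the specific patterns (including the `HEADER_RE` example) are illustrative assumptions, not the repository's actual rules.

```python
import re

# Hypothetical boilerplate pattern; the real patterns in preprocess/clean.py may differ.
HEADER_RE = re.compile(r"(?m)^\s*<all>\s*$")

def clean_text(text: str) -> str:
    """Collapse extra whitespace and strip redundant header/footer lines."""
    text = HEADER_RE.sub("", text)          # drop boilerplate marker lines
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse blank-line runs
    return text.strip()
```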
```
python preprocess/reformat_se3_data.py
```
This script converts Se3 output (plaintext) into properly escaped CSV files for easier manipulation downstream. It takes 1-2 minutes to run.
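The reformatting idea can be sketched as follows. The column layout and single-string output are assumptions for illustration; the real `reformat_se3_data.py` may organize its output differently.

```python
import csv
import io

def chunks_to_csv(doc_id: str, chunks: list) -> str:
    """Serialize Se3-style plaintext chunks as a properly escaped CSV string.

    doc_id / chunk_index / text is a hypothetical column layout.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerow(["doc_id", "chunk_index", "text"])
    for i, chunk in enumerate(chunks):
        # The csv module escapes embedded quotes, commas, and newlines
        writer.writerow([doc_id, i, chunk])
    return buf.getvalue()
```

Quoting every field (`QUOTE_ALL`) keeps chunks containing commas or newlines intact when they are parsed back downstream.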
Condor jobs for Phase 1 model finetuning and evaluation are detailed in the
executables Phase1_Finetune.cmd and Phase1_Evaluate.cmd, respectively. On an NVIDIA Quadro RTX 8000, finetuning takes 10-15 hours, while evaluation takes 10-30 hours.
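For reference, an HTCondor submit description such as Phase1_Finetune.cmd generally follows the shape below; the executable name and resource requests are illustrative, not the repository's actual values:

```
# Hypothetical submit description; see Phase1_Finetune.cmd for the real one
universe       = vanilla
executable     = phase1_finetune.sh
request_gpus   = 1
request_memory = 32GB
output         = phase1_finetune.out
error          = phase1_finetune.err
log            = phase1_finetune.log
queue
```

Jobs of this shape are submitted with `condor_submit`, as in the commands above.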
Condor jobs for Phase 2 model finetuning and evaluation are detailed in the
executables Phase2_Finetune.cmd and Phase2_Evaluate.cmd, respectively. On an NVIDIA Quadro RTX 8000, finetuning takes 20-40 hours, while evaluation takes 10-30 hours.
A rare edge case (<1% of test cases) of Phase 2 finetuning with control tokens is that entire documents can yield blank outputs. Various postprocessing strategies to force summary generation are detailed in Phase3_Postprocess.cmd. On an NVIDIA Quadro RTX 8000, postprocessing takes less than 1 hour.
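One plausible strategy for the blank-output case is a lead-N extractive fallback, sketched below. This is an illustrative assumption, not necessarily one of the strategies implemented in Phase3_Postprocess.cmd.

```python
def backfill_blank(summary: str, source: str, n_sentences: int = 3) -> str:
    """If the generated summary is blank, back off to the source's lead sentences."""
    if summary.strip():
        return summary  # keep non-blank model output unchanged
    # Naive period-based sentence split; a real sentence splitter would be more robust
    sentences = [s.strip() for s in source.split(".") if s.strip()]
    return ". ".join(sentences[:n_sentences]) + "."
```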
Evaluation metrics are "patched" onto the new post-processed summaries:
```
python posthoc/patch_metric.py --file <FILENAME> --metric <METRICNAME> --batch_size <BATCHSIZE>
```
NB: the results for different post-processing strategies are not significantly different.
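The patching idea is to score existing summaries post hoc and attach a metric column, rather than regenerating anything. A minimal sketch follows, using a toy unigram-overlap metric; the row schema and function names are assumptions, not the actual interface of posthoc/patch_metric.py.

```python
def unigram_overlap(summary: str, reference: str) -> float:
    """Toy stand-in metric: fraction of reference unigrams present in the summary."""
    ref = set(reference.split())
    return len(set(summary.split()) & ref) / max(len(ref), 1)

def patch_metric(rows: list, metric_name: str, metric_fn) -> list:
    """Attach a score column to already-generated summary rows in place."""
    for row in rows:
        row[metric_name] = metric_fn(row["summary"], row["reference"])
    return rows
```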