Co3: Controlled Content Coverage for Abstractive Legislative Summarization

Repository for the Co3 legislative summarization system, evaluated on the BillSum corpus.

Prerequisites

Conda first-time setup

  1. Clone the repository if it does not already exist:
git clone https://github.com/jessiclassy/co3.git
  2. Download miniconda for your OS
  3. Create a new environment:
conda create -n co3-env
  4. Activate the environment to start developing! Yay!
conda activate co3-env
  5. Install all the required packages:
pip install -r requirements.txt
  6. Download the optimized spaCy English language model for evaluation:
python -m spacy download en_core_web_sm

Semantic Self-Segmentation

To reproduce the semantically self-segmented data from BillSum documents, as implemented in previous work, run metric learning and semantic self-segmentation via a submodule of this repository:

cd preprocess/se3/
git submodule init
git submodule update
condor_submit learning.cmd

Data Cleaning and Reformatting

python preprocess/clean.py

This script performs basic regular-expression cleaning of extra whitespace and redundant headers. It runs before Se3 chunking and takes approximately 1-2 minutes.
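For reference, here is a minimal sketch of this kind of cleaning; the patterns below are illustrative assumptions, and the actual ones live in preprocess/clean.py.

import re

# Illustrative patterns only; the real ones are defined in preprocess/clean.py.
HEADER_PATTERN = re.compile(r"^\s*SECTION \d+\..*$", re.MULTILINE)  # assumed header format

def clean_text(text: str) -> str:
    """Collapse extra whitespace and drop redundant headers (sketch)."""
    text = HEADER_PATTERN.sub("", text)      # drop redundant headers
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse excess blank lines
    return text.strip()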

python preprocess/reformat_se3_data.py

This script converts Se3 output (plaintext) into properly escaped CSV files for easier downstream manipulation. It takes 1-2 minutes to run.
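A minimal sketch of the conversion, assuming one Se3 chunk per plaintext line and a hypothetical single-column layout; the real column names are defined in preprocess/reformat_se3_data.py.

import csv
from pathlib import Path

def plaintext_to_csv(txt_path: str, csv_path: str) -> None:
    """Write Se3 plaintext chunks as a properly escaped CSV (sketch)."""
    chunks = Path(txt_path).read_text(encoding="utf-8").splitlines()
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)  # escape commas, quotes, newlines
        writer.writerow(["chunk"])                     # hypothetical column name
        for chunk in chunks:
            writer.writerow([chunk])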

Experiment Protocol

Phase 1: Methodological Baseline

Condor jobs for Phase 1 model finetuning and evaluation are detailed in the executables Phase1_Finetune.cmd and Phase1_Evaluate.cmd, respectively. On an NVIDIA Quadro RTX 8000, finetuning takes 10-15 hours and evaluation takes 10-30 hours.

Phase 2: Finetuning with Control Tokens

Condor jobs for Phase 2 model finetuning and evaluation are detailed in the executables Phase2_Finetune.cmd and Phase2_Evaluate.cmd, respectively. On an NVIDIA Quadro RTX 8000, finetuning takes 20-40 hours and evaluation takes 10-30 hours.
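For intuition, the sketch below shows one common way control tokens are attached to model inputs; the token names and coverage buckets here are assumptions, not the Phase 2 implementation.

# Hypothetical control-token preprocessing; the actual token set and placement
# are defined by the Phase 2 scripts, not by this sketch.
CONTROL_TOKENS = ["<cov_low>", "<cov_med>", "<cov_high>"]  # assumed coverage buckets

def add_control_token(chunk: str, coverage_bucket: int) -> str:
    """Prepend a content-coverage control token so the model can condition on it."""
    return f"{CONTROL_TOKENS[coverage_bucket]} {chunk}"

# Example: tag a chunk as high-coverage before finetuning/inference.
source = add_control_token("The bill amends section 42 of ...", coverage_bucket=2)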

Phase 3: Blank-postprocessing

A rare edge case of Phase 2 finetuning with control tokens (<1% of test cases) is that an entire document can produce a blank summary. Various postprocessing strategies that force summary generation for some $k$ chunks of a given document are implemented in this stage. The Condor job for Phase 3 summary generation and evaluation is detailed in the executable Phase3_Postprocess.cmd. On an NVIDIA Quadro RTX 8000, postprocessing takes less than 1 hour.
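As a rough illustration (not the repository's implementation), one such strategy could look like the following: if the full-document summary comes back blank, re-summarize the first $k$ chunks and concatenate the non-empty results.

from typing import Callable, List

def force_summary(chunks: List[str], summarize: Callable[[str], str], k: int = 3) -> str:
    """Fallback when a document yields a blank summary: retry on the first k chunks (sketch)."""
    summary = summarize(" ".join(chunks))
    if summary.strip():
        return summary
    pieces = [summarize(chunk) for chunk in chunks[:k]]
    return " ".join(piece for piece in pieces if piece.strip())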

Evaluation metrics are "patched" onto the new post-processed summaries:

python posthoc/patch_metric.py --file <FILENAME> --metric <METRICNAME> --batch_size <BATCHSIZE>

NB: the results for different post-processing strategies are not significantly different.
