Repository for the Co3 legislative summarization system, evaluated on the BillSum corpus.
- Clone the repository if it does not already exist:
```
git clone https://github.com/jessiclassy/co3.git
```
- Download Miniconda for your OS
- Create a new environment:
```
conda create -n co3-env
```
- Activate the environment to start developing! Yay!
```
conda activate co3-env
```
- Install all the required packages:
```
pip install -r requirements.txt
```
- Download the optimized spaCy English language model for evaluation:
```
python -m spacy download en_core_web_sm
```

To reproduce the semantic self-segmented data from BillSum documents as implemented in previous work, we execute metric learning and semantic self-segmentation using a submodule in the repository:
```
cd preprocess/se3/
git submodule init
git submodule update
condor_submit learning.cmd
```

Before Se3 chunking, run basic cleaning:
```
python preprocess/clean.py
```
This script does basic regular-expression cleaning of extra whitespace and redundant headers, and takes approximately 1-2 minutes to run.
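The cleaning step can be approximated with a few regular expressions. The sketch below illustrates the kind of normalization `preprocess/clean.py` performs; the specific patterns (including the `HEADER_RE` example) are illustrative assumptions, not the repository's actual rules.

```python
import re

# Hypothetical boilerplate pattern; the real patterns in preprocess/clean.py may differ.
HEADER_RE = re.compile(r"(?m)^\s*<all>\s*$")

def clean_text(text: str) -> str:
    """Collapse extra whitespace and strip redundant header/footer lines."""
    text = HEADER_RE.sub("", text)          # drop boilerplate marker lines
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse blank-line runs
    return text.strip()
```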
```
python preprocess/reformat_se3_data.py
```
This script converts Se3 output (plaintext) into properly escaped CSV files for easier manipulation downstream. It takes 1-2 minutes to run.
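The reformatting idea can be sketched as follows. The column layout and single-string output are assumptions for illustration; the real `reformat_se3_data.py` may organize its output differently.

```python
import csv
import io

def chunks_to_csv(doc_id: str, chunks: list) -> str:
    """Serialize Se3-style plaintext chunks as a properly escaped CSV string.

    doc_id / chunk_index / text is a hypothetical column layout.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerow(["doc_id", "chunk_index", "text"])
    for i, chunk in enumerate(chunks):
        # The csv module escapes embedded quotes, commas, and newlines
        writer.writerow([doc_id, i, chunk])
    return buf.getvalue()
```

Quoting every field (`QUOTE_ALL`) keeps chunks containing commas or newlines intact when they are parsed back downstream.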
Condor jobs for Phase 1 model finetuning and evaluation are detailed in the
executables Phase1_Finetune.cmd and Phase1_Evaluate.cmd, respectively. On an NVIDIA Quadro RTX 8000, finetuning takes 10-15 hours, while evaluation takes 10-30 hours.
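For reference, an HTCondor submit description such as Phase1_Finetune.cmd generally follows the shape below; the executable name and resource requests are illustrative, not the repository's actual values:

```
# Hypothetical submit description; see Phase1_Finetune.cmd for the real one
universe       = vanilla
executable     = phase1_finetune.sh
request_gpus   = 1
request_memory = 32GB
output         = phase1_finetune.out
error          = phase1_finetune.err
log            = phase1_finetune.log
queue
```

Jobs of this shape are submitted with `condor_submit`, as in the commands above.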
Condor jobs for Phase 2 model finetuning and evaluation are detailed in the
executables Phase2_Finetune.cmd and Phase2_Evaluate.cmd, respectively. On an NVIDIA Quadro RTX 8000, finetuning takes 20-40 hours, while evaluation takes 10-30 hours.
A rare edge case (<1% of test cases) of Phase 2 finetuning with control tokens is that entire documents can yield blank outputs. Various postprocessing strategies to force summary generation are detailed in Phase3_Postprocess.cmd. On an NVIDIA Quadro RTX 8000, postprocessing takes less than 1 hour.
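One plausible strategy for the blank-output case is a lead-N extractive fallback, sketched below. This is an illustrative assumption, not necessarily one of the strategies implemented in Phase3_Postprocess.cmd.

```python
def backfill_blank(summary: str, source: str, n_sentences: int = 3) -> str:
    """If the generated summary is blank, back off to the source's lead sentences."""
    if summary.strip():
        return summary  # keep non-blank model output unchanged
    # Naive period-based sentence split; a real sentence splitter would be more robust
    sentences = [s.strip() for s in source.split(".") if s.strip()]
    return ". ".join(sentences[:n_sentences]) + "."
```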
Evaluation metrics are "patched" onto the new post-processed summaries:
```
python posthoc/patch_metric.py --file <FILENAME> --metric <METRICNAME> --batch_size <BATCHSIZE>
```
NB: the results for different post-processing strategies are not significantly different.
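The patching idea is to score existing summaries post hoc and attach a metric column, rather than regenerating anything. A minimal sketch follows, using a toy unigram-overlap metric; the row schema and function names are assumptions, not the actual interface of posthoc/patch_metric.py.

```python
def unigram_overlap(summary: str, reference: str) -> float:
    """Toy stand-in metric: fraction of reference unigrams present in the summary."""
    ref = set(reference.split())
    return len(set(summary.split()) & ref) / max(len(ref), 1)

def patch_metric(rows: list, metric_name: str, metric_fn) -> list:
    """Attach a score column to already-generated summary rows in place."""
    for row in rows:
        row[metric_name] = metric_fn(row["summary"], row["reference"])
    return rows
```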