authorship-verification


Updates (January 2026)

Intro Video

New additions to this repository:

  • Reorganized Structure: Files organized into folders (app/, models/, training/, testing/, analysis/, logs/, visualizations/, data/)
  • New Models: Added simplified Stylometric + BERT + Ensemble models with improved performance
    • training/train_stylometric_pan.py - Train stylometric model
    • training/finetune_bert_v2.py - Fine-tune BERT
    • training/create_ensemble.py - Create ensemble
    • testing/test_ensemble_detailed.py - Evaluate models
  • Streamlit App: Interactive demo with interpretability in app/ folder
  • Documentation: Added DIRECTORY_STRUCTURE.md, CHANGELOG.md, and app/README.md

Performance Comparison:

| Model | Accuracy | F1 Score | AUC |
|---|---|---|---|
| Original Models (from paper): | | | |
| Fine-tuned BERT | 52.4% | 68.8% | N/A |
| Feature Vector | 62.6% | 65.3% | 0.646 |
| Original Models (recreated from paper): | | | |
| Base BERT + Calibration | 63.7% | 58.2% | 0.676 |
| Fine-tuned BERT + Calibration | 70.1% | 71.6% | 0.760 |
| Feature Vector | 58.6% | 57.9% | 0.619 |
| New Models (Jan 2026): | | | |
| BERT (fine-tuned, simplified) | 73.9% | 73.8% | 0.821 |
| Stylometric (PAN-style) | 62.2% | 57.1% | 0.665 |
| Ensemble (BERT + Stylometric) | 73.9% | 73.8% | 0.823 |

Original code from the paper is preserved in featurevector/ and siamesebert/ folders.

See CHANGELOG.md for detailed changes.


Code for authorship verification research project.

Website code can be found at https://github.com/swan-07/streamlit-av

Website hosted at https://same-writer-detector.streamlit.app/

Paper can be found at https://swan-07.github.io/assets/Transparent%20Authorship%20Verification.pdf

The code in this repo was run in Jupyter Notebooks, and the model scripts (everything in the Models section below) were run on RunPod with an A100 SXM.

Pipeline:

[Figure: pipeline diagram]

Slides: https://docs.google.com/presentation/d/1zG6BA4hjz4E12kYroUOce2GK8M6MuL8UAWxcHkyKqaw/edit?usp=sharing

Models

https://github.com/swan-07/authorship-verification/tree/main/featurevector has code for implementing the Feature Vector model, modified from the implementation in https://github.com/janithnw/pan2021_authorship_verification/tree/main.

Run preprocess.ipynb to preprocess the data. This takes a LONG time; I split the data into multiple chunks, ran them in parallel, and combined the results in combine.ipynb.

Run large_train_model.ipynb to create feature vectors and train the model.

Use large_predict.ipynb for predictions and getting important features.

https://github.com/swan-07/authorship-verification/tree/main/siamesebert/methods has code for implementing the Embedding model.

Run bert.ipynb to train the BERT model (based on the implementation in https://github.com/JacobTyo/Valla/tree/main).

Run logreg.ipynb to fit a logistic regression model that calibrates probability predictions from the cosine similarity of the embeddings. The same notebook also performs attention-based highlighting and generates predictions.
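As a rough illustration of the calibration idea (not the code in logreg.ipynb), the sketch below maps a cosine similarity between two embeddings to a same-author probability with a one-feature logistic regression. The function names, the toy similarities, the toy embeddings, and the plain gradient-descent fit are all assumptions made for this example; the notebook may use a library implementation and different settings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_calibrator(sims, labels, lr=0.5, epochs=2000):
    """Fit p(same) = sigmoid(w * sim + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(sims)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for s, y in zip(sims, labels):
            error = sigmoid(w * s + b) - y
            grad_w += error * s
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_proba(sim, w, b):
    """Calibrated probability that two texts share an author."""
    return sigmoid(w * sim + b)


# Toy training data (made up): similarities of text pairs and their labels.
train_sims = [0.91, 0.85, 0.78, 0.40, 0.35, 0.20]
train_labels = [1, 1, 1, 0, 0, 0]
w, b = fit_calibrator(train_sims, train_labels)

# Hypothetical embeddings for two texts; real ones would come from the BERT model.
emb_a = [0.20, 0.70, 0.10]
emb_b = [0.25, 0.65, 0.05]
prob = predict_proba(cosine_similarity(emb_a, emb_b), w, b)
print(f"probability of same author: {prob:.3f}")
```

Fitting a separate calibrator keeps the embedding model unchanged while turning a raw similarity score into something that can be read as a probability.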

Datasets

The curated dataset is available on Hugging Face at https://huggingface.co/datasets/swan07/authorship-verification

Code for cleaning and modifying the datasets can be found in https://github.com/swan-07/authorship-verification/blob/main/Authorship_Verification_Datasets.ipynb and is detailed in the paper.

Datasets used to produce the final dataset are:

  1. Reuters50

@misc{misc_reuter_50_50_217, author = {Liu,Zhi}, title = {{Reuter_50_50}}, year = {2011}, howpublished = {UCI Machine Learning Repository}, note = {{DOI}: https://doi.org/10.24432/C5DS42} }

License: (CC BY 4.0)

  2. The Blog Authorship Corpus

@misc{misc_blog_authorship_corpus, author = {J. Schler and M. Koppel and S. Argamon and J. Pennebaker}, title = {{Effects of Age and Gender on Blogging}}, year = {2006}, howpublished = {2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs}, note = {https://u.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf} }

License from https://www.kaggle.com/datasets/rtatman/blog-authorship-corpus: The corpus may be freely used for non-commercial research purposes.

  3. Victorian

@misc{misc_victorian_era_authorship_attribution_454, author = {Gungor,Abdulmecit}, title = {{Victorian Era Authorship Attribution}}, year = {2018}, howpublished = {UCI Machine Learning Repository}, note = {{DOI}: https://doi.org/10.24432/C5SW4H} }

License: (CC BY 4.0)

  4. arXiv

@misc{misc_arXiv_100authors_comp_sci, author = {Moreo, Alejandro}, title = {{arXiv abstracts and titles from 1,469 single-authored papers (100 unique authors) in computer science }}, year = {2022}, howpublished = {Zenodo}, note = {{DOI}: https://doi.org/10.5281/zenodo.7404702} }

License: (CC BY 4.0)

  5. DarkReddit

@article{DBLP:journals/corr/abs-2112-05125, author = {Andrei Manolache and Florin Brad and Elena Burceanu and Antonio Barbalau and Radu Tudor Ionescu and Marius Popescu}, title = {Transferring BERT-like Transformers' Knowledge for Authorship Verification}, journal = {CoRR}, volume = {abs/2112.05125}, year = {2021}, url = {https://arxiv.org/abs/2112.05125}, eprinttype = {arXiv}, eprint = {2112.05125}, timestamp = {Mon, 13 Dec 2021 17:51:48 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2112-05125.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

@inproceedings{Kestemont2020OverviewOT, author = {Mike Kestemont and Enrique Manjavacas and Ilia Markov and Janek Bevendorff and Matti Wiegmann and Efstathios Stamatatos and Martin Potthast and Benno Stein}, editor = {Linda Cappellato and Carsten Eickhoff and Nicola Ferro and Aur{\'{e}}lie N{\'{e}}v{\'{e}}ol}, title = {Overview of the Cross-Domain Authorship Verification Task at {PAN} 2020}, booktitle = {Working Notes of {CLEF} 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020}, series = {{CEUR} Workshop Proceedings}, volume = {2696}, publisher = {CEUR-WS.org}, year = {2020}, url = {http://ceur-ws.org/Vol-2696/paper\_264.pdf}, timestamp = {Tue, 27 Oct 2020 17:12:48 +0100}, biburl = {https://dblp.org/rec/conf/clef/KestemontMMBWSP20.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

License from https://github.com/bit-ml/Dupin/tree/main: not disclosed

  6. British Academic Written English (BAWE)

@misc{20.500.12024/2539, title = {British Academic Written English Corpus}, author = {Nesi, Hilary and Gardner, Sheena and Thompson, Paul and Wickens, Paul}, url = {http://hdl.handle.net/20.500.12024/2539}, note = {Oxford Text Archive}, copyright = {Distributed by the University of Oxford under a Creative Commons Attribution-{NonCommercial}-{ShareAlike} 3.0 Unported License.}, year = {2008} }

License from https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2539: (CC BY-NC-SA 3.0)

  7. IMDB62

@article{seroussi2014authorship, title={Authorship attribution with topic models}, author={Seroussi, Yanir and Zukerman, Ingrid and Bohnert, Fabian}, journal={Computational Linguistics}, volume={40}, number={2}, pages={269--310}, year={2014}, publisher={MIT Press One Rogers Street, Cambridge, MA 02142-1209, USA journals-info~…} }

License from https://umlt.infotech.monash.edu/?page_id=266: not disclosed

  8. PAN11

@misc{misc_pan11-author-identification-corpora, author = {Argamon, Shlomo and Juola, Patrick}, title = {{PAN11 Author Identification: Attribution}}, year = {2011}, howpublished = {Zenodo}, note = {{DOI}: https://doi.org/10.5281/zenodo.3713245} }

License: not disclosed

  9. PAN13

@misc{misc_pan13-authorship-verification-test-and-training, author = {Juola, Patrick and Stamatatos, Efstathios}, title = {{PAN13 Author Identification: Verification}}, year = {2013}, howpublished = {Zenodo}, note = {{DOI}: https://doi.org/10.5281/zenodo.3715998} }

License: not disclosed

  10. PAN14

@misc{misc_pan14-authorship-verification-test-and-training, author = {Stamatatos, Efstathios and Daelemans, Walter and Verhoeven, Ben and Potthast, Martin and Stein, Benno and Juola, Patrick and Sanchez-Perez, Miguel A. and Barrón-Cedeño, Alberto}, title = {{PAN14 Author Identification: Verification}}, year = {2014}, howpublished = {Zenodo}, note = {{DOI}: https://doi.org/10.5281/zenodo.3716032} }

License: not disclosed

  11. PAN15

@misc{misc_pan15-authorship-verification-test-and-training, author = {Stamatatos, Efstathios and Daelemans, Walter and Verhoeven, Ben and Juola, Patrick and López-López, Aurelio and Potthast, Martin and Stein, Benno}, title = {{PAN15 Author Identification: Verification}}, year = {2015}, howpublished = {Zenodo}, note = {{DOI}: https://doi.org/10.5281/zenodo.3737563} }

License: not disclosed

  12. PAN20

@Article{stein:2020k, author = {Sebastian Bischoff and Niklas Deckers and Marcel Schliebs and Ben Thies and Matthias Hagen and Efstathios Stamatatos and Benno Stein and Martin Potthast}, journal = {CoRR}, month = may, title = {{The Importance of Suppressing Domain Style in Authorship Analysis}}, url = {https://arxiv.org/abs/2005.14714}, volume = {abs/2005.14714}, year = 2020 }

using the open-set unseen all split from @article{DBLP:journals/corr/abs-2112-05125, author = {Andrei Manolache and Florin Brad and Elena Burceanu and Antonio Barbalau and Radu Tudor Ionescu and Marius Popescu}, title = {Transferring BERT-like Transformers' Knowledge for Authorship Verification}, journal = {CoRR}, volume = {abs/2112.05125}, year = {2021}, url = {https://arxiv.org/abs/2112.05125}, eprinttype = {arXiv}, eprint = {2112.05125}, timestamp = {Mon, 13 Dec 2021 17:51:48 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-2112-05125.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

@inproceedings{Kestemont2020OverviewOT, author = {Mike Kestemont and Enrique Manjavacas and Ilia Markov and Janek Bevendorff and Matti Wiegmann and Efstathios Stamatatos and Martin Potthast and Benno Stein}, editor = {Linda Cappellato and Carsten Eickhoff and Nicola Ferro and Aur{\'{e}}lie N{\'{e}}v{\'{e}}ol}, title = {Overview of the Cross-Domain Authorship Verification Task at {PAN} 2020}, booktitle = {Working Notes of {CLEF} 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020}, series = {{CEUR} Workshop Proceedings}, volume = {2696}, publisher = {CEUR-WS.org}, year = {2020}, url = {http://ceur-ws.org/Vol-2696/paper\_264.pdf}, timestamp = {Tue, 27 Oct 2020 17:12:48 +0100}, biburl = {https://dblp.org/rec/conf/clef/KestemontMMBWSP20.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

License from https://github.com/bit-ml/Dupin/tree/main: not disclosed

Datasets were cleaned, and named entities were replaced with their general type in every dataset except PAN14, PAN15, and PAN20. Each dataset was then restructured into a dataframe with columns |text1|text2|same|, where a value of 0 in same means the two texts have different authors and a value of 1 means they have the same author.
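To make the |text1|text2|same| layout concrete, here is a minimal sketch that turns author-labelled texts into pair rows. The build_pairs helper, the toy records, and the choice to pair every text with every other text are assumptions for illustration only; the actual pairing strategy is described in the paper and the dataset notebook.

```python
from itertools import combinations


def build_pairs(records):
    """Turn (author, text) records into rows with text1, text2 and same.

    same is 1 when both texts share an author and 0 otherwise.
    """
    rows = []
    for (author_a, text_a), (author_b, text_b) in combinations(records, 2):
        rows.append(
            {"text1": text_a, "text2": text_b, "same": int(author_a == author_b)}
        )
    return rows


# Toy records (made up); real texts come from the datasets listed above.
records = [
    ("alice", "The committee met on Tuesday to review the budget."),
    ("alice", "A second meeting was scheduled for the following week."),
    ("bob", "honestly the movie was way too long and kind of boring"),
]
pairs = build_pairs(records)
for row in pairs:
    print(row["same"], "|", row["text1"][:30], "|", row["text2"][:30])
```

A list of dictionaries in this shape can be passed directly to a dataframe constructor if a tabular library is available.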

All datasets were split into train/test/verification sets, keeping the original splits where they were provided (see the paper for specifics) and otherwise using a 0.7:0.15:0.15 split.
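For datasets without predefined splits, a 0.7:0.15:0.15 split could look like the sketch below. The split_rows helper, the fixed seed, and the simple random shuffle are assumptions for this example; whether the original split was random or grouped by author is not stated here, so check the paper before relying on it.

```python
import random


def split_rows(rows, ratios=(0.7, 0.15, 0.15), seed=0):
    """Shuffle rows reproducibly and cut them into train/test/verification."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_test = round(n * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    # Whatever remains goes to the third split, so no row is dropped.
    verification = shuffled[n_train + n_test:]
    return train, test, verification


# Stand-in rows; in practice each row would be a text1/text2/same record.
rows = list(range(100))
train, test, verification = split_rows(rows)
print(len(train), len(test), len(verification))
```

Assigning the remainder to the last split avoids losing rows when the dataset size is not evenly divisible by the ratios.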
