This repository implements the LCF-IDF (Latent Concept Frequency - Inverse Document Frequency) model, designed for vector-based representation of long documents. It also provides TF-IDF and BERT-based baselines for comparison.
For more details, refer to the paper.
LCF-IDF enhances document representation by discovering latent concepts within texts. This approach improves classification performance on long documents, where traditional bag-of-words and transformer-based models often struggle.
Concept Discovery & Translation
- `--tfidf`: Enables the TF-IDF model.
- `--bert`: Enables a BERT-like model.
- `--ctfidf`: Enables the LCF-IDF model.
- `--seed`: Random seed for reproducibility.
- `--dataset`: Dataset used for evaluation.
- `--batch`: Batch size during training.
- `--epochs`: Number of training epochs.
- `--task`: Task type (`binary`, `multiclass`, `multilabel`).
- `--criterion`: Loss function (`bce` for binary cross-entropy, `ce` for cross-entropy).
- `--ngrams`: Number of n-grams.
- `--num_max_terms`: Maximum vocabulary size.
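As a rough analogue (using scikit-learn, not this repository's code), the following sketch shows what the `--ngrams`, `--num_max_terms`, and `--lowercase` options control for the TF-IDF baseline; the documents and parameter values are illustrative only:

```python
# Illustrative scikit-learn analogue of the TF-IDF baseline's options
# (not the repository's implementation).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The court ruled on the appeal.",
    "The appeal was dismissed by the court.",
]
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),  # roughly --ngrams: unigrams and bigrams
    max_features=20,     # roughly --num_max_terms: cap vocabulary size
    lowercase=True,      # roughly --lowercase
)
X = vectorizer.fit_transform(docs)
print(X.shape)  # (number of documents, vocabulary size)
```

Each document becomes one row of the resulting sparse matrix, with at most `max_features` columns.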
- `--tokenizer`: Hugging Face tokenizer.
- `--lowercase`: Apply lowercasing before tokenization.
- `--embedder`: Hugging Face embedding model.
- `--finetune`: Fine-tune the embedding model's weights, or train only the classifier.
- `--ntokens`: Maximum number of input tokens.
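For intuition on the `--ntokens` budget: one common way to handle inputs longer than a model's limit is to split the token sequence into fixed-size windows and embed each window separately. The sketch below is illustrative only; the repository may handle long inputs differently:

```python
# Illustrative only: splitting a long token sequence into windows that each
# fit within an --ntokens budget (not necessarily the repository's approach).
def split_into_windows(token_ids, max_tokens):
    """Split a token-id sequence into consecutive windows of at most max_tokens."""
    return [token_ids[i:i + max_tokens] for i in range(0, len(token_ids), max_tokens)]

token_ids = list(range(10))  # stand-in for a tokenizer's output
windows = split_into_windows(token_ids, 4)
print(windows)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```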
- `--dimreduction`: Dimensionality reduction method.
- `--nclusters`: Number of clusters for concept discovery.
- `--cluster_alg`: Clustering algorithm.
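To make the concept-discovery options concrete, here is a minimal sketch of the general idea behind LCF-IDF (not the repository's implementation): cluster chunk embeddings into latent concepts, then weight each document's concept frequencies by inverse document frequency. The embeddings, cluster count, and IDF smoothing below are illustrative assumptions:

```python
# Illustrative sketch of the LCF-IDF idea (not the repository's code):
# 1) cluster chunk embeddings into latent concepts (cf. --nclusters, --cluster_alg),
# 2) count per-document concept frequencies,
# 3) weight them by a smoothed inverse document frequency.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical chunk embeddings for 3 documents (each row: one chunk's embedding).
doc_chunks = [rng.normal(size=(5, 8)), rng.normal(size=(7, 8)), rng.normal(size=(4, 8))]

n_concepts = 3  # corresponds to --nclusters
kmeans = KMeans(n_clusters=n_concepts, n_init=10, random_state=0)
kmeans.fit(np.vstack(doc_chunks))

# Concept-frequency matrix: how often each latent concept occurs in each document.
cf = np.zeros((len(doc_chunks), n_concepts))
for i, chunks in enumerate(doc_chunks):
    for label in kmeans.predict(chunks):
        cf[i, label] += 1

# Smoothed IDF over concepts, analogous to standard TF-IDF.
df = (cf > 0).sum(axis=0)
idf = np.log((1 + len(doc_chunks)) / (1 + df)) + 1
lcf_idf = cf * idf  # one fixed-length vector per document
print(lcf_idf.shape)  # (3 documents, 3 concepts)
```

The result is a fixed-length vector per document regardless of document length, which is what makes the representation suitable for long-document classification.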
- `allenai/longformer-base-4096`
- `bert-base-uncased`
- `roberta-base`
- `nlpaueb/legal-bert-base-uncased`
- `dbmdz/bert-base-italian-uncased`
- `dbmdz/bert-base-italian-xxl-uncased`
- `dlicari/Italian-Legal-BERT-SC`
- LCF-IDF (using any of the above as the embedding backend)
- TF-IDF
- `hyperpartisan`
- `newsgroups_small`
- `ecthr_small`
- `eurlex_small`
- `a_512_small`
- `scotus_small`
```shell
# BERT-like model on newsgroups_small
venv/bin/python main.py --seed 2003 --dataset newsgroups_small --bert --tokenizer dbmdz/bert-base-italian-xxl-uncased --lowercase true --embedder allenai/longformer-base-4096 --finetune true --ntokens 4096 --task multiclass --criterion ce

# TF-IDF on hyperpartisan
venv/bin/python main.py --seed 12345 --dataset hyperpartisan --tfidf --lowercase true --task binary --criterion bce

# LCF-IDF on eurlex_small
venv/bin/python main.py --seed 1992 --dataset eurlex_small --ctfidf --tokenizer allenai/longformer-base-4096 --lowercase true --embedder allenai/longformer-base-4096 --ntokens 4096 --task multilabel-topone --criterion ce

# LCF-IDF on hyperpartisan, reusing a previously fine-tuned embedder
venv/bin/python main.py --seed 12345 --dataset hyperpartisan --ctfidf --tokenizer allenai/longformer-base-4096 --lowercase true --embedder models/12345/hyperpartisan/allenai/longformer-base-4096-lcT-ftT-nt512__b8-lr3e-05-p5-bce__Vee13/embedder --ntokens 4096 --task binary --criterion bce
```

*μ (mu) denotes "small" datasets in the results.*
For further details, please refer to the following paper:
- Principe, R. A. A., Chiarini, N., & Viviani, M. (2024, May). An LCF-IDF Document Representation Model Applied to Long Document Classification. In *Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)* (pp. 1129–1135).


