Publication Date Extractor

University project by Hanna Brinkmann, Zofia Milczarek, Alexandre Nechab, Joanna Radola

The aim of this project is to propose a pipeline that determines the publication date of a given document.

A full description of the project is available in the extract_publication_date_documentation.pdf file.

Approach

We use a Llama 3.2-Instruct 8B model with 4-bit quantization, fine-tuned on the date extraction task, to determine the publication date. The model output consistently has the following format:

{"publication date": "DD/MM/YYYY"}
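Downstream code can turn this output into a date object. The following is a minimal sketch, assuming the model returns exactly one JSON object with the key shown above; the function name `parse_model_output` is hypothetical and not part of this repository:

```python
import json
from datetime import date, datetime
from typing import Optional


def parse_model_output(raw: str) -> Optional[date]:
    """Parse '{"publication date": "DD/MM/YYYY"}' into a date.

    Returns None when the string is not valid JSON, the key is missing,
    or the value is not a real calendar date in DD/MM/YYYY form.
    """
    try:
        payload = json.loads(raw)
        return datetime.strptime(payload["publication date"], "%d/%m/%Y").date()
    except (ValueError, KeyError, TypeError):
        # ValueError covers both malformed JSON and impossible dates.
        return None
```

Returning None instead of raising keeps a single malformed prediction from aborting a batch run; such cases can simply be counted as misses.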

On our test set, we computed accuracy at three levels of granularity: exact matches (day, month and year), month-and-year matches, and year-only matches:

Match	Accuracy
Day, month and year	71%
Month and year	82%
Year	93%
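The three match levels can be sketched as follows. This is an illustration and not the evaluation code from this repository: the function name `match_accuracies` and the result keys are hypothetical, and it assumes that a prediction which could not be parsed is represented as None and counted as a miss.

```python
from datetime import date
from typing import Dict, List, Optional


def match_accuracies(predicted: List[Optional[date]], gold: List[date]) -> Dict[str, float]:
    """Compute accuracy for exact, month-and-year, and year-only matches."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")

    exact = month_year = year = 0
    for pred, ref in zip(predicted, gold):
        if pred is None:
            continue  # unparseable prediction counts as a miss at every level
        if pred.year == ref.year:
            year += 1
            if pred.month == ref.month:
                month_year += 1
                if pred.day == ref.day:
                    exact += 1

    n = len(gold)
    return {
        "day_month_year": exact / n,
        "month_year": month_year / n,
        "year": year / n,
    }
```

Because each level is nested in the next (an exact match is also a month-and-year match and a year match), the accuracies can only increase from the strictest level to the loosest, as in the table above.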

Before settling on this approach, we experimented with several models that were not fine-tuned. The code for these experiments is still available in this repository.

Data

Our corpus consisted of 500 official documents in French, issued by cities, municipalities and courts. For each document, we manually annotated the publication date to obtain a gold standard. As not all documents were accessible, we excluded those with an invalid URL. We then performed a 70/30 split to obtain a train set and a test set. The model was fine-tuned on the first and last 3000 characters of each document in the train set.
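The truncation and the split can be sketched as below. This is not the code from preprocessing_finetune.py: the function names, the newline used to join the two excerpts, the handling of short documents, and the fixed random seed are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def head_and_tail(text: str, length: int = 3000) -> str:
    """Keep only the first and last `length` characters of a document."""
    if len(text) <= 2 * length:
        # Assumption: short documents are kept whole so the two
        # excerpts never overlap.
        return text
    # Assumption: the two excerpts are joined with a newline.
    return text[:length] + "\n" + text[-length:]


def split_train_test(
    items: Sequence[str], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the items reproducibly and split them into train and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Publication dates in official documents tend to appear in the header or near the signature block, which is presumably why both ends of the text are kept rather than only the beginning.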

How to use

First, clone this repository and create an environment using the method of your choice (venv, pipenv, conda).

Then, install the necessary libraries:

$ pip install -r requirements.txt

To use the pipeline with our data, the data must first be converted to the right format by running the following command. Only the input and output paths are mandatory. The output type can be either csv or pickle (pkl).

$ python preprocessing_finetune.py \
  --input-path <original_dataset_path> \
  --output-path <path_for_finetuned_model> \
  --out-type csv \
  --length 3000 \
  --no-split

Once the data is in the right format, run the inference script:

$ python finetuned_llama_inference.py \
  --input-path <dataset_path> \
  --output-path <path_for_results> \
  --model-checkpoint <model_checkpoint> \
  --gold-labels <column_name_of_gold_labels>

The predicted dates are added as a new column to the input dataframe. If the --gold-labels flag is passed, the script also evaluates the predictions and reports accuracy for exact matches, month-and-year matches, and year-only matches.

Alexanthos/publication-date-extractor
