This repository contains the implementation of a PyTorch-based text2text model for normalizing orthographic variations in medieval Latin texts. The model is trained on the Normalized Georges 1913 Dataset and uses Hugging Face for easy model and vocabulary management.

michaelscho/georges-1913

Medieval Latin Normalization Model based on Georges 1913

This repository contains the implementation of a PyTorch-based text2text model with attention for normalizing orthographic variations in medieval Latin texts.

The model is trained on the Normalized Georges 1913 Dataset and leverages Hugging Face's ecosystem for easy model and vocabulary management.

Contents

  • train_model.py: Script for training the normalization model.
    • Includes dynamic loading of the dataset and vocabulary.
    • Trains a Seq2Seq model with an attention mechanism and saves the model and vocabulary for later use.
  • test_model.py: Script for testing the normalization model.
    • Loads the trained model, vocabulary, and configuration from a Hugging Face repository.
    • Normalizes test words from an input file (test_normalisation.txt).
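Both scripts depend on a vocabulary that maps the symbols of a word to integer ids before they reach the model. The format of this repository's vocabulary is not documented here, so the following is only a minimal sketch that assumes a character-level vocabulary. The special tokens, the function names `build_vocab`, `encode`, and `decode`, and the sample spellings are all hypothetical illustrations, not the repository's API.

```python
# Hypothetical character-level vocabulary; the repository's real
# vocab.pkl may use a different layout and different special tokens.
PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"


def build_vocab(words):
    """Map the special tokens and every character seen in `words` to ids."""
    chars = sorted({ch for word in words for ch in word})
    return {tok: i for i, tok in enumerate([PAD, SOS, EOS, UNK] + chars)}


def encode(word, vocab):
    """Turn a word into ids, framed by start and end markers."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in word]
    return [vocab[SOS]] + ids + [vocab[EOS]]


def decode(ids, vocab):
    """Turn ids back into a string, dropping padding and framing markers."""
    inverse = {i: tok for tok, i in vocab.items()}
    skipped = {PAD, SOS, EOS}
    return "".join(inverse[i] for i in ids if inverse[i] not in skipped)


# Illustrative spelling pairs (medieval variant, normalized form).
vocab = build_vocab(["aecclesia", "ecclesia", "gracia", "gratia"])
ids = encode("gracia", vocab)
print(ids)
print(decode(ids, vocab))
```

A sequence-to-sequence model would consume such id lists as input and emit id lists for the normalized spelling, which are then decoded back into text.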

Usage

  1. Train the Model:

    • Modify train_model.py as needed for your dataset.
    • Run:
      python train_model.py
    • Saves:
      • Model: normalization_model.pth
      • Vocabulary: vocab.pkl
      • Config: config.json
  2. Test the Model:

    • Put the words to normalize in test_normalisation.txt.
    • Run:
      python test_model.py
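The vocabulary and config files saved in step 1 can be written and read back with the Python standard library. The sketch below assumes that vocab.pkl holds a pickled dictionary and that config.json holds a flat JSON object. The helper names `save_artifacts` and `load_artifacts` and the config keys are hypothetical, and the model weights in normalization_model.pth are left out because they need PyTorch.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_artifacts(out_dir, vocab, config):
    """Write vocab.pkl and config.json into out_dir (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "vocab.pkl", "wb") as fh:
        pickle.dump(vocab, fh)
    with open(out_dir / "config.json", "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)


def load_artifacts(out_dir):
    """Read vocab.pkl and config.json back from out_dir."""
    out_dir = Path(out_dir)
    with open(out_dir / "vocab.pkl", "rb") as fh:
        vocab = pickle.load(fh)
    with open(out_dir / "config.json", encoding="utf-8") as fh:
        config = json.load(fh)
    return vocab, config


# Assumed example values; the real files may contain other fields.
vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "a": 3, "e": 4}
config = {"hidden_dim": 256, "vocab_size": len(vocab)}

with tempfile.TemporaryDirectory() as tmp:
    save_artifacts(tmp, vocab, config)
    loaded_vocab, loaded_config = load_artifacts(tmp)

print(loaded_vocab == vocab, loaded_config == config)
```

Pickle files can execute arbitrary code when loaded, so vocab.pkl should only be loaded from a source you trust.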

Acknowledgments

Dataset and model were created by Michael Schonhardt (https://orcid.org/0000-0002-2750-1900) for the project Burchards Dekret Digital.

Creation was made possible thanks to the lemmata from Georges 1913, kindly provided via www.zeno.org by 'Henricus - Edition Deutsche Klassik GmbH'. Please consider using and supporting this valuable service.

License

CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/legalcode.en)
