Home
Welcome to the POCRE wiki!
POCRE is a system for correcting text that is the output of an OCR system. OCR systems are imperfect; depending on how they're trained and the quality of the images being processed, they can make errors interpreting characters from pixels. These errors differ from human errors because they are based on visual characteristics rather than meaning or sound. The purpose of this system is to correct likely machine-introduced errors in OCR-produced text and indicate those corrections so the user can review them. A human editor will still be needed to get a completely correct text; the goal of this system is just to make that editor's job easier.
The system is based on a sequence-to-sequence neural network built in TensorFlow (https://www.tensorflow.org/). We provide users with a pretrained model that can be imported for testing on new data. Please note, however, that the system will perform best when it has been trained on data similar to the data it will be used on, in characteristics such as the approximate number of errors, the language of the text, and the general content and lexicon. The data used to train our pretrained model, which consists of typewritten historical English text focusing on Egyptian archaeology, is provided so that users can decide whether the pretrained model is suitable for their needs. We used three different OCR systems to produce our training data: LIOS, ABBYY, and Microsoft's Azure. System performance may also be affected by which OCR system produced the input data.
Preliminary evaluation results show that the system in its current state either makes no change to character and word error rates or makes them slightly worse!
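For readers unfamiliar with these metrics, here is a minimal sketch of how character and word error rates are commonly computed: the Levenshtein edit distance between the reference text and the system output, divided by the length of the reference. This is an illustration only and is not POCRE's evaluation code; the function names and the sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical sample data, not taken from the POCRE training set.
reference = "the tomb at Thebes"
ocr_output = "tbe tomb at Thehes"   # two visually plausible OCR confusions
corrected = "the tomb at Thehes"    # one error fixed, one remaining

if __name__ == "__main__":
    for label, text in [("OCR output", ocr_output), ("corrected", corrected)]:
        print(label,
              "CER:", round(char_error_rate(reference, text), 3),
              "WER:", round(word_error_rate(reference, text), 3))
```

A correction system is helping when both rates are lower for its output than for the raw OCR text, as they are for the hypothetical `corrected` string here.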