Find a legend competition

Repository for the "Find a legend" competition by xeek.ai (https://xeek.ai/challenges/extract-crossplot-markers).

Problem Description

Throughout the scientific community, a vast amount of information is contained within figures in papers, reports, and books. Without the raw data, this information can be lost altogether. We can increase our collective knowledge as a community if we develop a way to extract this information and convert it into a useful format for aggregation and downstream analysis.

The goal of this challenge is to extract the plot elements from the legend into a data table. Elements are listed in the order in which they appear in the legend and are separated by a space.

Example: ['Type A' 'Type B' 'Type C']
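The output format above can be sketched in a few lines of Python. This is only an illustration, not code from this repository: the helper names `format_legend` and `parse_legend` are hypothetical, and the sketch assumes that labels do not themselves contain single quotes.

```python
import re


def format_legend(labels):
    """Render labels as single-quoted entries separated by spaces, in brackets."""
    return "[" + " ".join(f"'{label}'" for label in labels) + "]"


def parse_legend(text):
    """Recover the labels from a string produced by format_legend."""
    # Each label is whatever sits between a pair of single quotes.
    return re.findall(r"'([^']*)'", text)


print(format_legend(["Type A", "Type B", "Type C"]))
```

Quoting each entry keeps labels that contain spaces, such as 'Type A', unambiguous even though the entries are themselves separated by spaces.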

Data Description

Image files containing one graph per file.
CSV file containing the image file name and legend elements. These labels are to be used to train and test the model on the associated graphs.

Solution Approach

To solve this task, the following approach is taken:

First, a YOLOv5s model is trained on generated data. The generated data and the script used to produce them are in the generated_data directory. Unlike the original data provided by the competition hosts, the labels of the generated data also contain the position of the legend box.

Training would normally start by cloning the YOLOv5 repository. The yolov5 directory of this repository, however, is already a clone of it (with files that are not needed here deleted) and can be used directly to train a YOLOv5s model on the generated data, following the YOLOv5 training tutorial. The weights of our trained model can be found [here](models_detection). Loading these weights yields a model that detects legends and their positions in a plot.
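To make the role of the legend box position concrete: YOLOv5 expects one label line per object in the form `class x_center y_center width height`, with all coordinates normalised to the image size. The sketch below converts a pixel box to that form. It assumes the box is given as (left, top, right, bottom) in pixels and that the legend is the only class (id 0); the function name `to_yolo_label` is hypothetical, and the generation script in this repository may store its boxes differently.

```python
def to_yolo_label(box, img_w, img_h, class_id=0):
    """Convert a pixel box (left, top, right, bottom) to a YOLO label line."""
    left, top, right, bottom = box
    # YOLO labels use the box centre and size, each divided by the image size.
    x_center = (left + right) / 2 / img_w
    y_center = (top + bottom) / 2 / img_h
    width = (right - left) / img_w
    height = (bottom - top) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 200x100 pixel legend centred in a 400x200 image.
print(to_yolo_label((100, 50, 300, 150), img_w=400, img_h=200))
```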

After applying the described model to the data to predict the position of the legend, the legends are cropped from the images according to the predictions of the model.
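The cropping step can be sketched as follows. The sketch assumes each prediction is available as pixel coordinates (xmin, ymin, xmax, ymax), which is how YOLOv5 reports boxes in its `xyxy` output. The helper `crop_box` and its `pad` parameter are hypothetical and not taken from this repository; the padding is only there to show how a small margin around the legend could be kept.

```python
import math


def crop_box(pred, img_w, img_h, pad=0):
    """Turn a predicted box into an integer crop region inside the image."""
    xmin, ymin, xmax, ymax = pred
    # Round outwards so the crop never cuts into the predicted box,
    # then clamp to the image bounds.
    left = max(0, math.floor(xmin) - pad)
    top = max(0, math.floor(ymin) - pad)
    right = min(img_w, math.ceil(xmax) + pad)
    bottom = min(img_h, math.ceil(ymax) + pad)
    return left, top, right, bottom


print(crop_box((10.4, 20.6, 110.2, 80.9), img_w=200, img_h=100, pad=5))
```

The returned tuple has the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so it could be passed on directly if the images are loaded with Pillow.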

Then a pretrained PyTesseract model for Optical Character Recognition (OCR) is applied to the cropped image and returns the text in the legend. Optionally, pre-processing methods provided in a separate pre-processor class can be applied before the OCR step. The OCR results are then post-processed. If more than one legend is (mistakenly) found by the YOLOv5 detection model, the OCR model is applied to all candidate legends and the results are concatenated (usually, the OCR model finds no text where there is no legend). The output of the OCR model corresponds to the text within the legend in the form shown in the graphic above.
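The post-processing and concatenation can be illustrated with a small sketch. It is not the post-processing implemented in this repository, whose details are not described here: it simply assumes that the raw OCR output is one string per cropped legend (for example, as returned by `pytesseract.image_to_string`) and that each non-empty line is one legend entry. The names `clean_ocr_text` and `merge_legends` are hypothetical.

```python
def clean_ocr_text(raw):
    """Split raw OCR output into legend entries, one per non-empty line."""
    entries = []
    for line in raw.splitlines():
        entry = " ".join(line.split())  # collapse runs of whitespace
        if entry:
            entries.append(entry)
    return entries


def merge_legends(raw_texts):
    """Concatenate the entries of every detected legend, in detection order."""
    merged = []
    for raw in raw_texts:
        merged.extend(clean_ocr_text(raw))
    return merged


# The empty string stands for a false detection in which no text was found.
print(merge_legends(["Type A\n\n  Type   B \n", "", "Type C\x0c"]))
```

Because a crop without text contributes no entries, a spurious detection leaves the concatenated result unchanged, which matches the behaviour described above.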

Repository Contents

generated_data: data generation script and the data generated by it
misc: miscellaneous files such as notebooks for test purposes, images, etc.
models_detection: legend detection models, namely the YOLOv5s model
models_ocr: OCR models, namely the PyTesseract model, and pre-processing methods
raw_data: data provided by the organisers of the competition
results: different results for the test data of the competition
runs: results of the runs of the YOLOv5 model
yolov5: cloned yolov5 repository (some files not necessary in this case deleted)
plot_legend_detection.ipynb: standalone notebook that creates a submission for the competition using the model architecture described above

REDA-solutions/PlotLegendDetectionCV
