Create a pipeline that deconstructs a paper into its component parts and gives LLMs an optimized way to digest and describe what occurs within graphs and other visual representations such as charts and tables.
When gathering information from the web to build an agent, we need to scrape a lot of data. A huge part of the web is visual data, which provides a large amount of context. OCR alone is not enough: it only pulls the text out of an image and does not truly understand what is occurring inside it. How do we address this?
We now have access to LLMs that are capable of understanding images, but we need to work out the best way to feed this data to them.
By following the steps below we can create accurate descriptions of images and automate this work within our agent, saving time and resources.
- Create slices of each page image that contain everything: the text and the visual data.
- Categorize the slices into relevant groups, such as text or visual.
- Reconstruct the slices to create images that properly separate text from visual data.
- Decide between a monodirectional or bidirectional approach to feeding the surrounding context into the LLM alongside the image (see the prompting sketch after this list).
- The LLM should then be better able to describe what is occurring in the image, providing a more accurate description that can be used in RAG.
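A minimal prompting sketch, assuming the OpenAI Python SDK and a vision-capable model. The model name, prompt wording, and the reading of monodirectional as "text before only" versus bidirectional as "text before and after" are assumptions, not settled parts of the pipeline:

```python
import base64
from openai import OpenAI  # pip install openai; assumes an API key in the environment

client = OpenAI()

def describe_slice(image_path, text_before, text_after=None):
    """Ask a vision-capable model to describe a visual slice.

    Monodirectional: only the text preceding the visual is passed as context.
    Bidirectional: text from both before and after the visual is passed.
    (This interpretation of mono/bidirectional is an assumption.)
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    context = f"Text before the figure:\n{text_before}"
    if text_after is not None:  # bidirectional case
        context += f"\n\nText after the figure:\n{text_after}"

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": context + "\n\nDescribe what this chart, table, or graph shows."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

The returned description can be stored alongside the page text and indexed for RAG.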
What makes it difficult to just extract images from papers or web pages?
- Images can be uploaded in many different formats
- There is no clear way to know what is an image and what is text
- Graphs and tables that contain lots of text can be misinterpreted as plain text instead of visual data
The slice-and-classify approach below provides a much more coherent way to extract visuals from a page.
Train a classifier to label the slices as text or visual. Originally I was training a CNN from scratch, but after testing and seeing how much better it was to just fine-tune the MobileNetV2 model, I decided to change my approach.
- Slice a PDF into individual pieces which we can then feed into the classifier (see the slicing sketch after this list)
- After labeling the slices, reconstruct them into relevant groups
- Use an LLM in a mono- or bidirectional approach to provide context for the image
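A minimal sketch of the slicing and regrouping steps, assuming pdf2image (which needs poppler installed) and PIL. The slice height, DPI, and function names are illustrative assumptions, not the project's exact settings:

```python
from pdf2image import convert_from_path  # pip install pdf2image; requires poppler

def slice_pdf(pdf_path, slice_height=224, dpi=200):
    """Render each PDF page and cut it into fixed-height horizontal strips."""
    slices = []
    for page in convert_from_path(pdf_path, dpi=dpi):
        width, height = page.size
        for top in range(0, height, slice_height):
            slices.append(page.crop((0, top, width, min(top + slice_height, height))))
    return slices

def group_slices(slices, labels):
    """Merge consecutive slices that share a predicted label ("text" or "visual")."""
    groups = []
    for s, label in zip(slices, labels):
        if groups and groups[-1][0] == label:
            groups[-1][1].append(s)
        else:
            groups.append((label, [s]))
    return groups
```

Each group can then be stitched back into a single image, so visual regions stay intact instead of being split across strips.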
- Able to separate a PDF into individual slices
- [ ] Fine-tuning the MobileNetV2 model to classify slices as text or visual (currently separating the data and training the model; see the training sketch after this list)
- Reconstructing the slices into relevant groups
- ~350 slices per class currently
- 500 pure slices per class is the future goal
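A minimal fine-tuning sketch, assuming TensorFlow/Keras and labeled slices laid out as slices/text/ and slices/visual/. The directory names, image size, and hyperparameters are assumptions for illustration:

```python
import tensorflow as tf

IMG_SIZE = (224, 224)

# Assumed layout: slices/text/*.png and slices/visual/*.png
train_ds = tf.keras.utils.image_dataset_from_directory(
    "slices", validation_split=0.2, subset="training", seed=42,
    image_size=IMG_SIZE, batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "slices", validation_split=0.2, subset="validation", seed=42,
    image_size=IMG_SIZE, batch_size=32)

# Pretrained MobileNetV2 backbone with the classification head removed.
base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False  # freeze the backbone; top layers can be unfrozen later

inputs = tf.keras.Input(shape=IMG_SIZE + (3,))
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
x = base(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # text vs. visual
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```

With ~350 slices per class, freezing the backbone and only training the small head helps avoid overfitting; unfreezing the top of the backbone is an option once the dataset reaches the 500-per-class goal.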
