Small agentic app that extracts information from documents (PDFs only so far), saves a few fields to a CSV file, and then emails the file together with the extracted details.
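A rough sketch of the last two steps of that pipeline (the extraction itself is done by the agents; every name below is illustrative, not the repo's actual API):

```python
# Sketch of the CSV and email steps described above. All names are illustrative;
# the real field extraction is done by the agents elsewhere in the repo.
import csv
import os
import smtplib
from email.message import EmailMessage


def append_fields_to_csv(fields: dict, path: str = "csv/output.csv") -> None:
    """Append one row of extracted fields, writing a header for a new file."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fields))
        if new_file:
            writer.writeheader()
        writer.writerow(fields)


def email_csv(path: str, summary: str, to: str) -> None:
    """Send the CSV as an attachment together with a short summary."""
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = "Extracted fields", "agent@example.com", to
    msg.set_content(summary)
    with open(path, "rb") as f:
        msg.add_attachment(f.read(), maintype="text", subtype="csv",
                           filename=os.path.basename(path))
    with smtplib.SMTP("localhost") as smtp:  # assumed local SMTP relay
        smtp.send_message(msg)
```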
```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
- GPT OSS with Ollama
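A minimal sketch of wiring the GPT OSS model served by Ollama into LangChain; the model tag and temperature are assumptions, not taken from this repo:

```python
# Sketch: point LangChain at a local GPT OSS model served by Ollama.
# The model tag "gpt-oss:20b" and the temperature are assumed values.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="gpt-oss:20b", temperature=0)

# Quick smoke test that the local model responds.
print(llm.invoke("Reply with the single word: ready").content)
```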
```
python init/stores.py
```
Note:
- The folder `documents` must exist and contain PDF files.
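A rough sketch of what this initialization step could look like; the pgvector connection string, collection name, and embedding model below are assumptions, not the repo's actual configuration:

```python
# Sketch: load every PDF under documents/ and index it in a pgvector-backed store.
# Connection string, collection name, and embedding model are assumed values.
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_postgres import PGVector

embeddings = OllamaEmbeddings(model="nomic-embed-text")  # assumed embedding model
store = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection="postgresql+psycopg://postgres:postgres@localhost:5432/postgres",
)

for pdf in Path("documents").glob("*.pdf"):
    store.add_documents(PyPDFLoader(str(pdf)).load())  # one Document per page
```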
```
python init/documents.py
```
Note:
- The folder `csv` must exist. The files generated by the agents will be saved there.
- Docker must be running, since Redis is used for caching.
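The Redis dependency mentioned above can serve as a LangChain LLM cache; a minimal sketch, assuming Redis is running locally via Docker (host and port are assumptions):

```python
# Sketch: cache LLM responses in the Redis instance running in Docker
# (e.g. `docker run -p 6379:6379 redis`). Host and port are assumed values.
import redis
from langchain.globals import set_llm_cache
from langchain_community.cache import RedisCache

client = redis.Redis(host="localhost", port=6379)
set_llm_cache(RedisCache(redis_=client))  # identical prompts are now served from Redis
```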
```
python main.py
```
- Added a couple of tests based on [2] (see the sketch after the references below).
References:
1. https://docs.langchain.com/oss/python/langchain/test
2. https://docs.langchain.com/langsmith/test-react-agent-pytest
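A minimal sketch of what one of these LangSmith-instrumented tests could look like, loosely following [2]; the `extract_fields` import and the sample PDF path are assumptions about this repo, not its actual API:

```python
# tests/test_agent.py: sketch of a LangSmith-instrumented pytest test.
# `extract_fields` and the sample PDF path are assumed names, not the repo's real API.
import pytest
from langsmith import testing as t


@pytest.mark.langsmith  # picked up when running `pytest --langsmith-output tests`
def test_extracts_some_fields():
    from main import extract_fields  # assumed entry point in this repo

    fields = extract_fields("documents/sample.pdf")
    t.log_inputs({"pdf": "documents/sample.pdf"})
    t.log_outputs(fields)
    assert fields  # at least one field was extracted
```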
```
pytest --langsmith-output tests
```
- Added Redis cache.
- Investigate caching further (e.g., is a dedicated KV cache needed, or is Redis enough?).
Check the following error raised when initializing the DeepSeek PDF:
```
sqlalchemy.exc.DataError: (psycopg.DataError) PostgreSQL text fields cannot contain NUL (0x00) bytes
[SQL: INSERT INTO "public"."documents" ("langchain_id", "content", "embedding", "langchain_metadata") VALUES (%(langchain_id)s, %(content)s, %(embedding)s, %(extra)s) ON CONFLICT ("langchain_id") DO UPDATE SET "content" = EXCLUDED."content", "embedding" = EXCLUDED."embedding", "langchain_metadata" = EXCLUDED."langchain_metadata";]
[parameters: {'langchain_id': '50e5e4f4-d8d0-4f47-bf4d-9d73200fa5f9', 'content': 'mechanismretrieves only the key-value entries {c𝑠}corresponding to the top-k index scores.\nThen, the attention outputu 𝑡 is ...
```
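A common workaround for this kind of error (a sketch, not something the repo does yet) is to strip NUL bytes from the extracted page content before it is written to Postgres, since PostgreSQL `text` columns reject `0x00`:

```python
# Sketch of a possible fix, assuming the PDF pages are loaded with a LangChain
# loader and written with add_documents(): strip NUL (0x00) bytes, which some
# PDF text extractions produce, before inserting into Postgres.
def strip_nul_bytes(docs):
    for doc in docs:
        doc.page_content = doc.page_content.replace("\x00", "")
    return docs

# e.g. store.add_documents(strip_nul_bytes(loader.load()))
```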