doc-sum

This repository contains a divide-and-conquer LangChain script for summarizing large text corpora with LLMs. The script iterates over a directory of PDFs and feeds each one, together with a query objective that you define, into a GPT model. The resulting excerpts are buffered in a dictionary and finally condensed into a single response summary, which is saved as a separate file in the responses directory.
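The divide-and-conquer flow described above can be sketched in plain Python. Everything below is an illustrative assumption (the function names, the directory arguments, and the stubbed summarize_chunk standing in for the real GPT call), not the repository's actual code:

```python
from pathlib import Path

def summarize_chunk(text: str, query: str) -> str:
    # Stub standing in for the real LLM call: the actual script sends
    # the text plus the query objective to a GPT model via LangChain.
    return text.strip()[:200]

def run_pipeline(query: str, data_dir: Path, responses_dir: Path) -> str:
    # "Divide": summarize each PDF in the corpus individually,
    # buffering the excerpts in a dictionary keyed by filename.
    excerpts = {}
    for pdf in sorted(data_dir.glob("*.pdf")):
        excerpts[pdf.name] = summarize_chunk(
            pdf.read_text(errors="ignore"), query
        )
    # "Conquer": condense the buffered excerpts into one summary.
    summary = summarize_chunk("\n".join(excerpts.values()), query)
    # Save the response in a file named after the query objective.
    responses_dir.mkdir(exist_ok=True)
    (responses_dir / f"{query}.txt").write_text(summary)
    return summary
```

In the real script the stub would be replaced by LangChain calls to the OpenAI API, but the buffering and merge structure is the same.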

Requirements

All requirements are listed in the venv.yml file. You can install the Python environment by typing:

mamba env create -f venv.yml 

The main dependency is the LangChain library.

Usage

Warning

This script only works if you have obtained an API key and tokens from OpenAI. Every run of the script consumes tokens from your project's OpenAI account.

  1. Install the mamba environment.
  2. Save the documents of your corpus as individual PDF files in the data directory.

You can run the script either from a jupyter notebook or from the command line.

When running the script from the command line:

  1. Open your terminal and navigate to the code directory.
  2. Activate your mamba environment and type:
python main.py --query '<YOUR-QUERY>'
  3. You can find your answer as a separate file in the responses directory. The file is named after your query objective.
  4. If you need help, type:
python main.py --help
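The --query flag above is standard CLI argument handling; a hypothetical sketch of how main.py might parse it with argparse (option names and defaults are assumptions, not the repository's actual code) looks like this:

```python
import argparse

# Hypothetical sketch of the CLI described above; the real main.py
# may use different option names and defaults.
parser = argparse.ArgumentParser(
    description="Summarize a corpus of PDFs with an LLM."
)
parser.add_argument("--query", required=True,
                    help="Query objective used to guide the summaries.")

# Example invocation, equivalent to: python main.py --query '<YOUR-QUERY>'
args = parser.parse_args(["--query", "key safety requirements"])
print(args.query)
```

Calling the script without --query would then exit with the usage message, which matches the required-argument behavior described above.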

When running the script from the Jupyter notebook:

  1. Open your terminal and activate your mamba environment.
  2. Start your Jupyter server and adapt the information in the configuration cells.
  3. Adapt your queries in the query cells.
  4. Run your pipeline cell.
  5. You will get the answer as output below the cell. Additionally, your answer is saved as a separate file in the responses directory. The file is named after your query objective.

License

The code is licensed under the MIT license. Refer to the LICENSE file for more information.
