This repository contains a divide-and-conquer LangChain script for summarizing large text corpora with LLMs.
The script iterates through a directory of PDFs and feeds them to a GPT model together with a query objective that you define.
The resulting excerpts are buffered in a dictionary and finally combined into a summary.
The summary is saved as a separate file in the `responses` directory.
All requirements are saved in the `venv.yml` file.
You can install the Python environment by typing:

`mamba env create -f venv.yml`

The main dependency is the `langchain` library.
Warning
This script only works if you have acquired API tokens from OpenAI. Every run of the script consumes tokens from your project's OpenAI account.
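Because every run costs tokens, it can help to fail fast when no key is configured. A minimal sketch, assuming the key is read from the standard `OPENAI_API_KEY` environment variable (adapt if your setup stores it elsewhere):

```python
import os

def require_openai_key():
    """Abort early if no OpenAI API key is configured, so the script
    never starts a run that would fail (or bill) halfway through."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Export your OpenAI API key "
            "before running the script; every run consumes tokens."
        )
    return key
```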
- Install the mamba environment.
- Save the documents of your corpus as individual files in the `data` directory.
You can run the script either from a jupyter notebook or from the command line.
When running the script from the command line:
- Open your terminal and navigate to the `code` directory.
- Activate your mamba environment and type:
  `python main.py --query '<YOUR-QUERY>'`
- You can find your answer as a separate file in the `responses` directory. The file is named after your query objective.
- If you need help, type:
  `python main.py --help`

When running the script from the Jupyter notebook:
- Open your terminal and activate your mamba environment.
- Start your Jupyter server and adapt the information in the cells.
- Adapt your queries in the query cells.
- Run your pipeline cell.
- You will get the answer as output below your cell. Additionally, your answer is saved as a separate file in the `responses` directory. The file is named after your query objective.
The code is licensed under the MIT license. Refer to the LICENSE file for more information.