doc-sum

This repository contains a divide-and-conquer LangChain script for summarizing large text corpora with LLMs. The script iterates over a directory of PDFs and feeds each one, together with a query objective that you define, into a GPT model. The resulting excerpts are buffered in a dictionary and finally condensed into a single response summary, which is saved as a separate file in the responses directory.
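The divide-and-conquer flow described above can be sketched in plain Python. Everything below is an illustrative assumption (the function names, the directory arguments, and the stubbed summarize_chunk standing in for the real GPT call), not the repository's actual code:

```python
from pathlib import Path

def summarize_chunk(text: str, query: str) -> str:
    # Stub standing in for the real LLM call: the actual script sends
    # the text plus the query objective to a GPT model via LangChain.
    return text.strip()[:200]

def run_pipeline(query: str, data_dir: Path, responses_dir: Path) -> str:
    # "Divide": summarize each PDF in the corpus individually,
    # buffering the excerpts in a dictionary keyed by filename.
    excerpts = {}
    for pdf in sorted(data_dir.glob("*.pdf")):
        excerpts[pdf.name] = summarize_chunk(
            pdf.read_text(errors="ignore"), query
        )
    # "Conquer": condense the buffered excerpts into one summary.
    summary = summarize_chunk("\n".join(excerpts.values()), query)
    # Save the response in a file named after the query objective.
    responses_dir.mkdir(exist_ok=True)
    (responses_dir / f"{query}.txt").write_text(summary)
    return summary
```

In the real script the stub would be replaced by LangChain calls to the OpenAI API, but the buffering and merge structure is the same.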

Requirements

All requirements are listed in the venv.yml file. You can install the Python environment by typing:

mamba env create -f venv.yml 

The main dependency is the LangChain library.

Usage

Warning

This script only works if you have obtained an API key and tokens from OpenAI. Every run of the script consumes tokens from your project's OpenAI account.

  1. Install the mamba environment.
  2. Save the documents of your corpus as individual PDF files in the data directory.

You can run the script either from a jupyter notebook or from the command line.

When running the script from the command line:

  1. Open your terminal and navigate to the code directory.
  2. Activate your mamba environment and type:
python main.py --query '<YOUR-QUERY>'
  3. You can find your answer as a separate file in the responses directory. The file is named after your query objective.
  4. If you need help, type:
python main.py --help
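The --query flag above is standard CLI argument handling; a hypothetical sketch of how main.py might parse it with argparse (option names and defaults are assumptions, not the repository's actual code) looks like this:

```python
import argparse

# Hypothetical sketch of the CLI described above; the real main.py
# may use different option names and defaults.
parser = argparse.ArgumentParser(
    description="Summarize a corpus of PDFs with an LLM."
)
parser.add_argument("--query", required=True,
                    help="Query objective used to guide the summaries.")

# Example invocation, equivalent to: python main.py --query '<YOUR-QUERY>'
args = parser.parse_args(["--query", "key safety requirements"])
print(args.query)
```

Calling the script without --query would then exit with the usage message, which matches the required-argument behavior described above.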

When running the script from the Jupyter notebook:

  1. Open your terminal and activate your mamba environment.
  2. Start your Jupyter server and adapt the information in the configuration cells.
  3. Adapt your queries in the query cells.
  4. Run your pipeline cell.
  5. You will get the answer as output below the cell. Additionally, your answer is saved as a separate file in the responses directory. The file is named after your query objective.

License

The code is licensed under the MIT license. Refer to the LICENSE file for more information.
