Website2llm

Maps all the publicly available pages on a webserver that link to each other, starting from a given URL, then scrapes their contents into a database so the user can retrieve information about the webpages through an embedding model and an LLM.

Note: The prompt template used for feeding the LLM is currently in German. If you want to change it or add additional instructions, simply edit main.py and change the template function in the main section.
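For illustration only, such a template function could look roughly like the sketch below; the actual function name, signature, and wording in main.py may differ, so treat this purely as a hypothetical example.

def template(question: str, context: str) -> str:
    # Hypothetical sketch of a prompt template: combines the retrieved DB entries
    # (context) with the user's question. Edit the instruction text to change the
    # language or add extra rules for the answer model.
    return (
        "Answer the question using only the information in the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )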

Workflow

(Workflow diagram: Process)

Prerequisites

  • Python3
  • pip
  • git

Installation

Clone the repo with this command:

git clone https://github.com/Logogistiks/website2llm

Then cd into the folder:

cd website2llm

Then just run setup.py. As shown in the diagram above, it creates a virtual environment, installs all requirements into it, and creates a template config file.
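For example, assuming python3 is available on your PATH:

python3 setup.py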

Configuration

Edit config.cfg. Things you can specify in the default category:

  • website: Link to a page on the webserver you want to scrape.
  • embeddingmodel: The embedding model used for converting sentences to vectors. Available models: https://www.sbert.net/docs/pretrained_models.html#model-overview
  • answermodel: The model used for answering the user's question. Available models: https://observablehq.com/@simonw/gpt4all-models
  • singlestore: Whether to store every <p> tag as its own DB entry, or a whole page per entry (yes / no).
  • similarnum: The number of DB entries used to generate the answer.
  • timestamp: Whether to print a timestamp before and after interacting with the language model (yes / no).

(Optional)
Entering something into the ignoreendings category ignores sites with the specified endings while scraping, e.g. sites that are password protected or that you simply don't want in your database.

(Optional)
Entering something into the ignoreimpure category ignores sites whose link contains the specified string, e.g. for subdirectories.
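To give an idea of the layout, a filled-in config.cfg could look roughly like this. The key names match the parameters described above; the example URL and model names are placeholders (check the linked model lists), and the exact format of the optional categories follows the template generated by setup.py:

[default]
website = https://example.com/index.html
embeddingmodel = all-MiniLM-L6-v2
answermodel = orca-mini-3b-gguf2-q4_0
singlestore = yes
similarnum = 5
timestamp = yes

[ignoreendings]

[ignoreimpure]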

Usage

Please make sure to edit the config file first, as described above.

Updating

To update the database, run update.py with the python interpreter from the venv (on Windows: .venv\Scripts\python.exe ; on Linux: .venv/bin/python3). Updating is required if the website you scraped last was updated or if you want to scrape a new one. For the latter, make sure to update the config file.
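For example:

On Windows: .venv\Scripts\python.exe update.py
On Linux: .venv/bin/python3 update.py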

The process is verbose by default. If you don't want any output (not recommended), you can edit update.py and add the verbose=False parameter to the updateData() and updateDB() function calls in the main section.
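Based on the function names above, the silenced calls in the main section of update.py would look roughly like this (any other arguments those calls take stay as they are):

updateData(verbose=False)
updateDB(verbose=False)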

The creation of the embedding vectors can take some time, especially if the webserver contains a lot of sites, so please be patient.

Information retrieval

Simply run main.py using the interpreter specified above, type in your question about the website, and wait for the response. Please note that this process can take a lot of time; during my testing, up to a full minute.
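For example, on Linux:

.venv/bin/python3 main.py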

Disclaimer

This project is very experimental; it is more of a concept than a serious tool. I just wanted to see if something like this is possible. The answer is yes, but very, very slowly.
I admit that the code of this project is garbage, but it was thrown together in about 3 weeks.
I don't know if or how much of it will be improved in the future, but I plan to somehow make it conversation-friendly, so you can chat with it like a real chat model.
Please be patient; the embedding and answer-generation processes may take a lot of time.

Warning: During my testing, the traffic was relatively low, with a peak of ~1 KB per second.
Still, please inform yourself about the rules concerning web-scraping tools on the target websites; I'm not responsible for your actions.

References

LLM Python API: https://llm.datasette.io/en/stable/

Embedding model: https://github.com/simonw/llm-sentence-transformers

Answer model: https://github.com/simonw/llm-gpt4all

Graph software for the README image: https://www.yworks.com/products/yed

Markdown table generator: https://www.tablesgenerator.com/markdown_tables
