Website2llm

Maps all the publicly available pages on a webserver that link to each other, starting from a given URL, then scrapes their contents into a database so the user can retrieve information about the webpages through an embedding model and an LLM.

Note: The prompt template used for feeding the LLM is currently in German. If you want to change it or add additional instructions, simply edit main.py and change the template function in the main section.
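For illustration only, such a template function could look roughly like the sketch below; the actual function name, signature, and wording in main.py may differ, so treat this purely as a hypothetical example.

def template(question: str, context: str) -> str:
    # Hypothetical sketch of a prompt template: combines the retrieved DB entries
    # (context) with the user's question. Edit the instruction text to change the
    # language or add extra rules for the answer model.
    return (
        "Answer the question using only the information in the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )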

Workflow

(Workflow diagram: Process)

Prerequisites

  • Python3
  • pip
  • git

Installation

Clone the repo with this command:

git clone https://github.com/Logogistiks/website2llm

Then cd into the folder:

cd website2llm

Then just run setup.py. As shown in the diagram above, it creates a virtual environment, installs all requirements into it, and creates a template config file.
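For example, assuming python3 is available on your PATH:

python3 setup.py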

Configuration

Edit config.cfg. Things you can specify in the default category:

  • website: Link to a page on the webserver you want to scrape.
  • embeddingmodel: The embedding model used for converting sentences to vectors. Available models: https://www.sbert.net/docs/pretrained_models.html#model-overview
  • answermodel: The model used for answering the user's question. Available models: https://observablehq.com/@simonw/gpt4all-models
  • singlestore: Whether to store every <p> tag as its own DB entry, or a whole page per entry (yes / no).
  • similarnum: The number of DB entries used to generate the answer.
  • timestamp: Whether to print a timestamp before and after interacting with the language model (yes / no).

(Optional)
Entering something into the ignoreendings category ignores sites with the specified endings while scraping, e.g. sites that are password protected or that you simply don't want in your database.

(Optional)
Entering something into the ignoreimpure category ignores sites whose link contains the specified string, e.g. for subdirectories.
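To give an idea of the layout, a filled-in config.cfg could look roughly like this. The key names match the parameters described above; the example URL and model names are placeholders (check the linked model lists), and the exact format of the optional categories follows the template generated by setup.py:

[default]
website = https://example.com/index.html
embeddingmodel = all-MiniLM-L6-v2
answermodel = orca-mini-3b-gguf2-q4_0
singlestore = yes
similarnum = 5
timestamp = yes

[ignoreendings]

[ignoreimpure]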

Usage

Please make sure to edit the config file first, as described above.

Updating

To update the database, run update.py with the python interpreter from the venv (on Windows: .venv\Scripts\python.exe ; on Linux: .venv/bin/python3). Updating is required if the website you scraped last was updated or if you want to scrape a new one. For the latter, make sure to update the config file.
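For example:

On Windows: .venv\Scripts\python.exe update.py
On Linux: .venv/bin/python3 update.py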

The process is verbose by default. If you don't want any output (not recommended), you can edit update.py and add the verbose=False parameter to the updateData() and updateDB() function calls in the main section.
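Based on the function names above, the silenced calls in the main section of update.py would look roughly like this (any other arguments those calls take stay as they are):

updateData(verbose=False)
updateDB(verbose=False)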

The creation of the embedding vectors can take some time, especially if the webserver contains a lot of sites, so please be patient.

Information retrieval

Simply run main.py using the interpreter specified above, type in your question about the website, and wait for the response. Please note that this process can take a lot of time; during my testing, up to a full minute.
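For example, on Linux:

.venv/bin/python3 main.py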

Disclaimer

This project is very experimental; it is more of a concept than a serious tool. I just wanted to see if something like this is possible. The answer is yes, but very, very slowly.
I admit that the code of this project is garbage, but it was thrown together in about 3 weeks.
I don't know if or how much of it will be improved in the future, but I plan to somehow make it conversation-friendly, so you can chat with it like a real chat model.
Please be patient; the embedding and answer-generation processes may take a lot of time.

Warning: During my testing, the traffic was relatively low, with a peak of ~1 KB per second.
Still, please inform yourself about the rules concerning web-scraping tools on the target websites; I'm not responsible for your actions.

References

LLM Python API: https://llm.datasette.io/en/stable/

Embedding model: https://github.com/simonw/llm-sentence-transformers

Answer model: https://github.com/simonw/llm-gpt4all

Graph software for the README image: https://www.yworks.com/products/yed

Markdown table generator: https://www.tablesgenerator.com/markdown_tables
