ISEBELLE

Intelligent Search Engine for Belief Legend Embeddings

Installation

  1. Clone this repository: git clone https://github.com/broadwell/isebelle.git and cd isebelle
  2. Install and run Docker
  3. Install the just runner for your system
  4. Create an appropriate .env file in the isebelle main folder:
cat > .env <<EOL
DB_NAME=isebelle
DB_USER=isebelle
DB_PASSWORD=$(LC_ALL=C tr -cd 'a-zA-Z0-9' < /dev/urandom | fold -w24 | head -n 1)
# DB_HOST=localhost    ## Ignored if using docker-compose.yaml
# DB_PORT=5432         ## Ignored if using docker-compose.yaml

STORIES_SRC_FOLDER=/path/to/source/collections

JUPYTER_PASSWORD=secret_password

LOG_LEVEL=INFO       ## Set to DEBUG for additional logging
EOL
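Before starting the containers, it can help to confirm that the required variables made it into the file. The following sketch is optional and not part of the official setup; the variable names are taken from the .env template above.

```shell
# Optional sanity check: verify that the required variables appear
# in a given .env file (comment/placeholder lines don't count).
check_env() {
  local missing=0
  for var in DB_NAME DB_USER DB_PASSWORD STORIES_SRC_FOLDER; do
    grep -q "^${var}=" "$1" || { echo "Missing ${var} in $1"; missing=1; }
  done
  return $missing
}
```

Run it as check_env .env from the isebelle main folder; a nonzero exit status means at least one variable is absent.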

Building/running the app and adding new story collections

  1. Run docker compose up or just up
  2. If you'll be working with Icelandic, Frisian, or Low German texts, be sure to run just build-[icelandic|frisian|low-german]-dictionary once for each of those languages.
  3. Each collection should be a sub-folder of the main data folder identified by STORIES_SRC_FOLDER in .env. The stories themselves should be individual .xml files, with each filename providing the story ID, in the following folder structure: [collection_folder]/records/isebel/. Use underscores (_) rather than spaces in file and folder names. Ideally, each collection should contain stories in a single language. The name of the collection folder is supplied as the [COLLECTION_NAME] argument to the scripts below.
  4. It can be faster to generate the story sentence embeddings outside of the Docker containers, by running
    python api/create_collection_embeddings.py --collection-path [PATH_TO_COLLECTION_FOLDER]
    but in this case you will need to install the script's dependencies in your local environment.
  5. If the story embeddings have been generated as above, you should load the collection's story texts first by running
    just add-collection-xml [COLLECTION_NAME] [ORGANIZATION] [COUNTRY] [SEARCH_LANGUAGE] [DISPLAY_LANGUAGE],
    then just add-embeddings [COLLECTION_NAME]/[EMBEDDINGS_FILENAME]
    For example,
    ~/isebelle$ just add-collection-xml Evald_Tang_Kristensen "UC Berkeley" Denmark Danish Dansk
    then
    ~/isebelle$ just add-embeddings Evald_Tang_Kristensen/Evald_Tang_Kristensen.embeddings.gte-Qwen2-7B-instruct.jsonl
    If you prefer to generate the story embeddings within the Docker containers while simultaneously importing them along with the story texts, run
    just add-collection-and-calculate-embeddings [COLLECTION_NAME] [ORGANIZATION] [COUNTRY] [SEARCH_LANGUAGE] [DISPLAY_LANGUAGE] instead.
  6. The search and browse interface for the collections and embeddings should then be available at http://localhost:8080/isebelle
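The source layout described in step 3 can be sketched as follows. The collection name My_Collection and the story filenames are hypothetical examples; only the records/isebel/ nesting and the one-file-per-story convention come from the steps above.

```shell
# Expected layout under $STORIES_SRC_FOLDER (names are examples):
#
#   /tmp/collections/                  <- STORIES_SRC_FOLDER
#   └── My_Collection/                 <- [COLLECTION_NAME]
#       └── records/
#           └── isebel/
#               ├── story_001.xml      <- filename (minus .xml) is the story ID
#               └── story_002.xml
mkdir -p /tmp/collections/My_Collection/records/isebel
touch /tmp/collections/My_Collection/records/isebel/story_001.xml
touch /tmp/collections/My_Collection/records/isebel/story_002.xml
```

With that layout in place, the collection would be imported via just add-collection-xml My_Collection followed by the organization, country, and language arguments shown in step 5.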

Deploying to a server

The installation steps above (including the Docker setup) should be sufficient to run ISEBELLE on a remote server, though some reverse-proxy configuration may be needed to make the site accessible over the web. The following configuration has been used successfully with Apache on the host server, including a TLS setup with automatic HTTP-to-HTTPS redirection. The Jupyter functionality is not yet fully tested.

# Route paths beginning with /isebelle or /jupyter to the Docker containers
ProxyPass /isebelle http://127.0.0.1:8080/isebelle
# Note this will run a live (albeit password-protected) Jupyter server!
ProxyPass /jupyter http://127.0.0.1:8080/jupyter

# Attempt also to route websocket requests to Docker (for Jupyter support)
RewriteEngine On
RewriteCond %{REQUEST_URI}  ^/socket\.io           [NC]
RewriteCond %{QUERY_STRING} transport=websocket    [NC]
RewriteRule /(.*)           ws://localhost:8080/$1 [P,L]
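The ProxyPass and Rewrite directives above only take effect if the corresponding Apache modules are loaded. On Debian/Ubuntu-style installations this is typically done with a2enmod; this is a sketch, and module management differs on other distributions.

```shell
# Enable the Apache modules the configuration above relies on:
# mod_proxy/mod_proxy_http for ProxyPass, mod_proxy_wstunnel for the
# ws:// websocket rule, and mod_rewrite for the RewriteCond/RewriteRule pair.
sudo a2enmod proxy proxy_http proxy_wstunnel rewrite
sudo systemctl reload apache2
```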
