Merged
2 changes: 1 addition & 1 deletion .flake8
@@ -1,3 +1,3 @@
[flake8]
exclude = experiments,migrations,settings.py
exclude = experiments,migrations,settings.py,venv/
max-line-length = 88
12 changes: 9 additions & 3 deletions .github/workflows/test.yml
@@ -1,13 +1,19 @@
name: Test

on: [push, pull_request]
on:
push:
branches:
- main
pull_request:

jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- uses: actions/checkout@v5
- uses: actions/setup-python@v6
with:
python-version: '3.12.7'
- run: pip install -r web/requirements.txt
- run: pip install black isort flake8
- run: python3 -m black --check .
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,3 +1,4 @@
venv
__pycache__
web/db.sqlite3
web/**/*.sqlite3
**/.env
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.12.7
32 changes: 32 additions & 0 deletions Makefile
@@ -0,0 +1,32 @@
prepare-web:
pip install -r web/requirements.txt
cp web/.env.example web/.env
python ./web/manage.py migrate
python ./web/manage.py createsuperuser

install-dev:
pip install -r requirements.txt

install-scispacy:
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_lg-0.5.4.tar.gz

start:
python ./web/manage.py runserver

populate-db:
python ./web/manage.py import_wikidata

clear-db:
python ./web/manage.py clear_wikidata

compute-concepts:
python ./web/manage.py compute_concepts

categorize:
python ./web/manage.py categorize --limit 10

fix-files:
pip install -r requirements.txt
python3 -m black .
python3 -m isort .
python3 -m flake8 .
118 changes: 86 additions & 32 deletions README.md
@@ -8,35 +8,87 @@ For a demonstration of a page with at least one link, see for example `{baseurl}

To install all the necessary Python packages, run:

pip install -r requirements.txt
```bash
make prepare-web # Sets up the env file, database, and superuser
# OR
pip install -r web/requirements.txt
```

Prepare an environment:
```bash
cp web/.env.example web/.env
```

Next, to create a database, run:

python manage.py migrate
```bash
python manage.py migrate
```

In order to use the administrative interface, you need to create an admin user:

python manage.py createsuperuser
```bash
python manage.py createsuperuser
```

Finally, to populate the database, run

python manage.py import_wikidata
```bash
python manage.py import_wikidata
# OR
make populate-db
```

* To fetch Wikipedia articles and extract keywords from them, first install the scispaCy model:
```bash
make install-scispacy
```
  then configure your email as `WIKIPEDIA_CONTACT_EMAIL` in [source_wikidata.py](web/slurper/source_wikidata.py); this is needed for the Wikipedia API requests.
* Then run the database population again (make sure your database is cleared first).

If you ever want to repopulate the database, you can clear it using

python manage.py clear_wikidata
```bash
python manage.py clear_wikidata
```

### To run the categorizer
The categorizer is set up to work with several models, divided into free and paid.
The free models run locally, so expect some performance hits. The models are downloaded the first time
the categorizer is run, and by default the free models are used.

The database needs to be filled in before running it, so:
```bash
make populate-db
```
then
```bash
make categorize
```

There are some known issues with inline workarounds, such as `gpt2` getting stuck
and returning the same prompt, followed a few times by `---\n\n\n---`.

For more details see [categorizer readme](web/categorizer/README.md).

## Notes for developers

In order to contribute, install [Black](https://github.com/psf/black) and [isort](https://pycqa.github.io/isort/) autoformatters and [Flake8](https://flake8.pycqa.org/) linter.

pip install black isort flake8
```bash
make install-dev
```

You can run all three with

isort .
black .
flake8
```bash
make fix-files
# Or manually
isort .
black .
flake8
```

or set up a Git pre-commit hook by creating `.git/hooks/pre-commit` with the following contents:

@@ -47,35 +99,37 @@ black . && isort . && flake8
```
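Remember to make the hook executable, or Git will silently skip it. A minimal version (a sketch, assuming the three tools above are already installed) can be created like this:

```shell
# Sketch: write a minimal pre-commit hook and make it executable
mkdir -p .git/hooks
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
black . && isort . && flake8
EOF
chmod +x .git/hooks/pre-commit
```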

Each time after you change a model, make sure to create the appropriate migrations:

python manage.py makemigrations
```bash
python manage.py makemigrations
```

To update the database with the new model, run:

```bash
python manage.py migrate
```

## Instructions for Katja to update the live version

sudo systemctl stop mathswitch
cd mathswitch
git pull
source venv/bin/activate
cd web
./manage.py rebuild_db
sudo systemctl start mathswitch

```bash
sudo systemctl stop mathswitch
cd mathswitch
git pull
source venv/bin/activate
cd web
./manage.py rebuild_db
sudo systemctl start mathswitch
```
## WD item JSON example

```
```json
{
'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q192276'},
'art': {'type': 'uri', 'value': 'https://en.wikipedia.org/wiki/Measure_(mathematics)'},
'image': {'type': 'uri', 'value': 'http://commons.wikimedia.org/wiki/Special:FilePath/Measure%20illustration%20%28Vector%29.svg'},
'mwID': {'type': 'literal', 'value': 'Measure'},
'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'measure'},
'itemDescription': {'xml:lang': 'en', 'type': 'literal', 'value': 'function assigning numbers to some subsets of a set, which could be seen as a generalization of length, area, volume and integral'},
'eomID': {'type': 'literal', 'value': 'measure'},
'pwID': {'type': 'literal', 'value': 'Definition:Measure_(Measure_Theory)'
"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q192276"},
"art": {"type": "uri", "value": "https://en.wikipedia.org/wiki/Measure_(mathematics)"},
"image": {"type": "uri", "value": "http://commons.wikimedia.org/wiki/Special:FilePath/Measure%20illustration%20%28Vector%29.svg"},
"mwID": {"type": "literal", "value": "Measure"},
"itemLabel": {"xml:lang": "en", "type": "literal", "value": "measure"},
"itemDescription": {"xml:lang": "en", "type": "literal", "value": "function assigning numbers to some subsets of a set, which could be seen as a generalization of length, area, volume and integral"},
"eomID": {"type": "literal", "value": "measure"},
"pwID": {"type": "literal", "value": "Definition:Measure_(Measure_Theory)"}
}
```
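Each binding in the example above follows the SPARQL JSON results format: a dict with `"type"` and `"value"` keys. As an illustration (a sketch using two fields from the example), the label and Wikidata ID can be read like this:

```python
import json

# Two bindings copied from the WD item example above
raw = '''{
    "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q192276"},
    "itemLabel": {"xml:lang": "en", "type": "literal", "value": "measure"}
}'''

binding = json.loads(raw)
label = binding["itemLabel"]["value"]
# The Q-ID is the last path segment of the entity URI
qid = binding["item"]["value"].rsplit("/", 1)[-1]
print(label, qid)  # → measure Q192276
```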


5 changes: 5 additions & 0 deletions requirements.txt
@@ -0,0 +1,5 @@
black~=25.9.0
isort~=5.12.0
flake8~=7.3.0

-r ./web/requirements.txt
2 changes: 2 additions & 0 deletions web/.env.example
@@ -0,0 +1,2 @@
SECRET_KEY="django-insecure-9wy9w#vf^tde0262doyy_j19=64c()_qub!1)f+fh-b^=7ndw*"
WIKIPEDIA_CONTACT_EMAIL=my@email.com
141 changes: 141 additions & 0 deletions web/categorizer/README.md
@@ -0,0 +1,141 @@
# Categorizer Module

The categorizer module provides LLM-powered categorization of mathematical concepts.

## Setup

### 1. Install Required Dependencies

**For FREE local models (recommended):**
```bash
make install
```

**For paid API models (optional):**

For OpenAI:
```bash
pip install openai
```

For Anthropic Claude:
```bash
pip install anthropic
```

**For Ollama (free local alternative):**
1. Install Ollama from https://ollama.ai
2. Install langchain-community: `pip install langchain-community`
3. Pull a model: `ollama pull llama2`

### 2. Configure API Keys (only for paid models)

Set the appropriate environment variable for your chosen LLM provider:

**For OpenAI:**
```bash
export OPENAI_API_KEY="your-openai-api-key-here"
```

**For Anthropic Claude:**
```bash
export ANTHROPIC_API_KEY="your-anthropic-api-key-here"
```

**For Ollama (optional):**
```bash
export OLLAMA_MODEL="llama2" # Default is llama2
```

You can also add these to a `.env` file or your shell configuration file (`.bashrc`, `.zshrc`, etc.).
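For example, a `.env` file using the variable names above might look like this (values are placeholders):

```bash
OPENAI_API_KEY=your-openai-api-key-here
ANTHROPIC_API_KEY=your-anthropic-api-key-here
OLLAMA_MODEL=llama2
```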

## Usage

### Basic Usage

Categorize all items using the default FREE LLM (HuggingFace FLAN-T5):
```bash
python manage.py categorize
```

### With Options

Categorize a limited number of items:
```bash
python manage.py categorize --limit 10
# OR
make categorize
```

Use a specific LLM provider:

**FREE models (run locally):**
```bash
# Use HuggingFace FLAN-T5 (default, free, good for instruction following)
python manage.py categorize --llm huggingface_flan_t5

# Use HuggingFace GPT-2 (free, generative model)
python manage.py categorize --llm huggingface_gpt2

# Use HuggingFace DialoGPT (free, conversational model)
python manage.py categorize --llm huggingface_dialogpt

# Use Ollama (free, requires Ollama installed)
python manage.py categorize --llm ollama
```

**Paid API models:**
```bash
# Use OpenAI GPT-4 (requires API key)
python manage.py categorize --llm openai_gpt4

# Use OpenAI GPT-3.5 Turbo (requires API key)
python manage.py categorize --llm openai_gpt35

# Use Anthropic Claude (requires API key)
python manage.py categorize --llm anthropic_claude
```

Combine options:
```bash
python manage.py categorize --limit 5 --llm huggingface_flan_t5
```

## Architecture

- `categorizer_service.py` - Main service for categorizing items
- `llm_service.py` - Service for calling various LLM APIs
- `management/commands/categorize.py` - Django management command

## Supported LLMs

### Free Models (No API Key Required)
1. **HuggingFace FLAN-T5** - Google's instruction-following model (recommended for tasks)
2. **HuggingFace GPT-2** - OpenAI's classic generative model
3. **HuggingFace DialoGPT** - Microsoft's conversational model
4. **Ollama** - Run any Ollama model locally (llama2, mistral, etc.)

### Paid API Models (Require API Key)
1. **OpenAI GPT-4** - Most capable, but expensive
2. **OpenAI GPT-3.5 Turbo** - Fast and cheaper than GPT-4
3. **Anthropic Claude** - High quality, good reasoning

## Performance Notes

- **Free models** run locally and don't require internet/API keys, but:
- First run downloads the model (~1-3GB depending on model)
- Requires sufficient RAM (4-8GB+ recommended)
- Slower than API models (especially without GPU)

- **API models** are faster but cost money per request

- **Ollama** is a good middle ground - free, local, and supports many models

## Extending

To add support for additional LLM providers:

1. Add a new entry to the `LLMType` enum in `llm_service.py`
2. Implement a new private method (e.g., `_call_new_provider`) in the `LLMService` class
3. Add the new provider to the `call_llm` method's conditional logic
4. Update the command choices in `management/commands/categorize.py`
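The four steps above can be sketched as follows. This is a hypothetical skeleton, not the project's actual code: the class and method names mirror the description above (`LLMType`, `LLMService`, `call_llm`), but the enum values and provider logic are placeholders.

```python
from enum import Enum


class LLMType(Enum):
    # Existing entries elided; NEW_PROVIDER is the hypothetical addition (step 1)
    HUGGINGFACE_FLAN_T5 = "huggingface_flan_t5"
    NEW_PROVIDER = "new_provider"


class LLMService:
    def _call_new_provider(self, prompt: str) -> str:
        # Step 2: provider-specific request logic would go here
        return f"category for: {prompt}"

    def call_llm(self, llm_type: LLMType, prompt: str) -> str:
        # Step 3: route the new enum value to the new private method
        if llm_type is LLMType.NEW_PROVIDER:
            return self._call_new_provider(prompt)
        raise ValueError(f"Unsupported LLM type: {llm_type}")


# Step 4 would register "new_provider" in the command's --llm choices
print(LLMService().call_llm(LLMType.NEW_PROVIDER, "measure"))  # → category for: measure
```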
Empty file added web/categorizer/__init__.py
Empty file.