Semantic Graph Application

A clean, extensible application that turns unstructured text into a semantic network for exploration and retrieval. It builds a single graph with typed edges: co-occurrence edges from keyword extraction and semantic edges refined via embeddings + a lightweight GCN. The graph is persisted to Neo4j and exposed via a FastAPI backend; a React/Vite frontend provides a minimal UI for building and exploring.

Features

  • Typed-edge semantic graph (cooccurrence | semantic)
  • Multilingual keyword extraction (EN/IT, auto-detect with override)
  • Embeddings + GCN refinement (PyTorch + PyG)
  • Community detection (Louvain)
  • Neo4j storage (Keyword, Document, RELATED {type, weight}, IN_DOC)
  • API-first backend (FastAPI), minimal React/Vite frontend
  • Apple Silicon–friendly setup; Docker/Compose; CI; structured logging

Requirements

  • macOS (Apple Silicon) or Linux, Python 3.12+
  • Neo4j 5.x (local or Docker)
  • Node.js 20+ (for the frontend)

Quickstart (local, Apple Silicon)

  1. Create environment and install backend deps (uses uv)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv .venv && source .venv/bin/activate
uv sync
  2. Start Neo4j locally and create indexes
brew install neo4j && neo4j start
# Option A: paste into Neo4j Browser (http://localhost:7474)
CREATE INDEX IF NOT EXISTS FOR (k:Keyword) ON (k.name);
CREATE INDEX IF NOT EXISTS FOR (d:Document) ON (d.id);

# Option B: run via cypher-shell (local install)
cypher-shell -a bolt://localhost:7687 -u neo4j -p password -f scripts/setup_neo4j.cypher

# Option C: with Docker Compose Neo4j
# copy the script into the container and run it via cypher-shell
# (uses NEO4J_PASSWORD env or defaults to "password")
docker compose cp scripts/setup_neo4j.cypher neo4j:/var/lib/neo4j/import/setup_neo4j.cypher
docker compose exec neo4j cypher-shell -u neo4j -p "${NEO4J_PASSWORD:-password}" -f /var/lib/neo4j/import/setup_neo4j.cypher
  3. Run backend API
uv run uvicorn app.main:app --reload
  4. Run frontend
cd frontend
npm install
npm run dev

Open http://localhost:3000.

Quickstart (Docker Compose)

docker compose up --build
  • Backend: http://localhost:8000 (OpenAPI at /docs)
  • Frontend: http://localhost:3000
  • Neo4j: Bolt bolt://localhost:7687, Browser http://localhost:7474

Configuration

Copy config.yaml.example to config.yaml and adjust as needed.

  • preset: fast | balanced | max_quality
  • embedding_model, keyword_model
  • top_keywords, gnn_epochs, similarity_threshold
  • language: auto | en | it | ...
  • neo4j_uri, neo4j_user, neo4j_password
  • community_min_size

You can override preset, top_keywords, and similarity_threshold per /build request.
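
An illustrative config.yaml (key names follow the list above; every value here is only an example, not a project default — start from config.yaml.example):

preset: balanced                        # fast | balanced | max_quality
embedding_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2  # example model id
keyword_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2    # example model id
top_keywords: 15
gnn_epochs: 50
similarity_threshold: 0.6
language: auto                          # auto | en | it | ...
community_min_size: 3
neo4j_uri: bolt://localhost:7687
neo4j_user: neo4j
neo4j_password: password                # change for any non-local setup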

How to use

From the CLI (recommended for large corpora)

Process local text files directly without going through the web API:

# Quick keyword preview (fast, ~5-10 seconds)
uv run python tykli.py build --corpus ./data/sample_corpus/ --preview

# Full pipeline test (no database save)
uv run python tykli.py build --corpus ./data/sample_corpus/ --dry-run

# Build and save to Neo4j
uv run python tykli.py build --corpus ./data/sample_corpus/

# With quality preset override
uv run python tykli.py build --corpus ./data/docs/ --preset max_quality

# With custom parameters
uv run python tykli.py build --corpus ./data/docs/ --threshold 0.65 --top-keywords 20

See CLI_USAGE.md for detailed documentation.
See MODES_COMPARISON.md for a quick reference on --preview vs --dry-run.

Corpus preparation:

  1. Create directory: mkdir -p data/my_corpus
  2. Add .txt files (one document per file)
  3. Run: uv run python tykli.py build --corpus ./data/my_corpus/

Results are saved to Neo4j and immediately queryable via the web UI.

From the frontend

  1. Click "Build" to run ingestion → refinement → analysis → persistence on a small demo corpus.
  2. Toggle edge type (all | co-occurrence | semantic) and refresh to see graph stats and sample edges.
  3. Query documents by terms (AND mode).
  4. Inspect Neighbors for a term; choose semantic or co-occurrence neighbors.

From the API (curl examples)

  • Build a graph from inline corpus (balanced preset, custom threshold)
curl -X POST http://localhost:8000/build \
  -H 'Content-Type: application/json' \
  -d '{
        "corpus": [
          "Graph neural networks refine embeddings for semantic similarity.",
          "Co-occurrence graphs capture term relationships in documents.",
          "Community detection reveals topic clusters in a knowledge graph."
        ],
        "preset": "balanced",
        "similarity_threshold": 0.6
      }'
  • Check job status
curl http://localhost:8000/build/<JOB_ID>/status
  • Get graph (semantic-only edges, limit 500)
curl "http://localhost:8000/graph?edgeType=semantic&limit=500"
  • Get neighbors for a term (co-occurrence)
curl "http://localhost:8000/query/neighbors/graph?edgeType=cooccurrence&limit=10"
  • Get documents by terms (AND mode)
curl "http://localhost:8000/query/docs?terms=graph,semantic&mode=and"
  • Get communities
curl http://localhost:8000/analyze/communities

Data model

  • Node: Keyword {name, weight, community}
  • Edge: RELATED {type: 'cooccurrence'|'semantic', weight: float}
  • Node: Document {id, content} (content truncated)
  • Relationships: (k)-[:IN_DOC]->(d), (k1)-[:RELATED]->(k2)
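
Two illustrative Cypher queries against this model (run in Neo4j Browser or cypher-shell; the term "graph" and the limits are just examples):

// Semantic neighbors of a keyword, strongest first
MATCH (k:Keyword {name: 'graph'})-[r:RELATED]-(n:Keyword)
WHERE r.type = 'semantic'
RETURN n.name AS neighbor, r.weight AS weight
ORDER BY weight DESC
LIMIT 10;

// Documents containing any of the given keywords
MATCH (k:Keyword)-[:IN_DOC]->(d:Document)
WHERE k.name IN ['graph', 'semantic']
RETURN d.id AS document, collect(k.name) AS matched_terms;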

Presets & performance

  • fast: small model, fewer keywords, fewer epochs (quick exploration)
  • balanced (default): multilingual mpnet model, moderate keyword count and epochs (quality/performance balance)
  • max_quality: all-mpnet-base-v2, more keywords and epochs (best semantic quality)

Targets (M4 Max): build 1,000 documents in under 8 minutes (the first run also downloads models); neighbor queries in under 300 ms.

Development

  • Tests
uv run pytest -q
  • Lint & type check
uv run ruff check .
uv run mypy app
  • Logs: structured JSON; stages logged (ingestion/refinement/analysis/persist)

Troubleshooting

  • Torch/PyG install on macOS: pins are in pyproject.toml/uv.lock. Prefer CPU; MPS may not accelerate all ops.
  • Neo4j auth: default user neo4j, password password (change in config.yaml).
  • Slow model downloads: check network access and set a local model cache (e.g., HF_HOME; see the example below).
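
For example, to keep model downloads in a persistent Hugging Face cache directory (the path is only an example), export HF_HOME before building:

export HF_HOME=/path/to/hf-cache
uv run python tykli.py build --corpus ./data/sample_corpus/ --preview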

Roadmap

  • Better interactive graph viz (physics layouts, subgraph focus)
  • Optional vector DB for hybrid search
  • Cloud offloading for GNN refinement
  • Advanced entity/phrase extraction and document preview UX
