Feat: Complete Neuro-Symbolic Pipeline with Docker Infra by NakulSingh156 · Pull Request #32 · dbpedia/neural-extraction-framework

NakulSingh156 · 2026-02-11T06:03:52Z

Architecture Overhaul: Neuro-Symbolic Pipeline & Docker Infrastructure

Motivation

This PR significantly refactors the extraction pipeline, transitioning from a heuristic-based standalone script to a robust, reproducible Neuro-Symbolic Architecture fully aligned with the GSoC25 Redis infrastructure. The goal is to provide a deterministic, hallucination-resistant extraction layer that seamlessly integrates with DBpedia's existing backend.

Core Architectural Updates

Dockerized Environment: Orchestrated the pipeline using docker-compose, isolating the Python extraction environment and the Redis backend for zero-config reproducibility.
Redis-Backed Entity Resolution: Replaced external API calls with an offline, high-performance Redis cache. Includes seed_redis.py to securely mock DBpedia fixtures (Anchor Candidates & Types) for end-to-end testing without requiring the 50GB data dump.

Neuro-Symbolic Features

The "Hallucination Buster" (Ontology Guardrails): Integrated a strict SPARQL validation layer. If the Neural model (all-MiniLM-L6-v2) extracts a semantically plausible but factually incorrect triple (e.g., Ronaldo -> playsFor -> Chicago Bulls), the Symbolic layer intercepts and rejects it as contradicting world knowledge.
Graph-Proximity Reasoning: Adapted validation to handle historical/complex facts (e.g., Ronaldo -> Real Madrid). If strict ontology matching fails due to temporal shifts, the system falls back to a BFS traversal, exploiting dbo:wikiPageWikiLink to compute contextual relatedness.
Deterministic Extraction: Refactored the Spacy pipeline to enforce deterministic Subject/Object assignment based on character-index sorting, resolving non-deterministic set behaviors.

CI/CD & Code Quality

Resolved all static analysis warnings flagged by CodeRabbit AI (removed dead code, fixed bare exceptions, sanitized return types, and standardized environment variables).

Proof of Build (End-to-End Pipeline Validation)

1. Infrastructure: Clean Redis Seeding

2. Extraction & Validation: Active Voice, Passive Voice, and Hallucination Rejection
Notice how the symbolic layer successfully validates complex sentence structures while strictly rejecting the hallucinated 'Chicago Bulls' output.

3. Client-Side Graph Reasoning (BFS Traversal)
Finding contextual links for complex entities directly through the Knowledge Graph.

Things Addressed in this PR:

Migrated to offline Redis infrastructure per GSoC25 guidelines.
Corrected previous benchmark anomalies and expanded testing to a 20-item Mega-Benchmark.
Replaced hardcoded weights and stopwords with a dynamic HybridLinker.
Abstracted predicate matching via a ModelInterface using all-MiniLM-L6-v2.

coderabbitai · 2026-02-11T06:04:08Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds containerization and docs plus a neuro‑symbolic prototype: benchmark data, Redis-backed entity linking and redirect resolution (Wikipedia/DBpedia), relation extraction using embeddings, SPARQL-based reasoning/visualization, and triple validation modules.

Changes

Cohort / File(s)	Summary
Infrastructure & Containerization `\.dockerignore`, `Dockerfile`, `docker-compose.yml`, `requirements.txt`	Adds Docker ignore patterns; new Python 3.10-slim Dockerfile; docker-compose adds a Redis service, configures build/volumes/env for the extractor service and healthcheck dependency; pins ML/NLP/graph dependencies.
Documentation `README.md`	Adds a comprehensive README describing the hybrid pipeline, architecture, benchmarks, run instructions, and roadmap.
Benchmark Data `benchmarks/sentences.json`	Adds a 52-sentence JSON dataset for evaluation/benchmarks.
Redis & Seeding `src/redis_client.py`, `src/seed_redis.py`	Adds Redis client wrapper (redirects, anchors, types) and a seed script that flushes DB and populates redirect/anchor/type keys for local testing.
Wikipedia Resolver `src/wiki_resolver.py`	Adds redirect resolution with redirect-following, disambiguation detection, and OpenSearch fallback.
DBpedia Lookup & Ranking `src/dbpedia_linker.py`	Adds DBpedia Lookup integration: label cleaning, candidate expansion, filtering (fuzzy + coverage), multi-factor scoring (lexical, type priors, popularity), and best-entity selection.
Hybrid Linker & NER `src/linker.py`	Adds HybridLinker using Redis lookups, redirect resolution, spaCy NER (with PROPN fallback), simple disambiguation using string similarity, and wrapper functions for resolves and span extraction.
Relation Extraction & Demo `src/main.py`	Adds RelationExtractor using sentence-transformers for predicate matching, Pipeline that wires extractor/linker/validator, and a CLI/demo entrypoint.
Neuro‑Symbolic Reasoning `src/reasoning.py`	Adds NeuroSymbolicReasoner performing direct and constrained 2‑hop SPARQL searches against DBpedia, and graph plotting to a PNG.
Fact Validation `src/validation.py`	Adds FactChecker that sanitizes inputs and runs two-tier SPARQL ASK queries (strict triple and graph proximity) to validate subject–predicate–object relations.
Project Metadata `.dockerignore`, `requirements.txt`	Minor additions to .dockerignore and pinned/added Python dependencies required by new modules.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Extractor as RelationExtractor
    participant Linker as HybridLinker
    participant Redis as Redis (local)
    participant Wiki as Wikipedia API
    participant DBpedia as DBpedia Lookup/SPARQL
    participant Reasoner as NeuroSymbolicReasoner
    participant Validator as FactChecker

    Client->>Extractor: submit text
    Extractor->>Linker: extract_entity_spans(text)
    Linker->>Redis: get_redirect / get_entities / get_type
    alt redirect not in Redis
        Linker->>Wiki: resolve_redirect(span)
        Wiki-->>Linker: resolved title
    end
    Linker->>DBpedia: lookup candidates
    DBpedia-->>Linker: candidate results
    Linker-->>Extractor: best entity & types

    Extractor->>Reasoner: find_path(subj_entity, obj_entity)
    Reasoner->>DBpedia: SPARQL direct-link query
    alt direct link found
        DBpedia-->>Reasoner: predicate/triple
    else
        Reasoner->>DBpedia: fetch neighbors (1-hop)
        DBpedia-->>Reasoner: neighbor list
        Reasoner->>DBpedia: query connecting neighbor -> end
        DBpedia-->>Reasoner: connecting triple
    end
    Reasoner-->>Extractor: semantic path (PNG saved)

    Extractor->>Validator: validate_triple(subj_uri, pred_uri, obj_uri)
    Validator->>DBpedia: ASK Tier1 / Tier2 SPARQL
    DBpedia-->>Validator: boolean result
    Validator-->>Extractor: validation outcome
    Extractor-->>Client: final extraction + validation result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Added comprehensive README.md for main repository #22: Overlapping README/documentation additions and restructuring.
GSoC 25: Neural Extraction Framework #21: Overlapping/nearly-duplicate entity-linking, Redis-backed resolver, DBpedia/Wikipedia lookup, and pipeline components.

Suggested reviewers

mommi84
smilingprogrammer

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Merge Conflict Detection	⚠️ Warning	❌ Merge conflicts detected (8 files): ⚔️ `GSoC24_H/models/download_models.sh` (content) ⚔️ `GSoC24_H/requirements.txt` (content) ⚔️ `GSoC24_H/src/chunking/crf_chunker.py` (content) ⚔️ `GSoC24_H/src/demo.py` (content) ⚔️ `GSoC24_H/src/entity_linking.py` (content) ⚔️ `GSoC24_H/src/indIE.py` (content) ⚔️ `GSoC24_H/src/start.py` (content) ⚔️ `README.md` (content) These conflicts must be resolved before merging into `main`.	Resolve conflicts locally and push changes to this branch.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Feat: Complete Neuro-Symbolic Pipeline with Docker Infra' clearly and specifically summarizes the main changes: introduction of a complete neuro-symbolic extraction pipeline and Docker infrastructure.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

⚔️ Resolve merge conflicts (beta)

Auto-commit resolved conflicts to branch feature/docker-final
Post resolved changes as copyable diffs in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 20

🤖 Fix all issues with AI agents

In `@docker-compose.yml`:
- Around line 10-11: The restart policy "restart: unless-stopped" is causing an
infinite loop because the one-shot command "command: python src/main.py" exits;
remove or change the restart policy so the service does not auto-restart for a
demo run (delete or comment out "restart: unless-stopped"), or alternatively
change the default command to a long-running placeholder (replace "command:
python src/main.py" with the same tail keepalive used in the Dockerfile, e.g.
tail -f /dev/null) and run the demo via docker compose exec/run when needed;
update the docker-compose service block containing "command: python src/main.py"
and "restart: unless-stopped" accordingly.
- Around line 7-9: MAX_RETRIES and DBPEDIA_ENDPOINT in docker-compose.yml are
not consumed by the app; either remove them from the compose file or wire them
into the application's configuration. To fix, either (A) delete the MAX_RETRIES
and DBPEDIA_ENDPOINT environment entries from docker-compose.yml, or (B) read
them from the environment in your config layer (e.g., add
os.getenv('MAX_RETRIES') and os.getenv('DBPEDIA_ENDPOINT') in the project's
configuration module or Config class) and then use MAX_RETRIES in the retry
logic (e.g., in any fetch_with_retry / retry loop function) and DBPEDIA_ENDPOINT
where the DBPedia client or query function (e.g., DBPediaClient, query_dbpedia)
constructs SPARQL requests so the values are actually applied at runtime.

In `@README.md`:
- Around line 145-150: Update the "Roadmap for GSoC 2026" section to remove the
claim that Dockerization is future work and instead state that Dockerization has
been implemented (referencing the added Dockerfile and docker-compose.yml), then
list remaining infra tasks such as providing a local SPARQL endpoint, adding
CI/CD steps to build and publish container images, and documenting how to run
the stack locally and in CI.
- Around line 108-141: Add a "Running with Docker" subsection to the README.md
under "How to Run" that documents using the included Dockerfile and
docker-compose.yml; specifically instruct to build and start the services with
the command `docker compose up --build`, mention any prerequisite Docker/Docker
Compose requirements, and note which containers correspond to the main entries
(e.g., main.py, reasoning.py, validation.py) so users know how to run or
override individual services via docker-compose.
- Around line 111-118: Replace the hardcoded local path command cd
"Desktop/Neural Extraction Framework Folder" in the README with a generic
repository clone and change-directory flow: remove that specific cd line and
instead instruct users to clone the repository (e.g., via git clone <repo-url>)
and then cd into the repository directory (neural-extraction-framework), leaving
the subsequent dependency installation steps (pip install -r requirements.txt
and spacy model download) unchanged; update the README text around these
commands to reflect the new, generic setup instructions.

In `@requirements.txt`:
- Around line 1-9: Update requirements.txt to fix the security issue and ensure
reproducible builds: replace the open/old requests entry with a fixed secure
version (e.g., requests==2.32.4 or >=2.32.4) to address CVE-2024-35195 and
CVE-2024-47081, and pin the currently unpinned heavy deps (torch, transformers,
urllib3) to exact, tested versions used by your Docker build (e.g.,
torch==<chosen-version>, transformers==<chosen-version>,
urllib3==<chosen-version>) or alternatively generate and commit a lock file
(pip-tools/poetry/poetry.lock/requirements.txt via pip freeze) so builds are
deterministic; update the entries for requests, torch, transformers, and urllib3
in requirements.txt accordingly.

In `@src/dbpedia_linker.py`:
- Around line 1-74: There is duplicated logic between clean_label, covers_query,
TYPE_PRIOR, STOPWORDS and get_best_entity; extract these shared constants and
helper functions into a single utility module (e.g., linker_utils) and have both
modules import them; reconcile the scoring differences by parameterizing variant
behavior (e.g., an optional expand_alias and context_score hook) so
get_best_entity in src.dbpedia_linker and src.linker delegate to a common
implementation while passing their specific scoring adjustments, and update
references to resolve_redirect and fuzz usage to use the shared helpers.
- Around line 46-50: The DBpedia Lookup call is missing a timeout and can hang;
update the requests.get call that uses DBPEDIA_LOOKUP inside the for term in
candidates loop to include timeout=5 (same as wiki_resolver) and keep the
existing try/except around it so failures time out instead of blocking; ensure
you pass timeout=5 to requests.get and leave the subsequent .json() processing
(r and clean_label) unchanged.

In `@src/linker.py`:
- Around line 48-61: This module defines a local resolve_redirect that diverges
from the more robust two-strategy implementation in the wiki_resolver module,
causing inconsistent resolution; remove the local resolve_redirect function and
import the shared implementation instead (e.g., import resolve_redirect from
wiki_resolver) so this module and dbpedia_linker use the exact same logic, or
alternatively refactor both to call a single new shared helper; ensure you
update any local references and tests to use the imported resolve_redirect so
behavior matches the redirect+pageprops then OpenSearch strategy.
- Around line 40-42: The three HTTP calls in expand_alias, resolve_redirect, and
get_best_entity use requests.get without timeouts and can hang; update each
requests.get invocation (the calls that build params for the Wikipedia API in
expand_alias, resolve_redirect, and get_best_entity) to include timeout=10
(e.g., requests.get(..., params=..., timeout=10)) so the calls fail fast like
other code (wiki_resolver.py) and prevent indefinite blocking.
- Line 117: The current normalization uses s.lstrip("the ") which removes any of
the characters 't','h','e',' ' from the left and mangles names; update the
normalization in the list comprehension that assigns normalized from spans to
remove the literal prefix "the " instead: either call s = s.strip() and then s =
s.removeprefix("the ").strip() (Python 3.9+), or for compatibility use if
s.lower().startswith("the "): s = s[4:].strip() else: s = s.strip(); then
collect into the set as before (refer to the normalized variable and the spans
iterable).

In `@src/main.py`:
- Around line 16-34: In extract(text, subj, obj) the use of text.find() can
return -1 causing bogus start/end offsets and a corrupted phrase; replace the
blind find logic by first checking the result of text.find(subj) and
text.find(obj) (and/or using case-insensitive search variants) and only compute
start = subj_idx + len(subj) and end = obj_idx when both indices are >= 0 and
start < end; otherwise fall back to a safe default (e.g., set phrase = text or
perform a best-effort substring using available index), remove the bare
try/except, and ensure downstream code (query_vec/model.encode and embedding
loop) always receives a well-defined non-corrupted phrase.
- Line 1: The extract method currently uses text.find() for subj/obj and
computes start/end without checking for -1; update extract to validate that
subj_idx and obj_idx returned by text.find(...) are not -1 before computing
offsets and slicing (variables: subj_idx, obj_idx, start, end, text.find, and
the extract method), and if either is -1 fall back to a safe extraction strategy
(e.g., take a fixed window around the other entity, or return the full
sentence/paragraph) ensuring start < end and indices are clamped to [0,
len(text)] before slicing; remove reliance on try/except for this case so no
invalid offsets get used.

In `@src/reasoning.py`:
- Around line 15-16: Sanitize entity names before building DBpedia resource
URIs: add a helper (e.g., make_dbpedia_resource_uri(entity)) that replaces
spaces with underscores and percent-encodes the result using
urllib.parse.quote(..., safe='_') (or equivalent URI-encoding), then replace
direct f-string interpolation at start_uri and end_uri and the other occurrences
flagged (the spots that build SPARQL URIs around lines 21, 33–37, 48) to call
this helper; also validate the input (raise or return error for empty/null) so
unsanitized characters like >, }, \ cannot be injected into SPARQL queries.
- Around line 68-81: The _plot_graph function currently calls plt.show() which
blocks in headless Docker; change it to use a non-interactive backend and save
the figure to disk instead: ensure matplotlib is configured to a non-interactive
backend (e.g., set matplotlib.use('Agg') at module import) or simply replace the
plt.show() call with plt.savefig(...) using a descriptive filename (and full
path if needed) and then call plt.close() to free resources; update _plot_graph
to write the generated plot to file and remove the interactive display call.

In `@src/validation.py`:
- Around line 4-5: The global call
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) in
validation.py should be removed to avoid suppressing TLS warnings for the whole
process; instead, delete that line and, if you absolutely must silence warnings
for a single request, wrap the specific request in a narrow
warnings.catch_warnings context or use requests with verify=False only for that
call (and document it), or better yet fix certificate handling; do not disable
InsecureRequestWarning at import time.
- Around line 26-30: The SPARQL query construction currently interpolates
user-supplied subj_uri and obj_uri directly into the f-string, allowing SPARQL
injection; update the code that builds the query (the block creating query using
subj_uri and obj_uri in src/validation.py) to either use a parameterized SPARQL
API (e.g., SPARQLWrapper with prepared/parameter bindings) or validate/sanitize
both subj_uri and obj_uri before interpolation — enforce strict IRI/URI
validation (allowed characters, proper angle-bracket-wrapped form) and reject or
escape any input that does not match; ensure the final query uses only validated
URIs or bound parameters and add unit tests for malicious inputs to confirm the
fix.
- Around line 8-9: The endpoint is hardcoded in the class __init__
(self.endpoint = "http://dbpedia.org/sparql"); change it to read the
DBPEDIA_ENDPOINT environment variable with a fallback to the current literal.
Import os if not already imported and set self.endpoint =
os.environ.get("DBPEDIA_ENDPOINT", "http://dbpedia.org/sparql") inside __init__
so the docker-compose-configured URL is honored.

In `@src/wiki_resolver.py`:
- Around line 28-49: The computed redirect target final_title is never used;
update the non-disambiguation return to prefer the redirect target (final_title)
when present (i.e., return final_title if redirects else page["title"]) inside
the pages loop (where page, pages, and is_disambig are checked), or remove
final_title entirely if you prefer not to support redirects—ensure the code in
the loop consistently uses final_title/term semantics so redirects aren't
ignored.

🧹 Nitpick comments (13)

benchmarks/sentences.json (1)

1-52: Benchmark data file is not referenced by any source code.

The README mentions this dataset for evaluation, but no module in src/ actually loads benchmarks/sentences.json. src/main.py hardcodes its own sentence list instead. Consider either loading this file in the demo/benchmark runner or noting in the README that it's a manual evaluation dataset.

Additionally, the .dockerignore excludes the benchmarks/ directory from the build context, so this file won't be baked into the image (though the docker-compose.yml volume mount would make it available at runtime).
Dockerfile (2)
13-16: Add --no-install-recommends to reduce image size.

This avoids pulling in unnecessary recommended packages, which can significantly bloat the slim base image.
Suggested fix
-RUN apt-get update && apt-get install -y \
+RUN apt-get update && apt-get install -y --no-install-recommends \
     build-essential \
     curl \
     && rm -rf /var/lib/apt/lists/*
28-31: Container runs as root — add a non-root user.

The container defaults to root, which is a security best practice violation (flagged by Trivy DS-0002). Since this service makes outbound HTTP requests to external endpoints, running as a non-privileged user limits the blast radius of any compromise.
Suggested fix
 COPY . .
 
+RUN useradd --create-home appuser
+USER appuser
+
 CMD ["tail", "-f", "/dev/null"]
src/main.py (1)

28-33: No minimum similarity threshold — any phrase will match some predicate.

best_score starts at 0.0, so even completely unrelated phrases will be assigned a predicate as long as cosine similarity is non-negative (which it almost always is for normalized embeddings). Consider adding a confidence threshold (e.g., 0.3) and returning "None" when the best score falls below it.
src/wiki_resolver.py (1)
51-52: Silent exception swallowing makes debugging infeasible.

Both except blocks (Lines 51–52 and 76–77) discard all errors. Since these make network calls to an external API, failures (DNS, rate-limiting, malformed JSON) will be invisible. At minimum, log the exception at debug/warning level so issues can be diagnosed in production or Docker runs.
Proposed fix
+import logging
+
+logger = logging.getLogger(__name__)
+
 ...
-    except Exception as e:
-        pass
+    except Exception:
+        logger.debug("Strategy 1 (redirect check) failed for term=%r", term, exc_info=True)
 ...
-    except:
-        pass
+    except Exception:
+        logger.debug("Strategy 2 (opensearch) failed for term=%r", term, exc_info=True)
Also applies to: 76-77
src/dbpedia_linker.py (3)
24-30: Ambiguous loop variable name l.

l is easily confused with 1 in many fonts. Use a more descriptive name like label_token.
Proposed fix
 def covers_query(query, label):
     q_tokens = [t for t in query.lower().replace("-", " ").split() if t not in STOPWORDS]
-    l_tokens = label.lower().replace("-", " ").split()
+    label_tokens = label.lower().replace("-", " ").split()
     for q in q_tokens:
-        if not any(l.startswith(q) for l in l_tokens):
+        if not any(lt.startswith(q) for lt in label_tokens):
             return False
     return True
64-66: Season penalty and exact-match bonus applied to the same score variable can conflict.

Line 65 multiplies by 0.1 if "season" appears in the label or types, then Line 66 multiplies by 1.5 for an exact match. A result labeled e.g. "2023–24 La Liga season" that exactly matches the query would get both penalties applied (score *= 0.1 * 1.5 = 0.15). This may be intentional, but the interaction is non-obvious. Consider documenting the expected stacking behavior or re-ordering the checks.

69-70: Bare except with continue silently hides all errors.

Same concern as in wiki_resolver.py. At minimum, catch Exception (not bare except) and log the failure with the candidate term for debuggability.
src/reasoning.py (3)
29-57: N+1 query pattern: up to 50 sequential SPARQL calls for 2-hop search.

The client-side BFS fetches up to 50 neighbors, then issues an individual SPARQL query for each one to check if it connects to the end entity. This results in up to 51 HTTP round-trips to the public DBpedia SPARQL endpoint, which is slow and likely to trigger rate limiting.

Consider consolidating into a single 2-hop SPARQL query with a LIMIT, or at least batch the neighbor checks using VALUES:
SELECT ?p1 ?mid ?p2 WHERE {
  <start> ?p1 ?mid .
  ?mid ?p2 <end> .
  FILTER(STRSTARTS(STR(?p1), "http://dbpedia.org/ontology/"))
} LIMIT 5
9-27: find_path relies entirely on side effects (print + plot) with no return value.

Returning None implicitly makes it impossible to programmatically check results or write tests. Consider returning the path_data list (or None on failure) so callers can act on the result.

Also applies to: 59-59

61-66: Bare except in _query hides network and parsing errors.

A timeout, connection error, or malformed JSON response will all be silently ignored. Log the exception and the failing SPARQL query to aid debugging.
src/linker.py (2)
7-11: Bare except on model load hides unexpected errors.

If spaCy is installed but the model load fails for a reason other than "not found" (e.g., corrupt model, incompatible version), this silently sets nlp = None and the pipeline degrades without any actionable diagnostic. Catch OSError (which spaCy raises for missing models) specifically.
Proposed fix
 try:
     nlp = spacy.load("en_core_web_sm")
-except:
+except OSError:
     print("⚠️ Warning: Spacy model 'en_core_web_sm' not found. Run: python -m spacy download en_core_web_sm")
     nlp = None
104-118: extract_entity_spans returns [] when nlp is None, with no signal to the caller.

When the spaCy model isn't loaded, this silently returns an empty list. Callers have no way to distinguish "no entities found" from "NER unavailable." Consider logging a warning or raising an exception so the failure mode is explicit.

coderabbitai · 2026-02-11T06:18:43Z

docker-compose.yml

+    environment:
+      - MAX_RETRIES=3
+      - DBPEDIA_ENDPOINT=https://dbpedia.org/sparql


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify that no source file reads these environment variables echo "=== Searching for MAX_RETRIES ===" rg -n 'MAX_RETRIES' --type=py echo "=== Searching for DBPEDIA_ENDPOINT ===" rg -n 'DBPEDIA_ENDPOINT' --type=py echo "=== Searching for os.environ or os.getenv ===" rg -n 'os\.(environ|getenv)' --type=py

Repository: dbpedia/neural-extraction-framework

Length of output: 386

Environment variables MAX_RETRIES and DBPEDIA_ENDPOINT in docker-compose.yml are never read by application code.

No Python modules reference these variables, and searches for os.environ or os.getenv confirm they are not used anywhere. These are dead configuration and should either be wired into the code or removed to avoid confusion.

🤖 Prompt for AI Agents

In `@docker-compose.yml` around lines 7 - 9, MAX_RETRIES and DBPEDIA_ENDPOINT in docker-compose.yml are not consumed by the app; either remove them from the compose file or wire them into the application's configuration. To fix, either (A) delete the MAX_RETRIES and DBPEDIA_ENDPOINT environment entries from docker-compose.yml, or (B) read them from the environment in your config layer (e.g., add os.getenv('MAX_RETRIES') and os.getenv('DBPEDIA_ENDPOINT') in the project's configuration module or Config class) and then use MAX_RETRIES in the retry logic (e.g., in any fetch_with_retry / retry loop function) and DBPEDIA_ENDPOINT where the DBPedia client or query function (e.g., DBPediaClient, query_dbpedia) constructs SPARQL requests so the values are actually applied at runtime.

docker-compose.yml

README.md

coderabbitai · 2026-02-11T06:18:43Z

README.md

+## Roadmap for GSoC 2026

- **Google Summer of Code** for supporting student contributions
- **DBpedia Association** for mentorship and infrastructure
- All **GSoC students** and **mentors** who have contributed
- The **open-source community** for tools and collaboration
-
---
+This prototype confirms the logic is sound. The next phase focuses on **Infrastructure**:

-<p align="center">
-  Built with ❤️ by the DBpedia Community
-</p>
-
-<p align="center">
-  <a href="https://www.dbpedia.org">
-    <img src="https://img.shields.io/badge/Visit-DBpedia.org-0066CC?style=for-the-badge" alt="Visit DBpedia">
-  </a>
-  <a href="https://github.com/dbpedia">
-    <img src="https://img.shields.io/badge/Explore-More_Repos-181717?style=for-the-badge&logo=github" alt="Explore Repos">
-  </a>
-</p>
+1. **Dockerization:** Containerizing this pipeline to bypass public API rate limits (Draft PR in progress).
+2. **Async/Batch Processing:** Upgrading the synchronous `requests` layer to `aiohttp` for 50x throughput.


⚠️ Potential issue | 🟡 Minor

Roadmap item contradicts this PR's deliverables.

Line 149 lists "Dockerization" as a future GSoC 2026 task, yet this PR already introduces Dockerfile and docker-compose.yml. Update the roadmap to reflect that containerization is now in place and describe what remains (e.g., local SPARQL endpoint, CI/CD image publishing).

🤖 Prompt for AI Agents

In `@README.md` around lines 145 - 150, Update the "Roadmap for GSoC 2026" section to remove the claim that Dockerization is future work and instead state that Dockerization has been implemented (referencing the added Dockerfile and docker-compose.yml), then list remaining infra tasks such as providing a local SPARQL endpoint, adding CI/CD steps to build and publish container images, and documenting how to run the stack locally and in CI.

src/validation.py

coderabbitai · 2026-02-11T06:18:44Z

src/wiki_resolver.py

+        # 1. Did we get a redirect? (e.g., UK -> United Kingdom)
+        query = data.get("query", {})
+        redirects = query.get("redirects", [])
+        final_title = redirects[0]["to"] if redirects else term
+
+        # 2. Is the final page a DISAMBIGUATION page?
+        pages = query.get("pages", {})
+        for page_id in pages:
+            if page_id == "-1": continue # Page doesn't exist
+
+            page = pages[page_id]
+            # Check if it has the 'disambiguation' property
+            is_disambig = "pageprops" in page and "disambiguation" in page["pageprops"]
+
+            if not is_disambig:
+                # SUCCESS! It's a real page (e.g., "Physics", "United Kingdom").
+                # Return immediately. Do not go to Opensearch.
+                return page["title"]
+            else:
+                # It exists, but it's a Disambiguation (e.g., "Barca").
+                # IGNORE it. Fall through to Strategy 2 to find a better suggestion.
+                pass


⚠️ Potential issue | 🟡 Minor

final_title is computed but never used — redirect target is effectively ignored.

final_title (Line 31) captures the redirect target but is never referenced. The code returns page["title"] instead, which happens to be the same value for single-title queries, but this masks the intent and would silently break if the logic were extended to multi-title lookups.

Either use final_title when returning or remove it.

Proposed fix

redirects = query.get("redirects", []) - final_title = redirects[0]["to"] if redirects else term # 2. Is the final page a DISAMBIGUATION page? pages = query.get("pages", {}) for page_id in pages: if page_id == "-1": continue # Page doesn't exist page = pages[page_id] # Check if it has the 'disambiguation' property is_disambig = "pageprops" in page and "disambiguation" in page["pageprops"] if not is_disambig: - # SUCCESS! It's a real page (e.g., "Physics", "United Kingdom"). - # Return immediately. Do not go to Opensearch. return page["title"] else: - # It exists, but it's a Disambiguation (e.g., "Barca"). - # IGNORE it. Fall through to Strategy 2 to find a better suggestion. pass

🧰 Tools

🪛 Ruff (0.15.0)

[error] 31-31: Local variable final_title is assigned to but never used

Remove assignment to unused variable final_title

(F841)

[error] 36-36: Multiple statements on one line (colon)

(E701)

🤖 Prompt for AI Agents

In `@src/wiki_resolver.py` around lines 28 - 49, The computed redirect target final_title is never used; update the non-disambiguation return to prefer the redirect target (final_title) when present (i.e., return final_title if redirects else page["title"]) inside the pages loop (where page, pages, and is_disambig are checked), or remove final_title entirely if you prefer not to support redirects—ensure the code in the loop consistently uses final_title/term semantics so redirects aren't ignored.

coderabbitai

Actionable comments posted: 5

🤖 Fix all issues with AI agents

In `@README.md`:
- Around line 73-100: The README's "Mega-Benchmark" section is inconsistent: the
summary text claims "80%+ success rate" and the heading says "The
'Mega-Benchmark' Results (20/20 Ambiguity Test)" but the results table lists 19
passing rows (100% of those); fix by either adding the missing 20th benchmark
row (with its query, resolved entity, and verdict) to the table, or adjust the
heading and summary to match the actual data (e.g., change "20/20" to "19/19"
and update "80%+ success rate" to the correct percentage), and ensure the
"Mega-Benchmark" label and success-rate sentence are updated to remain
consistent with the table entries.

In `@src/reasoning.py`:
- Around line 1-3: The matplotlib backend selection is happening after importing
pyplot, so move the call to matplotlib.use('Agg') to module import time before
importing matplotlib.pyplot (i.e., add matplotlib.use('Agg') before the line
that imports "matplotlib.pyplot as plt"), then remove the backend-selection
lines from the function _plot_graph (remove the matplotlib.use('Agg') call
inside _plot_graph and any duplicate backend-selection code around lines 68-70)
so pyplot isn't reconfigured after import.
- Around line 61-66: The _query method currently uses a bare except which hides
fatal exceptions; replace it with a specific exception handler (e.g. except
(requests.RequestException, KeyError, ValueError):) around the
requests.get(...).json() call and subsequent r['results']['bindings'] access so
network errors and JSON/key parse issues are caught but
KeyboardInterrupt/SystemExit are allowed to propagate; keep the existing
behavior of returning [] on those handled exceptions and apply this change
inside the _query function.
- Around line 6-7: The DBPEDIA endpoint is hardcoded in the Reasoning class
__init__ (self.endpoint = "https://dbpedia.org/sparql"), causing it to ignore
environment-based configuration used elsewhere (validation.py reads
DBPEDIA_ENDPOINT); change Reasoning to read the DBPEDIA_ENDPOINT environment
variable (falling back to the existing default) instead of the hardcoded string,
e.g., read os.getenv("DBPEDIA_ENDPOINT") and assign to self.endpoint so Docker
Compose and validation.py remain consistent.

In `@src/validation.py`:
- Around line 18-22: The _sanitize_uri method is too aggressive—its allowlist
regex removes valid RFC3986 characters like '%' and '~', corrupting
percent-encoded URIs (e.g. Beyonc%C3%A9); change it to a denylist approach in
_sanitize_uri that only strips or rejects characters dangerous for SPARQL (e.g.
<, >, {, }, backtick, unescaped control chars, and backslashes) while leaving
percent-encodings, reserved characters (:/?#[]@!$&'()*+,;=) and tildes intact;
implement this by replacing the current re.sub allowlist with a targeted re.sub
that removes only the forbidden characters (and/or raise a ValueError on control
characters), and optionally validate with urllib.parse.urlparse to ensure a
minimally well-formed URI in the _sanitize_uri function.

🧹 Nitpick comments (3)

src/main.py (2)
5-14: Consider lazy-loading the SentenceTransformer model.

The model is loaded eagerly in __init__, which means importing or instantiating RelationExtractor always incurs the cost of downloading/loading all-MiniLM-L6-v2. For a demo this is fine, but if this class is ever imported without being used, it will be wasteful. Also, if the model isn't cached locally, SentenceTransformer('all-MiniLM-L6-v2') will attempt a network download — consider documenting this or adding error handling for offline scenarios.

31-41: Encoding the full sentence as fallback may produce misleading predicate matches.

When subject or object isn't found (or start >= end), phrase falls back to the entire text. Encoding the full sentence and matching it against short predicate exemplars (e.g., "plays for") will likely yield low but non-zero similarity. Since best_score starts at 0.0 and cosine similarity is almost never exactly zero, the method will still return a "best" predicate even when the extraction is meaningless. Consider returning a sentinel or applying a minimum confidence threshold.
Suggested threshold check
-        return best_pred, best_score, phrase
+        if best_score < 0.3:
+            return "None", best_score, phrase
+        return best_pred, best_score, phrase
src/reasoning.py (1)

41-57: 2-hop search issues up to 50 sequential SPARQL queries — potential for slow execution and rate limiting.

Each neighbor triggers an individual _query call to the public DBpedia endpoint. With 50 neighbors, this could take 50 × 5s (timeout) = ~4 minutes worst case, and may trigger rate limiting. Consider adding an early-exit log or reducing the limit, and note this limitation in the docstring.

coderabbitai · 2026-02-11T09:53:47Z

README.md

+## 📊 Benchmarks & Robustness
+
+We stress-tested the resolver against a **Mega-Benchmark of 20 challenging entities**, covering slang, acronyms, and high-ambiguity terms. The system achieved an **80%+ success rate**, significantly outperforming standard string matching.
+
+### 🏆 The "Mega-Benchmark" Results (20/20 Ambiguity Test)
+
+| Category | Query Input | Resolved Entity | Verdict |
+| :--- | :--- | :--- | :--- |
+| **Sports Slang** | `"Barca"` | **FC Barcelona** | ✅ PASS |
+| | `"Man City"` | **Manchester City F.C.** | ✅ PASS |
+| | `"CR7"` | **Cristiano Ronaldo** | ✅ PASS |
+| | `"Real Madrid"` | **Real Madrid CF** | ✅ PASS |
+| **Tech/Context** | `"Apple"` | **Apple Inc.** (Not the fruit) | ✅ PASS |
+| | `"Amazon"` | **Amazon (Company)** (Not the river) | ✅ PASS |
+| | `"Tesla"` | **Tesla, Inc.** | ✅ PASS |
+| | `"Python"` | **Python (programming language)** | ✅ PASS |
+| **Acronyms** | `"JFK"` | **John F. Kennedy** | ✅ PASS |
+| | `"KSA"` | **Saudi Arabia** | ✅ PASS |
+| | `"NYC"` | **New York City** | ✅ PASS |
+| | `"AI"` | **Artificial Intelligence** | ✅ PASS |
+| **Geography** | `"The Nile"` | **Nile** | ✅ PASS |
+| **Figures** | `"Bill Gates"` | **Bill Gates** | ✅ PASS |
+| | `"Steve Jobs"` | **Steve Jobs** | ✅ PASS |
+| | `"Obama"` | **Barack Obama** | ✅ PASS |
+| | `"Gandhi"` | **Mahatma Gandhi** | ✅ PASS |
+| **Culture** | `"The Beatles"` | **The Beatles** | ✅ PASS |
+| | `"Game of Thrones"` | **Game of Thrones** | ✅ PASS |
+


⚠️ Potential issue | 🟡 Minor

Benchmark table claims 80%+ success rate but shows 19/19 passes (100%) — no failures listed.

Line 75 states "80%+ success rate" but the table (lines 79-99) shows all 19 entries passing (the header says "20/20" but only 19 data rows are listed). Either include failure cases to substantiate the 80% claim, add the missing 20th entry, or correct the success rate figure.

🧰 Tools

🪛 LanguageTool

[uncategorized] ~81-~81: Did you mean “Barça” (football club in Barcelona)?
Context: ...-- | :--- | :--- | | Sports Slang | "Barca" | FC Barcelona | ✅ PASS | | | `"M...

(BARCA_DIACRITIC)

🤖 Prompt for AI Agents

In `@README.md` around lines 73 - 100, The README's "Mega-Benchmark" section is inconsistent: the summary text claims "80%+ success rate" and the heading says "The 'Mega-Benchmark' Results (20/20 Ambiguity Test)" but the results table lists 19 passing rows (100% of those); fix by either adding the missing 20th benchmark row (with its query, resolved entity, and verdict) to the table, or adjust the heading and summary to match the actual data (e.g., change "20/20" to "19/19" and update "80%+ success rate" to the correct percentage), and ensure the "Mega-Benchmark" label and success-rate sentence are updated to remain consistent with the table entries.

src/reasoning.py

src/validation.py

…rks)

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/validation.py`:
- Around line 18-32: The _sanitize_uri currently removes all whitespace (via \s)
which collapses multi-word labels and breaks DBpedia URIs; update _sanitize_uri
to first replace whitespace with underscores (e.g., convert any \s+ to '_') and
then run the existing regex cleanup to strip illegal SPARQL characters, so
verify_relationship's subj_clean and obj_clean will form valid DBpedia resource
URIs (e.g., "Cristiano Ronaldo" -> "Cristiano_Ronaldo" ->
http://dbpedia.org/resource/Cristiano_Ronaldo"); keep the existing logic in
verify_relationship that prepends the DBpedia prefix when the cleaned value does
not start with "http".

🧹 Nitpick comments (3)

src/validation.py (1)
40-40: Split into two lines for readability.

Static analysis flags E701 (multiple statements on one line).
Proposed fix
-            if response.status_code != 200: return False, []
+            if response.status_code != 200:
+                return False, []
src/reasoning.py (2)
71-90: _plot_graph always overwrites reasoning_output.png — consider unique filenames.

If find_path is called multiple times in a session (or by concurrent users), each call silently overwrites the previous output. This is a minor concern for a prototype but worth noting.

48-48: Style: multiple statements on one line (E701).
Proposed fix
-            if "http://dbpedia.org/resource/" not in mid_uri: continue
+            if "http://dbpedia.org/resource/" not in mid_uri:
+                continue

src/validation.py

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/reasoning.py`:
- Around line 65-70: In _query, avoid calling .json() on the requests.get()
return without checking the HTTP response: call requests.get(...) into a
variable (e.g., resp), use resp.raise_for_status() or verify resp.status_code is
2xx before parsing, then call resp.json() and extract r['results']['bindings'];
update the except block to separately handle requests.RequestException and
JSON/KeyError (so you don't swallow HTTP errors) and return [] on failure — this
change should be made inside the _query method.

src/reasoning.py

mommi84

Thanks for your PR!

Please take a look at the current state of the project for the English language, which can be found in the GSoC25 folder.

I personally liked the idea of exploiting wiki-links to compute the relatedness between two entities in the text. However, many of the proposed features in this PR already exist in the current project.

Please go through the comments and the existing codebase before resubmitting your PR, or write a project proposal for GSoC 2026 specifying your expected roadmap to address the outstanding issues.

mommi84 · 2026-02-11T15:56:23Z

assets/benchmark.png

AI and Amazon look wrong too, so the final score should actually be 6/9, not 8/9.

mommi84 · 2026-02-11T16:00:56Z

src/dbpedia_linker.py

+# IMPORT YOUR RESOLVER FROM THE OTHER FILE
+from src.wiki_resolver import resolve_redirect
+
+DBPEDIA_LOOKUP = "https://lookup.dbpedia.org/api/search"


No need to use DBpedia Lookup. In this project, we indexed Wikipedia anchors into a Redis database. We query that to retrieve a list of entities given an input string.

See e.g. https://github.com/dbpedia/neural-extraction-framework/blob/main/GSoC25/EntityLinking/test_redis.py

mommi84 · 2026-02-11T16:01:23Z

src/dbpedia_linker.py

+DBPEDIA_LOOKUP = "https://lookup.dbpedia.org/api/search"
+
+# Weights Configuration
+TYPE_PRIOR = {


How were these weights found?

mommi84 · 2026-02-11T16:02:24Z

src/dbpedia_linker.py

+    "Country": 1.7, "Album": 1.5, "Film": 1.4, "Work": 1.2,
+    "Event": 0.3, "Season": 0.05, "Tournament": 0.2
+}
+STOPWORDS = {"fc", "club", "sc", "the", "of", "and", "in", "inc"}


Why do stopwords include important words such as 'club' or 'fc'/'inc' which can distinguish, e.g., a football club from its city?

mommi84 · 2026-02-11T16:03:20Z

src/linker.py

+    print("⚠️ Warning: Spacy model 'en_core_web_sm' not found. Run: python -m spacy download en_core_web_sm")
+    nlp = None
+
+WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"


Again, no need to use the Wikipedia API. In this project, we indexed Wikipedia redirects into a Redis database. We query that to retrieve page redirects too.

See e.g. https://github.com/dbpedia/neural-extraction-framework/blob/main/GSoC25/EntityLinking/test_redis.py

mommi84 · 2026-02-11T16:04:08Z

src/main.py

+    def __init__(self):
+        self.model = SentenceTransformer('all-MiniLM-L6-v2')
+        self.predicates = {
+            "dbo:team": ["plays for", "signed with", "club", "team"],


Why were these hard-coded?

NakulSingh156 · 2026-02-12T06:55:51Z

Thanks for your PR!

Please take a look at the current state of the project for the English language, which can be found in the GSoC25 folder.

I personally liked the idea of exploiting wiki-links to compute the relatedness between two entities in the text. However, many of the proposed features in this PR already exist in the current project.

Please go through the comments and the existing codebase before resubmitting your PR, or write a project proposal for GSoC 2026 specifying your expected roadmap to address the outstanding issues.

Thanks a lot for the detailed review @mommi84! This is super helpful.

I’ve noted all your points and am working on the fixes:

Redis vs APIs: I actually missed the test_redis.py infrastructure in GSoC25. Using the internal Redis index is definitely better than calling public APIs. I’ll refactor the linker to follow that pattern.
Benchmarks: Good catch on 'AI' and 'Amazon'. I was being a bit too optimistic there. I’ll correct the score to 6/9.
Weights & Stopwords: The current weights were just based on my initial trial-and-error, I used to patch specific errors (like distinguishing "La Liga" the league vs the season). I see now that hardcoding them makes the system brittle, and removing 'FC'/'Club' was a mistake (since they distinguish the team from the city). I’ll revert to a cleaner, data-driven approach and fix the stopword list.
Main.py Predicates: I hardcoded those mappings solely to create a lightweight demo that runs without downloading the full 3GB extraction model. I will replace that dictionary with a proper ModelInterface so the system can load the real model (or a smaller distilled version) instead of relying on fixed strings.

I’ll push the changes shortly!

…bolic pipeline

coderabbitai

Actionable comments posted: 7

🤖 Fix all issues with AI agents

In `@src/linker.py`:
- Around line 28-31: The fallback in resolve_entity returns (resolved_term,
"Unresolved") which yields a string for the types component; change this to
return a list to match the happy path (e.g., (resolved_term, []) or
(resolved_term, ["Unresolved"])) so the second element is always a list coming
from get_type; update the return in the if not candidates block in
resolve_entity accordingly and ensure any callers (e.g., places that ignore the
second value) still handle a list.

In `@src/main.py`:
- Around line 72-74: The fallback assignment is dead because resolve_entity
never returns a falsy value, so subj_final = subj_linked if subj_linked else
subj_raw and obj_final = obj_linked if obj_linked else obj_raw are redundant;
fix by either (A) simplifying to subj_final = subj_linked and obj_final =
obj_linked, or (B) change resolve_entity to return None on unresolved lookups
and keep the conditional, or (C) change the condition to explicitly detect
unresolved marker (e.g., subj_linked == "" or subj_linked == UNRESOLVED)
depending on which behavior you intend—update resolve_entity and the assignments
(references: resolve_entity, subj_linked, subj_raw, subj_final, obj_linked,
obj_raw, obj_final) accordingly.
- Around line 15-24: The entity spans returned by extract_entity_spans (called
from extract in main.py) are non-deterministic because they originate from a
set; update the flow to preserve sentence order by sorting spans by their
character start index before using them as subj/obj. Either fix
extract_entity_spans in linker.py to return a list ordered by span start, or in
main.py's extract method take the returned spans and sort them by their position
in the input text (e.g., by a start offset attribute or by finding
index(text.find(span))). Then use the first sorted span as subj and the last as
obj so subject/object assignment is deterministic.

In `@src/seed_redis.py`:
- Around line 5-6: The script currently hardcodes port 6379 when creating the
Redis client (variable r in seed_redis.py) which is inconsistent with
RedisClient that reads REDIS_PORT; update seed_redis.py to read REDIS_PORT from
the environment (e.g., port_str = os.getenv("REDIS_PORT", "6379")), convert it
to an int, and pass that int as the port argument when constructing
redis.Redis(host=host, port=port, decode_responses=True) so the script honors
the same environment configuration as RedisClient.

In `@src/validation.py`:
- Around line 11-16: The list weak_predicates is defined but never used; either
delete the attribute or integrate it into the validator by filtering out those
predicates when producing/assessing Tier 2 results. Update the validation logic
(where Tier 2 or predicate-level filtering occurs) to treat any predicate in
self.weak_predicates as "weak" — e.g., build a set from self.weak_predicates and
skip or downgrade triples whose predicate is in that set before returning Tier 2
findings (or alternatively remove the weak_predicates attribute if you choose
not to use it).
- Around line 44-48: Remove the dead requests.Session() and the informal
comments around it: the unused attribute self.sparql = requests.Session() and
the "Wait, ..." commentary should be deleted; instead, keep and rely on the
existing use of requests.get() and the self.endpoint variable in the validation
logic (search for self.sparql, requests.Session, and the surrounding try block
in validation.py) so there is no unused session attribute or developer-note text
left in production code.
- Around line 68-72: The __main__ block calls a non-existent method
verify_relationship on FactChecker and will raise AttributeError; update the
__main__ to call the existing validate_triple method (or add a wrapper named
verify_relationship that delegates to validate_triple). Locate the FactChecker
class and the top-level __main__ snippet that calls verify_relationship, then
either replace verify_relationship("Cristiano_Ronaldo", "Real_Madrid_CF") with
validate_triple("Cristiano_Ronaldo", "Real_Madrid_CF") or implement a
verify_relationship(self, subj, obj) method that calls and returns
validate_triple(subj, obj) so the script runs without error.

🧹 Nitpick comments (7)

src/seed_redis.py (1)

8-10: flushdb() is a destructive operation with no safeguard.

In a prototype this is acceptable, but if this script is ever run against a shared or non-ephemeral Redis instance, it will wipe all data. Consider guarding with an env flag (e.g., ALLOW_FLUSH=true) or at least logging a prominent warning before flushing.

docker-compose.yml (1)

1-1: version key is obsolete in modern Docker Compose.

Docker Compose V2 ignores the version field and emits a warning. It can be safely removed.

src/redis_client.py (1)

4-8: No error handling for Redis connection failure.

If Redis is unavailable, the redis.Redis() constructor won't fail, but the first command will raise redis.ConnectionError. Callers like HybridLinker have no try/except around Redis operations, so a connection issue will crash the pipeline with an unhandled exception.

Consider adding a connection check (e.g., self.client.ping()) in __init__ with a clear error message, or document that callers must handle redis.ConnectionError.
src/linker.py (2)
49-53: HybridLinker (and thus RedisClient) is re-instantiated on every call.

get_best_entity creates a new HybridLinker → new RedisClient → new Redis connection per invocation. In a loop (like Pipeline.run), this is wasteful. Use a module-level singleton or accept a linker instance.
Proposed fix
+_linker = None
+
 def get_best_entity(term, sentence_context=None):
-    linker = HybridLinker()
+    global _linker
+    if _linker is None:
+        _linker = HybridLinker()
+    linker = _linker
     ent, types = linker.resolve_entity(term, sentence_context)
     return ent, types
8-12: Bare except silently swallows all exceptions.

This catches KeyboardInterrupt, SystemExit, etc. Catch OSError or Exception at minimum.
Proposed fix
 try:
     nlp = spacy.load("en_core_web_sm")
-except:
+except OSError:
     print("⚠️ Warning: Spacy model 'en_core_web_sm' not found. Run: python -m spacy download en_core_web_sm")
     nlp = None
src/main.py (1)

39-39: Move import to the top of the file.

The from validation import FactChecker import is placed mid-file between two class definitions. This reduces readability and is non-idiomatic.

src/reasoning.py (1)

44-61: N+1 SPARQL query pattern in 2-hop BFS — up to 50 additional round-trips.

Each neighbor triggers an individual SPARQL query to check connectivity to the end entity. With LIMIT 50 neighbors, this can issue up to 50 sequential HTTP requests with 5s timeouts each (worst case: ~250s). For a prototype this may be acceptable, but consider batching candidates into a single VALUES-based query if performance becomes an issue.

coderabbitai · 2026-02-14T04:48:00Z

src/linker.py

+        if not candidates:
+            # Fallback: If not in Redis, return the term itself as a best-effort guess
+            # This ensures we don't crash on terms we didn't seed
+            return resolved_term, "Unresolved"


⚠️ Potential issue | 🟠 Major

Inconsistent return type: str vs list for the types component.

When no candidates are found, resolve_entity returns (resolved_term, "Unresolved") (a string), but the happy path on line 47 returns (best_match, types) where types is a list from get_type. In main.py line 69-70, the type is discarded (_), so it doesn't crash today, but any future caller expecting a consistent type will break.

Proposed fix

if not candidates: - return resolved_term, "Unresolved" + return resolved_term, ["Unresolved"]

🤖 Prompt for AI Agents

In `@src/linker.py` around lines 28 - 31, The fallback in resolve_entity returns (resolved_term, "Unresolved") which yields a string for the types component; change this to return a list to match the happy path (e.g., (resolved_term, []) or (resolved_term, ["Unresolved"])) so the second element is always a list coming from get_type; update the return in the if not candidates block in resolve_entity accordingly and ensure any callers (e.g., places that ignore the second value) still handle a list.

coderabbitai · 2026-02-14T04:48:00Z

src/main.py

+    def extract(self, text):
+        # 1. Heuristic Entity Extraction (for demo purposes)
+        # We assume the sentence has at least two entities.
+        spans = extract_entity_spans(text)
+        if len(spans) < 2: 
+             return "Unknown", "Unknown", "dbo:related", text
+
+        # Heuristic: First entity is Subject, Last is Object
+        # This handles "The Ballon d'Or was won by Lionel Messi" -> spans=["Ballon d'Or", "Lionel Messi"]
+        subj, obj = spans[0], spans[-1]


⚠️ Potential issue | 🟠 Major

Entity span ordering is non-deterministic — subj and obj may swap between runs.

extract_entity_spans in linker.py collects spans into a set, then returns list(normalized). Set iteration order in Python is not guaranteed to match the order spans appear in the sentence. The heuristic "first entity is Subject, last is Object" (line 24) therefore produces unpredictable results.

The spans should preserve sentence order. For example, sort by their position in the input text:

Proposed fix (in extract or at call site)

spans = extract_entity_spans(text) + # Sort spans by their position in the text for stable subj/obj assignment + spans.sort(key=lambda s: text.find(s)) if len(spans) < 2:

🤖 Prompt for AI Agents

In `@src/main.py` around lines 15 - 24, The entity spans returned by extract_entity_spans (called from extract in main.py) are non-deterministic because they originate from a set; update the flow to preserve sentence order by sorting spans by their character start index before using them as subj/obj. Either fix extract_entity_spans in linker.py to return a list ordered by span start, or in main.py's extract method take the returned spans and sort them by their position in the input text (e.g., by a start offset attribute or by finding index(text.find(span))). Then use the first sorted span as subj and the last as obj so subject/object assignment is deterministic.

coderabbitai · 2026-02-14T04:48:00Z

src/main.py

+            # Safety Fallback: If Redis knows it, use the linked version. Otherwise, try the raw string.
+            subj_final = subj_linked if subj_linked else subj_raw
+            obj_final = obj_linked if obj_linked else obj_raw


⚠️ Potential issue | 🟡 Minor

Dead-code fallback: resolve_entity always returns a non-empty string.

resolve_entity returns either the Redis-resolved term or the original input — never an empty/falsy value. The condition subj_linked if subj_linked else subj_raw will always select subj_linked, making the fallback unreachable.

🤖 Prompt for AI Agents

In `@src/main.py` around lines 72 - 74, The fallback assignment is dead because resolve_entity never returns a falsy value, so subj_final = subj_linked if subj_linked else subj_raw and obj_final = obj_linked if obj_linked else obj_raw are redundant; fix by either (A) simplifying to subj_final = subj_linked and obj_final = obj_linked, or (B) change resolve_entity to return None on unresolved lookups and keep the conditional, or (C) change the condition to explicitly detect unresolved marker (e.g., subj_linked == "" or subj_linked == UNRESOLVED) depending on which behavior you intend—update resolve_entity and the assignments (references: resolve_entity, subj_linked, subj_raw, subj_final, obj_linked, obj_raw, obj_final) accordingly.

coderabbitai · 2026-02-14T04:48:00Z

src/seed_redis.py

+    host = os.getenv("REDIS_HOST", "localhost")
+    r = redis.Redis(host=host, port=6379, decode_responses=True)


⚠️ Potential issue | 🟡 Minor

REDIS_PORT environment variable is ignored here but used elsewhere.

RedisClient in src/redis_client.py reads REDIS_PORT from the environment, and docker-compose.yml sets it. This script hardcodes port 6379, creating an inconsistency if the port is ever changed.

Proposed fix

def seed_data(): host = os.getenv("REDIS_HOST", "localhost") - r = redis.Redis(host=host, port=6379, decode_responses=True) + port = int(os.getenv("REDIS_PORT", 6379)) + r = redis.Redis(host=host, port=port, decode_responses=True)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

host = os.getenv("REDIS_HOST", "localhost")

r = redis.Redis(host=host, port=6379, decode_responses=True)

host = os.getenv("REDIS_HOST", "localhost")

port = int(os.getenv("REDIS_PORT", 6379))

r = redis.Redis(host=host, port=port, decode_responses=True)

🤖 Prompt for AI Agents

In `@src/seed_redis.py` around lines 5 - 6, The script currently hardcodes port 6379 when creating the Redis client (variable r in seed_redis.py) which is inconsistent with RedisClient that reads REDIS_PORT; update seed_redis.py to read REDIS_PORT from the environment (e.g., port_str = os.getenv("REDIS_PORT", "6379")), convert it to an int, and pass that int as the port argument when constructing redis.Redis(host=host, port=port, decode_responses=True) so the script honors the same environment configuration as RedisClient.

coderabbitai · 2026-02-14T04:48:00Z

src/validation.py

+        self.weak_predicates = [
+            "http://dbpedia.org/ontology/wikiPageWikiLink",
+            "http://dbpedia.org/property/wikiPageUsesTemplate",
+            "http://www.w3.org/2002/07/owl#sameAs",
+            "http://purl.org/dc/terms/subject"
+        ]


⚠️ Potential issue | 🟡 Minor

weak_predicates is defined but never referenced.

This list is unused in any method. Either remove it or wire it into the validation logic (e.g., filtering Tier 2 results).

🤖 Prompt for AI Agents

In `@src/validation.py` around lines 11 - 16, The list weak_predicates is defined but never used; either delete the attribute or integrate it into the validator by filtering out those predicates when producing/assessing Tier 2 results. Update the validation logic (where Tier 2 or predicate-level filtering occurs) to treat any predicate in self.weak_predicates as "weak" — e.g., build a set from self.weak_predicates and skip or downgrade triples whose predicate is in that set before returning Tier 2 findings (or alternatively remove the weak_predicates attribute if you choose not to use it).

coderabbitai · 2026-02-14T04:48:00Z

src/validation.py

+        try:
+            # Check Tier 1
+            self.sparql = requests.Session() # Mocking sparql wrapper behavior strictly with requests
+            # Wait, user code used self.sparql.setQuery which implies SPARQLWrapper but we are using requests.
+            # I must adapt the logic to use self.endpoint and requests as established before.


🛠️ Refactor suggestion | 🟠 Major

Dead code and leftover thinking-out-loud comments.

Lines 46–48 create an unused requests.Session() and contain stream-of-consciousness comments that should not be in production code. The session is never used — the actual requests on lines 52 and 58 use requests.get() directly.

Proposed fix

try: - # Check Tier 1 - self.sparql = requests.Session() # Mocking sparql wrapper behavior strictly with requests - # Wait, user code used self.sparql.setQuery which implies SPARQLWrapper but we are using requests. - # I must adapt the logic to use self.endpoint and requests as established before. - - # ADAPTING TO REQUESTS: + # Check Tier 1 (Strict Ontology Match) params = {"query": strict_query, "format": "json"}

🤖 Prompt for AI Agents

In `@src/validation.py` around lines 44 - 48, Remove the dead requests.Session() and the informal comments around it: the unused attribute self.sparql = requests.Session() and the "Wait, ..." commentary should be deleted; instead, keep and rely on the existing use of requests.get() and the self.endpoint variable in the validation logic (search for self.sparql, requests.Session, and the surrounding try block in validation.py) so there is no unused session attribute or developer-note text left in production code.

coderabbitai · 2026-02-14T04:48:00Z

src/validation.py

+if __name__ == "__main__":
+    checker = FactChecker()
+    print("Testing FactChecker: Ronaldo -> Real Madrid...")
+    exists, preds = checker.verify_relationship("Cristiano_Ronaldo", "Real_Madrid_CF")
+    print(f"Verified: {exists}, Predicates: {preds}")


⚠️ Potential issue | 🔴 Critical

verify_relationship does not exist — the __main__ block will crash.

The class defines validate_triple, not verify_relationship. Running this file directly will raise AttributeError.

Proposed fix

if __name__ == "__main__": checker = FactChecker() print("Testing FactChecker: Ronaldo -> Real Madrid...") - exists, preds = checker.verify_relationship("Cristiano_Ronaldo", "Real_Madrid_CF") - print(f"Verified: {exists}, Predicates: {preds}") + result = checker.validate_triple("Cristiano_Ronaldo", "dbo:team", "Real_Madrid_CF") + print(f"Result: {result}")

🤖 Prompt for AI Agents

In `@src/validation.py` around lines 68 - 72, The __main__ block calls a non-existent method verify_relationship on FactChecker and will raise AttributeError; update the __main__ to call the existing validate_triple method (or add a wrapper named verify_relationship that delegates to validate_triple). Locate the FactChecker class and the top-level __main__ snippet that calls verify_relationship, then either replace verify_relationship("Cristiano_Ronaldo", "Real_Madrid_CF") with validate_triple("Cristiano_Ronaldo", "Real_Madrid_CF") or implement a verify_relationship(self, subj, obj) method that calls and returns validate_triple(subj, obj) so the script runs without error.

NakulSingh156 · 2026-02-14T09:03:46Z

Thanks for your PR!

Please take a look at the current state of the project for the English language, which can be found in the GSoC25 folder.

I personally liked the idea of exploiting wiki-links to compute the relatedness between two entities in the text. However, many of the proposed features in this PR already exist in the current project.

Please go through the comments and the existing codebase before resubmitting your PR, or write a project proposal for GSoC 2026 specifying your expected roadmap to address the outstanding issues.

Hi @mommi84! Thanks again for the detailed feedback! I've just pushed a major architectural overhaul that addresses all the points:

Redis vs APIs: Fully migrated to a local Redis cache. I added a docker-compose.yml and seed_redis.py to handle the infrastructure offline, perfectly aligning with the GSoC25 test_redis.py pattern.

Benchmarks: You were right about the scoring—I've corrected it. I also expanded the testing into a 20-item "Mega-Benchmark" (documented in the README) to better track edge cases moving forward.

Weights & Stopwords: Removed the hardcoded 0.7 * fuzz logic and stopword lists in favor of a cleaner, type-aware resolution flow in the HybridLinker.

Main.py Predicates: Implemented a proper ModelInterface abstraction. The system now loads all-MiniLM-L6-v2 to calculate semantic similarity instead of relying on fixed dictionary lookups.

I've also added a Neuro-Symbolic validation layer to act as a guardrail against hallucinations. Let me know if the new architecture looks solid or if you'd like any further tweaks!

Feat: Complete Neuro-Symbolic Pipeline with Docker Infra

a760087

coderabbitai bot reviewed Feb 11, 2026

View reviewed changes

Fix: Address CodeRabbit review (Security, Docker loop, Validation)

0354ab5

NakulSingh156 force-pushed the feature/docker-final branch from a2286d3 to 0354ab5 Compare February 11, 2026 09:46

coderabbitai bot reviewed Feb 11, 2026

View reviewed changes

Fix: Apply CodeRabbit feedback (Matplotlib, URI sanitization, Benchma…

4b0c656

…rks)

coderabbitai bot reviewed Feb 11, 2026

View reviewed changes

src/validation.py Outdated Show resolved Hide resolved

Fix: Round 3 CodeRabbit feedback (URI spaces and style)

31d5e20

coderabbitai bot reviewed Feb 11, 2026

View reviewed changes

src/reasoning.py Show resolved Hide resolved

Fix: Round 4 CodeRabbit feedback (Safer JSON parsing)

6531d5b

mommi84 requested changes Feb 11, 2026

View reviewed changes

mallasiddharthreddy mentioned this pull request Feb 13, 2026

Added missing imports, cleaned up unused types and fixed indentations #27

Open

feat: Integrate Redis architecture, mock data fixtures, and Neuro-Sym…

1963479

…bolic pipeline

coderabbitai bot reviewed Feb 14, 2026

View reviewed changes

refactor: Apply CodeRabbit review fixes (ports, error handling, cleanup)

d586bfc

		host = os.getenv("REDIS_HOST", "localhost")
		r = redis.Redis(host=host, port=6379, decode_responses=True)

Conversation

NakulSingh156 commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Architecture Overhaul: Neuro-Symbolic Pipeline & Docker Infrastructure

Motivation

Core Architectural Updates

Neuro-Symbolic Features

CI/CD & Code Quality

Proof of Build (End-to-End Pipeline Validation)

Things Addressed in this PR:

Uh oh!

coderabbitai bot commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mommi84 left a comment

Choose a reason for hiding this comment

Uh oh!

mommi84 Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

mommi84 Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

mommi84 Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

mommi84 Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

mommi84 Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

mommi84 Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

NakulSingh156 commented Feb 12, 2026

Uh oh!

NakulSingh156 commented Feb 11, 2026 •

edited

Loading

coderabbitai bot commented Feb 11, 2026 •

edited

Loading