
Replaced brittle prompt with pydantic schema validation #30

Open
Nossks wants to merge 1 commit into dbpedia:main from Nossks:fix/schema-validation

Nossks commented Jan 29, 2026

This PR replaces the legacy index-based prompt selection logic in disambiguate_triple with Gemini's native Structured Output (response_schema). It also introduces a dedicated output_schema.py for type definitions and fixes a critical data integrity issue where invalid predicates defaulted to allowed[0].

Summary by CodeRabbit

  • New Features

    • Added a structured response schema for disambiguation to enable safer, typed handling of model outputs.
  • Improvements

    • Stricter, schema-based validation of disambiguation responses for more reliable results.
    • Robust fallback handling: invalid or unparsable responses reset selections to safe defaults.
    • Invalid predicates are explicitly rejected instead of defaulting, reducing incorrect choices.
    • Safeguards to prevent subject/object role swaps and resolve URI collisions.
    • Metadata now labels rejected hallucinations vs valid candidates more clearly.
    • More descriptive prompts and clearer display of allowed predicates and candidate lists.



coderabbitai bot commented Jan 29, 2026

Walkthrough

Replaces ad-hoc JSON parsing with a Pydantic DisambiguationResult schema and validates model responses via model_validate_json. Introduces generation_config, typed field access, predicate validation/fallback, index clamping, URI collision/role-swap guards, updated prompt wording, and meta labeling for hallucination handling.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Schema Definition: GSoC25/NEF/output_schema.py | Adds DisambiguationResult Pydantic model with subject_index, predicate_uri, and object_index fields. |
| Disambiguation Logic Refactor: GSoC25/NEF/NEF.py | Removes internal _get_json() helper. Introduces generation_config passed to the model, uses DisambiguationResult.model_validate_json() to parse responses, switches to typed field access, adds try/except to reset predicate/indices on validation errors, clamps indices, validates predicate against allowed set, maps indices to URIs with guards for role-swaps and identical URIs, and updates prompt wording and meta labels ("candidate" / "hallucination_rejected"). |
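
For readers without the diff at hand, the schema described in the table could look roughly like the sketch below. The three field names come from the walkthrough; the field types, the descriptions, and the sample JSON are assumptions, not the PR's actual code. It assumes Pydantic v2, where `model_validate_json` is available.

```python
from pydantic import BaseModel, Field, ValidationError


class DisambiguationResult(BaseModel):
    """Hypothetical shape of the schema; the field types are assumed."""

    subject_index: int = Field(description="Index into the subject candidate list")
    predicate_uri: str = Field(description="URI chosen from the allowed predicates")
    object_index: int = Field(description="Index into the object candidate list")


# A well-formed response parses into typed attributes.
raw = (
    '{"subject_index": 1, '
    '"predicate_uri": "http://dbpedia.org/ontology/birthPlace", '
    '"object_index": 0}'
)
result = DisambiguationResult.model_validate_json(raw)
print(result.subject_index, result.predicate_uri, result.object_index)

# A response with missing fields raises ValidationError instead of passing silently.
try:
    DisambiguationResult.model_validate_json('{"subject_index": 1}')
except ValidationError as exc:
    print("rejected:", exc.error_count(), "error(s)")
```

Typed attribute access replaces dictionary lookups on hand-parsed JSON, so a malformed response surfaces as one exception type rather than as a scattered set of KeyError or TypeError cases.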

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Disambiguator as LLMDisambiguator
  participant LLM as LanguageModel
  participant Validator as DisambiguationResult
  participant Repo as CandidatePools

  User->>Disambiguator: request disambiguation (prompt + candidate pools)
  Disambiguator->>LLM: generate response (with generation_config & schema)
  LLM-->>Disambiguator: JSON response
  Disambiguator->>Validator: DisambiguationResult.model_validate_json(response)
  alt validation succeeds
    Validator-->>Disambiguator: typed fields (subject_index, predicate_uri, object_index)
    Disambiguator->>Repo: map indices → URIs (clamp & collision checks)
    Disambiguator->>Disambiguator: detect role-swaps/duplicates, adjust
    Disambiguator-->>User: result (meta: "candidate")
  else validation fails
    Validator-->>Disambiguator: validation error
    Disambiguator->>Disambiguator: set predicate=None, subject_index=0, object_index=0
    Disambiguator->>Repo: fallback mapping (safe defaults)
    Disambiguator-->>User: result (meta: "hallucination_rejected")
  end
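
The post-validation steps in the diagram (clamping, predicate check, collision guard) can be sketched in plain Python. This is an illustration under stated assumptions, not the code in NEF.py: the helper names `clamp_index` and `resolve_triple`, the collision policy, the point at which the "hallucination_rejected" label is applied, and the sample URIs are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def clamp_index(idx: int, pool_size: int) -> int:
    """Force idx into the valid range [0, pool_size - 1]."""
    if pool_size <= 0:
        raise ValueError("candidate pool must not be empty")
    return max(0, min(idx, pool_size - 1))


def resolve_triple(
    s_idx: int,
    pred_uri: Optional[str],
    o_idx: int,
    subjects: Sequence[str],
    objects: Sequence[str],
    allowed: Sequence[str],
) -> Tuple[str, Optional[str], str, str]:
    """Map model output to URIs; returns (subject, predicate, object, label)."""
    s_uri = subjects[clamp_index(s_idx, len(subjects))]
    o_uri = objects[clamp_index(o_idx, len(objects))]

    # Reject predicates outside the allowed set instead of defaulting to allowed[0].
    if pred_uri in allowed:
        label = "candidate"
    else:
        pred_uri, label = None, "hallucination_rejected"

    # Collision guard (assumed policy): if subject and object resolve to the
    # same URI, take the first object candidate that differs from the subject.
    if s_uri == o_uri:
        o_uri = next((u for u in objects if u != s_uri), o_uri)

    return s_uri, pred_uri, o_uri, label


subjects = ["dbr:Paris", "dbr:Paris_Hilton"]
objects = ["dbr:Paris", "dbr:France"]
allowed = ["dbo:country", "dbo:capital"]

# Out-of-range object index 7 is clamped to the last candidate.
print(resolve_triple(0, "dbo:country", 7, subjects, objects, allowed))
# Unknown predicate is rejected; the subject/object collision is resolved.
print(resolve_triple(0, "dbo:madeUp", 0, subjects, objects, allowed))
```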

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Replaced brittle prompt with pydantic schema validation' accurately summarizes the main change: replacing brittle prompt logic with Pydantic schema validation to ensure type safety and data integrity. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@GSoC25/NEF/NEF.py`:
- Around line 387-390: The meta dictionary sets label incorrectly as
"hallucinaiton_rejected"; update the string to the correct spelling
"hallucination_rejected" where meta is created (look for the meta = {...}
assignment that uses chosen_sim, sim_map, pred_uri and sets "label") so
downstream filters get the correct label.
- Around line 357-360: Wrap the call to
DisambiguationResult.model_validate_json(resp.text) and the subsequent accesses
to data.predicate_uri / data.subject_index / data.object_index in a try/except
that catches ValidationError and ValueError; on exception log the error (using
the existing logger in this module or self.logger), and return or yield a safe
fallback (e.g., None or a sentinel) so the pipeline can continue instead of
crashing. Ensure the try block contains the lines that assign to data, pred_uri,
s_idx, o_idx, and the except block sets pred_uri, s_idx, o_idx to safe defaults
and emits a clear log message including resp.text or the exception for
debugging.
🧹 Nitpick comments (1)
GSoC25/NEF/NEF.py (1)

12-12: If GSoC25/NEF becomes a package, the import will need to be relative.

The absolute import from output_schema import DisambiguationResult currently works because output_schema.py is in the same directory. However, if GSoC25/NEF is converted to a package (by adding __init__.py to match the pattern of other GSoC25 submodules), this import will fail when the module is imported as GSoC25.NEF.NEF. Convert to a relative import and add __init__.py:

Suggested change (if package usage is intended)
-from output_schema import DisambiguationResult
+from .output_schema import DisambiguationResult

Also add an empty GSoC25/NEF/__init__.py file.

Nossks force-pushed the fix/schema-validation branch from 9765fdf to 28ff871 on January 29, 2026 14:37

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@GSoC25/NEF/NEF.py`:
- Line 12: NEF.py currently uses a relative import "from .output_schema import
DisambiguationResult" which fails when run as a script; change it to a safe
import that works in both package and script contexts by replacing the relative
import with a fallback/absolute form—for example try importing
DisambiguationResult from output_schema first (or use an absolute import path)
and fall back to the package-style import in an except ImportError block; update
the import that references DisambiguationResult/output_schema accordingly so CLI
execution (python NEF/NEF.py ...) no longer raises "attempted relative import
with no known parent package".
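
The dual-mode import suggested here is commonly written as a try/except around the two import forms. The sketch below makes the pattern runnable on its own by writing a stand-in `output_schema.py` to a temporary directory; that scaffolding and the empty stand-in class are assumptions for illustration only and would not appear in NEF.py.

```python
import pathlib
import sys
import tempfile

# Scaffolding so the sketch runs standalone: a stand-in output_schema.py
# placed on sys.path. In the repository the real file sits next to NEF.py.
tmp_dir = pathlib.Path(tempfile.mkdtemp())
(tmp_dir / "output_schema.py").write_text("class DisambiguationResult:\n    pass\n")
sys.path.insert(0, str(tmp_dir))

try:
    # Script mode (python NEF/NEF.py ...): the script's directory is on
    # sys.path, so the sibling module is importable by its bare name.
    from output_schema import DisambiguationResult
except ImportError:
    # Package mode (import GSoC25.NEF.NEF): the bare name is not found,
    # so fall back to the package-relative import.
    from .output_schema import DisambiguationResult

print(DisambiguationResult.__name__)
```

Only the try/except block is the pattern itself. Trying the bare name first keeps CLI execution working, while the fallback covers the case where the directory is imported as a package.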

Nossks force-pushed the fix/schema-validation branch from 28ff871 to c9c5be9 on January 29, 2026 16:52

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@GSoC25/NEF/NEF.py`:
- Around line 348-357: The generation_config is currently a plain dict but must
be an instance of types.GenerateContentConfig; replace the dict assigned to
generation_config with
types.GenerateContentConfig(response_mime_type="application/json",
response_schema=DisambiguationResult) and pass that object into the call to
self.client.models.generate_content (the model=self.model_name, contents=prompt
call remains the same).

Nossks force-pushed the fix/schema-validation branch from c9c5be9 to f1908da on January 29, 2026 17:08

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@GSoC25/NEF/NEF.py`:
- Around line 348-351: Replace the incorrect parameter name response_scheme with
response_schema when constructing the types.GenerateContentConfig so the
structured Pydantic output (DisambiguationResult) is honored; update the
generation_config assignment in NEF.py (the types.GenerateContentConfig call
that sets response_mime_type and the Pydantic class) to use
response_schema=DisambiguationResult.

Nossks force-pushed the fix/schema-validation branch from f1908da to 606d7e9 on January 29, 2026 17:16

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@GSoC25/NEF/NEF.py`:
- Around line 371-375: Replace the broad except block that catches Exception
around the validation logic with a specific except for ValidationError and
ensure you use the existing _safe_print/verbose pattern for logging instead of
print; for example, in the try/except that currently sets pred_uri=None,
s_idx=0, o_idx=0, change the handler to except ValidationError as e and call
_safe_print(f"validation error: {e}", verbose) (or the module's standard logging
helper), then preserve setting pred_uri, s_idx, o_idx as before. Also add an
import for ValidationError alongside DisambiguationResult in the module imports
so the exception type is available.
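
A minimal sketch of the narrowed handler follows, assuming Pydantic v2. The `_safe_print` stand-in is hypothetical (the real helper's signature is not shown in this thread), as are the `parse_response` wrapper and the schema's field types; the fallback values `None, 0, 0` are the ones named in the comment.

```python
from typing import Optional, Tuple

from pydantic import BaseModel, ValidationError


class DisambiguationResult(BaseModel):
    # Field types are assumed; only the names appear in the walkthrough.
    subject_index: int
    predicate_uri: str
    object_index: int


def _safe_print(msg: str, verbose: bool) -> None:
    """Stand-in for the module's helper; the real signature may differ."""
    if verbose:
        print(msg)


def parse_response(text: str, verbose: bool = False) -> Tuple[Optional[str], int, int]:
    """Return (pred_uri, s_idx, o_idx), falling back to safe defaults."""
    try:
        data = DisambiguationResult.model_validate_json(text)
    except ValidationError as e:
        # Only schema/JSON failures are handled; other bugs still surface.
        _safe_print(f"validation error: {e}", verbose)
        return None, 0, 0
    return data.predicate_uri, data.subject_index, data.object_index


ok = parse_response('{"subject_index": 2, "predicate_uri": "dbo:author", "object_index": 1}')
bad = parse_response("this is not JSON")
print(ok, bad)
```

Catching only `ValidationError` keeps unrelated failures such as network errors or programming mistakes visible, which a bare `except Exception` would hide behind the same fallback.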

Nossks force-pushed the fix/schema-validation branch from 606d7e9 to ea527ae on January 29, 2026 17:30

Nossks commented Feb 3, 2026

Hi there, just a quick update:

All CI/CD checks are now passing (I resolved the import issues for both CLI and Package modes).

The new DisambiguationResult schema is successfully catching and rejecting hallucinations locally.

I know this touches the core generation logic, so I am happy to make adjustments if you prefer a different fallback strategy. Ready for review whenever you have time.
