Open
Description
Problem
The pipeline's 12 LLM-extracted entity types (Concept, Challenge, Artifact, etc.) do not include a description property in the extraction schema (extraction/schema.py). As a result:
- No entities have descriptions: The LLMEntityRelExtractor only extracts properties defined in the schema. Without a description property, the LLM extracts name, display_name, and type-specific fields but never a description.
- EntitySummarizer is a no-op: The summarizer (postprocessing/entity_summarizer.py) consolidates fragmented descriptions into coherent summaries, but finds zero entities to process because the description property doesn't exist in the database.
- Community summaries are shallower: CommunitySummarizer attempts to read n.description for richer community summaries but falls back to names and labels only, producing less informative results.
- RAG retrieval misses semantic context: Entity matching currently relies on name only. A description like "the practice of linking requirements to downstream artifacts to ensure completeness and enable impact analysis" would give retrievers far richer semantic context.
Evidence from staging pipeline run (2026-02-24)
WARNING:neo4j.notifications: The property `description` does not exist in database `neo4j`.
2026-02-24 16:29:46 [info] No entities with fragmented descriptions found
Summarized 0 entities
The Neo4j warning confirms that no entity in the graph has a description property. The EntitySummarizer correctly returns immediately with zero work.
Verification query:
MATCH (n:__Entity__) WHERE n.description IS NOT NULL RETURN count(n)
// Returns 0
Root Cause
In src/graphrag_kg_pipeline/extraction/schema.py, the NODE_TYPES dict defines properties for each entity type. None of the 12 types include a description property:
- Concept: name, display_name, definition, aliases
- Challenge: name, display_name, severity
- Artifact: name, display_name, artifact_type
- Bestpractice: name, display_name, rationale
- Processstage: name, display_name, sequence
- Role: name, display_name, responsibilities
- Standard: name, display_name, organization, domain
- Tool: name, display_name, vendor, tool_type
- Methodology: name, display_name, approach
- Industry: name, display_name, regulated
- Organization: name, display_name, organization_type, domain
- Outcome: name, display_name, outcome_type
The SimpleKGPipeline LLM extraction prompt only asks the LLM to extract properties listed in the schema, so descriptions are never produced.
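The gap can be checked mechanically against the schema. The sketch below is illustrative only: it assumes NODE_TYPES maps each label to a dict holding a "properties" mapping, which may not match the real layout in extraction/schema.py, and the helper name types_missing_property is hypothetical. Only two of the 12 types are reproduced, with their property specs left empty.

```python
# Hypothetical, abbreviated stand-in for NODE_TYPES in extraction/schema.py.
# The real dict's layout may differ; only the property names come from the issue.
NODE_TYPES = {
    "Concept": {"properties": {"name": {}, "display_name": {}, "definition": {}, "aliases": {}}},
    "Challenge": {"properties": {"name": {}, "display_name": {}, "severity": {}}},
}


def types_missing_property(node_types: dict, prop: str) -> list[str]:
    """Return the labels whose property mapping lacks `prop`, sorted by name."""
    return sorted(
        label
        for label, spec in node_types.items()
        if prop not in spec.get("properties", {})
    )


missing = types_missing_property(NODE_TYPES, "description")
print(missing)
```

Run against the real schema, a check of this kind should list all 12 types until the property is added, and none afterwards.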
Proposed Solution
Add a description property to all 12 entity types in NODE_TYPES:
"description": {
"type": "STRING",
"required": False,
"description": "One-sentence description of this entity in the context of requirements management",
},
Downstream effects (already implemented, will activate automatically)
- EntitySummarizer: Will find entities with multi-fragment descriptions (>200 chars from multi-chunk extraction) and consolidate them via LLM into clean 1-3 sentence summaries. Currently implemented and tested but has zero work to do.
- CommunitySummarizer: Already reads n.description in its community member query. Will produce richer community summaries without code changes.
- API repo retrieval: text2cypher.py and entity search will benefit from richer entity context. No changes needed in the API repo.
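As a rough illustration of the EntitySummarizer effect, its selection step might look like the following. This is a sketch under stated assumptions, not the actual EntitySummarizer code: the 200-character threshold comes from the figure quoted in this issue, while the function name select_fragmented, the entity dict shape, and the strict greater-than comparison are hypothetical.

```python
FRAGMENT_THRESHOLD = 200  # characters; taken from the ">200 chars" figure in the issue


def select_fragmented(entities: list[dict], threshold: int = FRAGMENT_THRESHOLD) -> list[dict]:
    """Return entities whose description is long enough to need consolidation."""
    return [
        e for e in entities
        # A missing or None description counts as an empty string.
        if len(e.get("description") or "") > threshold
    ]


entities = [
    {"name": "traceability", "description": "x" * 250},  # long, multi-chunk text
    {"name": "baseline", "description": "A short, single-sentence description."},
    {"name": "stakeholder"},  # no description at all
]

candidates = select_fragmented(entities)
print([e["name"] for e in candidates])
```

With the current schema every entity looks like the third case, which is consistent with the summarizer reporting zero entities.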
Cost and runtime impact
- Extraction: ~10-20% more tokens per article (LLM must extract an additional property per entity). Estimated additional cost: ~$1-2 on a full pipeline run.
- EntitySummarizer: Will now make LLM calls for entities with fragmented descriptions. Estimated: ~$0.50-1.00 (gpt-4o, ~100 entities with fragments).
- Total additional cost per full run: ~$1.50-3.00 (on top of existing ~$9-17).
- No additional runtime for community embeddings or vector indexes — descriptions don't affect those.
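The totals above follow from adding the two component ranges. A minimal sketch of that arithmetic, using only the estimates quoted in this issue (the helper name add_ranges is hypothetical):

```python
def add_ranges(*ranges: tuple[float, float]) -> tuple[float, float]:
    """Sum (low, high) cost ranges component-wise."""
    return (sum(r[0] for r in ranges), sum(r[1] for r in ranges))


extraction = (1.00, 2.00)   # extra extraction tokens, USD
summarizer = (0.50, 1.00)   # EntitySummarizer LLM calls, USD
existing = (9.00, 17.00)    # current full-run cost, USD

additional = add_ranges(extraction, summarizer)
new_total = add_ranges(existing, additional)

print(f"additional: ${additional[0]:.2f}-{additional[1]:.2f}")
print(f"new total:  ${new_total[0]:.2f}-{new_total[1]:.2f}")
```

Under these estimates a full run would cost roughly $10.50-20.00.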
Implementation steps
- Add description property to all 12 entity types in extraction/schema.py
- Verify EntitySummarizer activates on a staging run (should find entities with >200 char descriptions)
- Compare community summary quality with/without entity descriptions
- Update tests if schema property counts change in assertions
- Run full staging pipeline to validate end-to-end
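The schema change and its effect on count-based test assertions could be sketched as follows. The property spec is copied from the proposal in this issue; everything else is an assumption: the "properties" layout of NODE_TYPES, the abbreviated two-type stand-in, and the helper name with_description are hypothetical. In practice the property may simply be written into each type by hand.

```python
import copy

# Property spec copied from the proposal in this issue.
DESCRIPTION_PROPERTY = {
    "type": "STRING",
    "required": False,
    "description": (
        "One-sentence description of this entity in the context "
        "of requirements management"
    ),
}

# Hypothetical, abbreviated stand-in for NODE_TYPES; the real layout may differ.
NODE_TYPES = {
    "Concept": {"properties": {"name": {}, "display_name": {}, "definition": {}, "aliases": {}}},
    "Tool": {"properties": {"name": {}, "display_name": {}, "vendor": {}, "tool_type": {}}},
}


def with_description(node_types: dict) -> dict:
    """Return a deep copy of `node_types` with `description` on every type."""
    updated = copy.deepcopy(node_types)
    for spec in updated.values():
        # setdefault leaves an existing description spec untouched.
        spec.setdefault("properties", {}).setdefault(
            "description", dict(DESCRIPTION_PROPERTY)
        )
    return updated


updated = with_description(NODE_TYPES)

# Each type gains exactly one property, so count-based assertions shift by one.
for label, spec in updated.items():
    before = len(NODE_TYPES[label]["properties"])
    after = len(spec["properties"])
    print(f"{label}: {before} -> {after} properties")
```

Any test that asserts an exact number of properties per type would need its expected value raised by one.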
What NOT to change
- EntitySummarizer code: Already correctly implemented, just needs data to work with
- CommunitySummarizer code: Already reads descriptions with graceful fallback
- Extraction prompts: SimpleKGPipeline auto-generates prompts from the schema; adding the property is sufficient
- Validation queries: No description-related checks currently exist
Context
- Concept has a definition property (specific to Concept), but description is a general-purpose field for all entity types
- The definition property on Concept serves a different purpose: it captures formal definitions from the glossary, not contextual descriptions from extraction
- This was identified during the first staging pipeline run against graphrag-api-db-stage (local Neo4j Desktop instance)
Labels
Enhancement, Pipeline