feat: add entity description property to extraction schema #49

@arthurfantaci

Problem

The pipeline's 12 LLM-extracted entity types (Concept, Challenge, Artifact, etc.) do not include a description property in the extraction schema (extraction/schema.py). As a result:

  1. No entities have descriptions — The LLMEntityRelExtractor only extracts properties defined in the schema. Without a description property, the LLM extracts name, display_name, and type-specific fields but never a description.
  2. EntitySummarizer is a no-op — The summarizer (postprocessing/entity_summarizer.py) consolidates fragmented descriptions into coherent summaries, but finds zero entities to process because the description property doesn't exist in the database.
  3. Community summaries are shallower: CommunitySummarizer attempts to read n.description for richer community summaries but falls back to names and labels only, producing less informative results.
  4. RAG retrieval misses semantic context — Entity matching currently relies on name only. A description like "the practice of linking requirements to downstream artifacts to ensure completeness and enable impact analysis" would give retrievers far richer semantic context.

Evidence from staging pipeline run (2026-02-24)

WARNING:neo4j.notifications: The property `description` does not exist in database `neo4j`.
2026-02-24 16:29:46 [info] No entities with fragmented descriptions found
    Summarized 0 entities

The Neo4j warning confirms that no entity in the graph has a description property. The EntitySummarizer correctly returns immediately with zero work.

Verification query:

MATCH (n:__Entity__) WHERE n.description IS NOT NULL RETURN count(n)
// Returns 0

Root Cause

In src/graphrag_kg_pipeline/extraction/schema.py, the NODE_TYPES dict defines properties for each entity type. None of the 12 types include a description property:

  • Concept: name, display_name, definition, aliases
  • Challenge: name, display_name, severity
  • Artifact: name, display_name, artifact_type
  • Bestpractice: name, display_name, rationale
  • Processstage: name, display_name, sequence
  • Role: name, display_name, responsibilities
  • Standard: name, display_name, organization, domain
  • Tool: name, display_name, vendor, tool_type
  • Methodology: name, display_name, approach
  • Industry: name, display_name, regulated
  • Organization: name, display_name, organization_type, domain
  • Outcome: name, display_name, outcome_type

The SimpleKGPipeline LLM extraction prompt only asks the LLM to extract properties listed in the schema, so descriptions are never produced.
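
One way to confirm the gap is to audit the schema for a given property. The sketch below assumes each NODE_TYPES entry maps a type name to a dict with a "properties" mapping. That shape, the trimmed NODE_TYPES_SAMPLE stand-in, and the helper name types_missing_property are assumptions for illustration, not the actual contents of extraction/schema.py.

```python
from typing import Any, Mapping

# Hypothetical, trimmed stand-in for NODE_TYPES; the real dict lives in
# extraction/schema.py and its exact shape may differ.
NODE_TYPES_SAMPLE: dict[str, dict[str, Any]] = {
    "Concept": {"properties": {"name": {}, "display_name": {}, "definition": {}, "aliases": {}}},
    "Challenge": {"properties": {"name": {}, "display_name": {}, "severity": {}}},
    "Tool": {"properties": {"name": {}, "display_name": {}, "vendor": {}, "tool_type": {}}},
}


def types_missing_property(node_types: Mapping[str, Mapping[str, Any]], prop: str) -> list[str]:
    """Return the sorted names of entity types whose properties lack `prop`."""
    return sorted(
        name
        for name, spec in node_types.items()
        if prop not in spec.get("properties", {})
    )


missing = types_missing_property(NODE_TYPES_SAMPLE, "description")
print(missing)
```

Run against the real NODE_TYPES (if it has this shape), the same call should list all 12 types before the change and return an empty list afterwards.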

Proposed Solution

Add a description property to all 12 entity types in NODE_TYPES:

"description": {
    "type": "STRING",
    "required": False,
    "description": "One-sentence description of this entity in the context of requirements management",
},
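
Rather than pasting the snippet into 12 places by hand, the property could be applied in one pass. This is a minimal sketch, assuming each NODE_TYPES entry has the shape {"properties": {...}}; the helper name add_description_property and the two-type sample schema are hypothetical.

```python
import copy
from typing import Any

# Property definition copied from the proposal above.
DESCRIPTION_PROPERTY: dict[str, Any] = {
    "type": "STRING",
    "required": False,
    "description": (
        "One-sentence description of this entity in the context of "
        "requirements management"
    ),
}


def add_description_property(
    node_types: dict[str, dict[str, Any]],
) -> dict[str, dict[str, Any]]:
    """Return a deep copy of `node_types` with `description` on every type.

    Assumes each entry looks like {"properties": {...}}, which is a guess at
    the real NODE_TYPES layout. An existing `description` entry is kept as-is.
    """
    updated = copy.deepcopy(node_types)
    for spec in updated.values():
        props = spec.setdefault("properties", {})
        props.setdefault("description", dict(DESCRIPTION_PROPERTY))
    return updated


# Hypothetical two-type schema used only to exercise the helper.
sample = {
    "Role": {"properties": {"name": {}, "display_name": {}, "responsibilities": {}}},
    "Outcome": {"properties": {"name": {}, "display_name": {}, "outcome_type": {}}},
}
patched = add_description_property(sample)
```

Returning a copy keeps the input untouched, so before and after schemas can be compared in a test. Editing the 12 entries directly in schema.py is equally valid and may be easier to review.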

Downstream effects (already implemented, will activate automatically)

  1. EntitySummarizer — Will find entities with multi-fragment descriptions (>200 chars from multi-chunk extraction) and consolidate them via LLM into clean 1-3 sentence summaries. Currently implemented and tested but has zero work to do.
  2. CommunitySummarizer — Already reads n.description in its community member query. Will produce richer community summaries without code changes.
  3. API repo retrieval: text2cypher.py and entity search will benefit from richer entity context. No changes needed in the API repo.
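
The >200-character threshold mentioned for EntitySummarizer can be illustrated with a small filter. This is only a sketch of the selection idea: the actual logic in postprocessing/entity_summarizer.py is not shown in this issue, and the function name select_fragmented and the entity dict shape are assumptions.

```python
from typing import Any, Iterable, Mapping

FRAGMENT_THRESHOLD = 200  # characters, per the ">200 chars" figure above


def select_fragmented(
    entities: Iterable[Mapping[str, Any]],
    threshold: int = FRAGMENT_THRESHOLD,
) -> list[str]:
    """Return names of entities whose description exceeds `threshold` chars.

    Entities with a missing or empty description are skipped, which is why a
    graph without any description property yields zero candidates.
    """
    return [
        e["name"]
        for e in entities
        if e.get("description") and len(e["description"]) > threshold
    ]


# Hypothetical entities: one without a description, one short, one long.
entities = [
    {"name": "traceability", "description": None},
    {"name": "baseline", "description": "A frozen set of requirements."},
    {"name": "impact analysis", "description": "x" * 250},
]
candidates = select_fragmented(entities)
```

With the current schema every entity falls into the first case, which matches the "Summarized 0 entities" log line from the staging run.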

Cost and runtime impact

  • Extraction: ~10-20% more tokens per article (LLM must extract an additional property per entity). Estimated additional cost: ~$1-2 on a full pipeline run.
  • EntitySummarizer: Will now make LLM calls for entities with fragmented descriptions. Estimated: ~$0.50-1.00 (gpt-4o, ~100 entities with fragments).
  • Total additional cost per full run: ~$1.50-3.00 (on top of existing ~$9-17).
  • No additional runtime for community embeddings or vector indexes — descriptions don't affect those.
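
The totals above follow from adding the low and high ends of each estimate. A short check of that arithmetic, using only the figures quoted in this issue:

```python
# Dollar ranges (low, high) taken from the estimates above.
extraction = (1.00, 2.00)
summarizer = (0.50, 1.00)
baseline = (9.00, 17.00)


def add_ranges(a: tuple[float, float], b: tuple[float, float]) -> tuple[float, float]:
    """Add two (low, high) ranges end by end."""
    return (a[0] + b[0], a[1] + b[1])


additional = add_ranges(extraction, summarizer)
new_total = add_ranges(baseline, additional)

print(f"additional: ${additional[0]:.2f}-{additional[1]:.2f}")  # additional: $1.50-3.00
print(f"new total:  ${new_total[0]:.2f}-{new_total[1]:.2f}")    # new total:  $10.50-20.00
```

So a full run would be expected to land at roughly $10.50-20.00, assuming the existing ~$9-17 baseline holds.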

Implementation steps

  1. Add description property to all 12 entity types in extraction/schema.py
  2. Verify EntitySummarizer activates on a staging run (should find entities with >200 char descriptions)
  3. Compare community summary quality with/without entity descriptions
  4. Update tests if schema property counts change in assertions
  5. Run full staging pipeline to validate end-to-end

What NOT to change

  • EntitySummarizer code — Already correctly implemented, just needs data to work with
  • CommunitySummarizer code — Already reads descriptions with graceful fallback
  • Extraction prompts: SimpleKGPipeline auto-generates prompts from the schema; adding the property is sufficient
  • Validation queries — No description-related checks currently exist

Context

  • Concept has a definition property (specific to Concept), but description is a general-purpose field for all entity types
  • The definition property on Concept serves a different purpose — it captures formal definitions from the glossary, not contextual descriptions from extraction
  • This was identified during the first staging pipeline run against graphrag-api-db-stage (local Neo4j Desktop instance)

Labels

Enhancement, Pipeline
