Process Markdown into Graph Structure by dstengle · Pull Request #64 · dstengle/knowledgebase-processor

dstengle · 2025-11-05T19:01:31Z

feat: Add markdown structure processing to graph
Description
Summary
Implements comprehensive markdown structure processing functionality that converts markdown elements (headings, sections, lists, tables, code blocks, and blockquotes) into RDF graph entities with proper relationships and metadata.

Key Changes

KB Entity Models (src/knowledgebase_processor/models/kb_entities.py)
Added 7 new Pydantic models for markdown structure:

KbHeading - Markdown headings (h1-h6) with level and hierarchy
KbSection - Content sections with heading relationships
KbList - Ordered/unordered lists with item counts
KbListItem - Individual list items with parent relationships
KbTable - Tables with row/column counts and headers
KbCodeBlock - Code blocks with language and line count
KbBlockquote - Blockquotes with nesting levels
All models include RDF property mappings, position tracking, and Schema.org types.
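
To make the shape of these models concrete, here is a minimal sketch of two of them. It uses standard-library dataclasses as a stand-in for the PR's Pydantic models so that it runs without dependencies. The class names carry a `Sketch` suffix, and every field name (`kb_id`, `level`, `text`, `language`, `line_count`, `start_line`, `end_line`) is an assumption inferred from the descriptions above, not copied from kb_entities.py. RDF property mappings and Schema.org types are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KbHeadingSketch:
    """Illustrative stand-in for KbHeading; field names are assumptions."""

    kb_id: str
    level: int       # 1-6, corresponding to h1-h6
    text: str
    start_line: int  # position tracking, 1-based
    end_line: int

    def __post_init__(self) -> None:
        # Reject levels that have no markdown heading equivalent.
        if not 1 <= self.level <= 6:
            raise ValueError(f"heading level must be 1-6, got {self.level}")


@dataclass(frozen=True)
class KbCodeBlockSketch:
    """Illustrative stand-in for KbCodeBlock; field names are assumptions."""

    kb_id: str
    language: Optional[str]  # None when the fence has no info string
    line_count: int
    start_line: int
    end_line: int


heading = KbHeadingSketch(
    kb_id="doc/heading/1", level=2, text="Key Changes", start_line=1, end_line=1
)
```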

Markdown Structure Processor (src/knowledgebase_processor/processor/markdown_structure_processor.py)
Converts markdown elements to KB entities
Maintains parent-child relationships (heading↔section, list↔items)
Tracks position information (start/end line numbers)
Uses deterministic ID generation based on position for reproducibility
Provides statistics on extracted structure
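
The combination of position tracking and parent-child relationships can be illustrated with headings alone. The sketch below is not the PR's MarkdownStructureProcessor; `HeadingNode` and `extract_headings` are hypothetical names, and the approach (a regex for ATX headings plus a stack of open headings) is one plausible way to derive the hierarchy. It skips backtick-fenced code blocks but does not handle `~~~` fences, Setext headings, or indented headings.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE = "`" * 3  # backtick code fence marker


@dataclass
class HeadingNode:
    level: int
    text: str
    line: int                   # 1-based line number of the heading
    parent_line: Optional[int]  # line of the enclosing heading, if any


def extract_headings(markdown: str) -> List[HeadingNode]:
    """Return headings in document order, each linked to its parent heading."""
    nodes: List[HeadingNode] = []
    stack: List[HeadingNode] = []  # currently open ancestors, outermost first
    in_fence = False
    for lineno, raw in enumerate(markdown.splitlines(), start=1):
        if raw.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # "# comment" lines inside code are not headings
        match = HEADING_RE.match(raw)
        if match is None:
            continue
        level = len(match.group(1))
        # Close every open heading at the same or a deeper level.
        while stack and stack[-1].level >= level:
            stack.pop()
        parent_line = stack[-1].line if stack else None
        node = HeadingNode(level, match.group(2), lineno, parent_line)
        nodes.append(node)
        stack.append(node)
    return nodes
```

Because each node records its own line and its parent's line, the same pass yields both the position metadata and the hierarchy that the graph entities need.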

Integration (src/knowledgebase_processor/processor/entity_processor.py)
Integrated into main processing pipeline
Automatically extracts structure from all documents
Processes alongside todos, wikilinks, and named entities

ID Generation (src/knowledgebase_processor/utils/id_generator.py)
Added generate_markdown_element_id() method
Deterministic URIs based on element type and position
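
The PR does not show the signature or URI layout of generate_markdown_element_id(), so the following is only a sketch of the idea under assumed parameters (`document_id`, `element_type`, `start_line`) and an assumed path-style layout. The function name carries a `_sketch` suffix to keep it distinct from the real method. What matters is that the result depends only on the element type and its position, so reprocessing the same document yields the same identifier.

```python
def generate_markdown_element_id_sketch(
    document_id: str, element_type: str, start_line: int
) -> str:
    """Build a position-based identifier (hypothetical layout).

    Unlike a random UUID, the result is a pure function of its inputs,
    so expected outputs in spec tests stay stable across runs.
    """
    base = document_id.rstrip("/")
    return f"{base}/{element_type.lower()}/{start_line}"


first = generate_markdown_element_id_sketch("http://example.org/kb/doc/notes", "Heading", 12)
second = generate_markdown_element_id_sketch("http://example.org/kb/doc/notes", "Heading", 12)
```

One trade-off of a position-based scheme like this is that inserting lines above an element changes its ID. That is acceptable for reproducible test output but worth keeping in mind if IDs are expected to survive edits.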

Specification-Based Tests
Created 5 new test cases in specs/test_cases/:
markdown_structure_01_single_heading
markdown_structure_02_code_block
markdown_structure_03_list
markdown_structure_04_table
markdown_structure_05_blockquote
Regenerated all 60 existing spec test outputs to include new entities
Added scripts/regenerate_spec_outputs.py utility for batch updates
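
For readers unfamiliar with the declarative layout, the sketch below shows how a runner or a regeneration script might discover test cases. It assumes each case is a directory directly under the test-case root containing the input.md and expected_output.ttl files named in the second commit message; `discover_spec_cases` is a hypothetical helper, not a function from this repository, and the actual logic of scripts/regenerate_spec_outputs.py is not shown in the PR.

```python
from pathlib import Path
from typing import List, Tuple


def discover_spec_cases(root: Path) -> List[Tuple[str, Path, Path]]:
    """Return (case name, input path, expected-output path) for complete cases.

    Assumes one directory per case; directories missing either file are skipped.
    """
    cases: List[Tuple[str, Path, Path]] = []
    for case_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        input_md = case_dir / "input.md"
        expected = case_dir / "expected_output.ttl"
        if input_md.is_file() and expected.is_file():
            cases.append((case_dir.name, input_md, expected))
    return cases
```

A regeneration script could iterate over the same list, run the processor on each input.md, and overwrite expected_output.ttl, which is why deterministic IDs are a prerequisite for this workflow.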
Impact
All markdown structure elements are now fully represented in the knowledge graph with:

✅ Proper RDF types and Schema.org mappings
✅ Position metadata (start/end line numbers)
✅ Parent-child relationships
✅ Queryable via SPARQL
✅ Deterministic, reproducible entity IDs
Test Plan

All 61 specification tests pass

RDF converter handles all new entity types

Deterministic ID generation ensures test reproducibility

Integration tests verify end-to-end processing

Spec tests use declarative approach per project standards
Testing Results
============================= test session starts ==============================
collected 61 items

tests/test_specifications.py::test_specifications PASSED x60
tests/test_specifications.py::test_test_cases_directory_exists PASSED

===================== 61 passed, 31 warnings in 1.51s =========================

Implements comprehensive markdown structure processing functionality that converts markdown elements (headings, sections, lists, tables, code blocks, and blockquotes) into RDF graph entities.

Changes:
- Add KB entity models for markdown structure elements (KbHeading, KbSection, KbList, KbListItem, KbTable, KbCodeBlock, KbBlockquote)
- Create MarkdownStructureProcessor to convert markdown elements to KB entities with proper relationships
- Integrate MarkdownStructureProcessor into EntityProcessor pipeline
- Add generate_markdown_element_id method to EntityIdGenerator
- Add comprehensive test coverage for all markdown structure types

All markdown structure elements are now processed into the RDF graph with proper metadata including position information, nesting levels, and parent-child relationships.

Tests: All 9 tests pass

Converts markdown structure processing tests to follow the project's specification-driven testing methodology instead of unit tests.

Changes:
- Remove unit test file from tests/processor directory
- Create 5 new specification test cases for markdown structure:
  - markdown_structure_01_single_heading
  - markdown_structure_02_code_block
  - markdown_structure_03_list
  - markdown_structure_04_table
  - markdown_structure_05_blockquote
- Update markdown structure processor to use deterministic IDs based on position instead of random UUIDs for sections, lists, tables, and code blocks
- Regenerate all 60 spec test expected outputs to include new markdown structure entities in RDF graphs
- Add regenerate_spec_outputs.py script for batch updating test expectations when processor output changes

Test Results: All 61 specification tests pass

This aligns with the project's specification-driven testing approach where behavior is captured in declarative artifacts (input.md and expected_output.ttl files) rather than imperative Python test code.

claude added 2 commits November 5, 2025 14:45

dstengle merged commit 04406d5 into main Nov 5, 2025
2 checks passed

dstengle deleted the claude/markdown-to-graph-processor-011CUpuvxAn2hbzC5SaNFv8x branch November 5, 2025 19:11
