feat: Add Logseq integration with flat export structure and AI-powered metadata enhancement #2
Open
victorespigares wants to merge 21 commits into liketheduck:main
- Add PDF export functionality with invisible OCR text layer
- Fix coordinate conversion for Qwen3-VL (0-1000 normalized coordinates)
- Add debug mode for PDF coordinate visualization
- Update README with PDF export feature
- Add comprehensive changelog
- Clean up development files for public release
Technical changes:
- Fix fundamental coordinate system issue in pdf_exporter.py
- Match coordinate conversion logic with note_processor.py
- Add proper font sizing and text positioning
- Add debug-pdf-bbox.sh script for troubleshooting
This resolves the PDF bounding box alignment issue and provides pixel-perfect searchable PDFs that match the original handwritten content.
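As a rough illustration of the coordinate fix described above, the sketch below maps a box on a 0-1000 normalized grid to PDF points. It assumes the model reports boxes as (x1, y1, x2, y2) with a top-left origin, and that the PDF page uses the default bottom-left origin; the function name `bbox_to_pdf_points` is hypothetical and is not taken from pdf_exporter.py.

```python
def bbox_to_pdf_points(bbox, page_width_pt, page_height_pt, scale=1000):
    """Map an (x1, y1, x2, y2) box on a 0..scale grid to PDF points.

    Assumption: the input uses a top-left origin (y grows downward),
    while PDF user space uses a bottom-left origin (y grows upward).
    """
    x1, y1, x2, y2 = bbox
    left = x1 * page_width_pt / scale
    right = x2 * page_width_pt / scale
    # Flip the vertical axis: the top edge of the box has the larger PDF y.
    top = page_height_pt - y1 * page_height_pt / scale
    bottom = page_height_pt - y2 * page_height_pt / scale
    return left, bottom, right, top


if __name__ == "__main__":
    # A box in the upper-left area of a 600 x 800 pt page.
    print(bbox_to_pdf_points((100, 200, 500, 300), 600, 800))
```

Treating the values as pixel coordinates instead of a 0-1000 grid would shift and scale every box, which is consistent with the misalignment the commit describes.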
- Add supernote-monitor.sh: Full-featured app monitor with detailed logging
- Add quick-launch.sh: Simple launcher with minimal overhead
- Update README.md with documentation for both scripts
- Features include:
  * Auto-launch Supernote app
  * Monitor app closure
  * Automatic OCR processing when app closes
  * Built-in delay for file sync completion
  * Graceful error handling and Ctrl+C support
This provides a seamless workflow: use Supernote normally, close the app, and get OCR processing automatically.
Major refactor of Logseq export functionality:
NEW FEATURES:
- Flat file structure: All .md files in pages/ (no subdirectories)
- Hierarchical properties: source/path/tags preserve original structure
- Conflict resolution: Unique names based on path (ProyectoA_Cliente1_nota1.md)
- Property merging: Intelligent merge with existing properties
CORE FUNCTIONS:
- build_flat_filename_from_path(): Generate flat filenames from paths
- build_page_properties_from_path(): Create hierarchical properties
- merge_properties_with_content(): Merge properties without breaking existing
TRANSFORMATION EXAMPLES:
- supernote_export/ProyectoA/Cliente1/nota1.md
  → logseq/pages/ProyectoA_Cliente1_nota1.md
  → Properties: source:: Supernote, path:: Supernote/ProyectoA/Cliente1
  → Tags: [[Supernote]], [[Supernote/ProyectoA]], [[Supernote/ProyectoA/Cliente1]]
BENEFITS:
- Easier backup/sync with flat structure
- Powerful search through Logseq tag namespaces
- Preserved hierarchy in properties
- Automatic conflict resolution
- Native Logseq property format
TECHNICAL:
- Added comprehensive test suite (test_logseq_flat.py)
- Full backwards compatibility
- Detailed documentation in docs/LOGSEQ_FLAT_EXPORT.md
- Property format: key:: value (not YAML)
- Tag namespaces: [[Supernote/ProyectoA/Cliente1]]
This addresses the user's requirement for flat structure while maintaining hierarchical information through Logseq-native properties and tags.
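The transformation example above can be sketched as follows. The two function names come from the commit message, but their signatures and bodies here are assumptions: the sketch takes a path relative to the export root (so without the leading supernote_export/) and a hypothetical `root` argument for the top-level namespace.

```python
from pathlib import PurePosixPath


def build_flat_filename_from_path(relative_path):
    """Join every path component with '_' so the file can live in a flat pages/ folder."""
    return "_".join(PurePosixPath(relative_path).parts)


def build_page_properties_from_path(relative_path, root="Supernote"):
    """Rebuild the folder hierarchy as Logseq properties and namespaced tags."""
    folders = PurePosixPath(relative_path).parent.parts
    namespaces = [root]
    for folder in folders:
        # Each level extends the previous namespace: Supernote, Supernote/A, Supernote/A/B ...
        namespaces.append(f"{namespaces[-1]}/{folder}")
    return {
        "source": root,
        "path": namespaces[-1],
        "tags": ", ".join(f"[[{ns}]]" for ns in namespaces),
    }


if __name__ == "__main__":
    path = "ProyectoA/Cliente1/nota1.md"
    print(build_flat_filename_from_path(path))
    for key, value in build_page_properties_from_path(path).items():
        print(f"{key}:: {value}")
```

Because the flat name is derived from the full path, two notes called nota1.md in different folders end up with different filenames, which is the conflict-resolution behavior the commit lists.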
BREAKING CHANGE: Configuration now loaded from .env.local instead of hardcoded
CHANGES:
- Move all environment variables from run-native.sh to .env.local
- Update run-native.sh to load and validate .env.local
- Add configuration display on startup
- Improve error handling for missing .env.local
BENEFITS:
- Centralized configuration in .env.local
- Easier to maintain and update settings
- Better visibility of current configuration
- Separation of code and configuration
- .env.local remains private (gitignored)
VARIABLES MOVED:
- SUPERNOTE_DATA_PATH, OCR_API_URL, STORAGE_MODE
- OCR_TXT_EXPORT_ENABLED, OCR_TXT_EXPORT_PATH
- OCR_PDF_EXPORT_ENABLED, OCR_PDF_EXPORT_PATH
- LOGSEQ_EXPORT_ENABLED, LOGSEQ_PAGES_PATH, LOGSEQ_ASSETS_PATH
- AI_TEXT_CLEANUP_ENABLED, PDF_DEBUG_MODE
- All other processing settings
USAGE:
1. Edit .env.local with your settings
2. Run ./run-native.sh (will load .env.local automatically)
3. See configuration summary on startup
MINIMAL IMPLEMENTATION - Cherry on top feature:
NEW FUNCTIONALITY:
- Quick visual content detection before OCR processing
- Adds [📸 Dibujo] markers when drawings detected
- Fails silently if detection fails (no impact on core OCR)
IMPLEMENTATION:
- detect_visual_content(): Lightweight yes/no detection
- visual_detection prompt: Simple 1-word response
- Integration in main.py: Prepend marker to OCR text
- Adds text block for proper positioning in PDFs
BENEFITS:
- Know where you had drawings in your notes
- Zero impact on core OCR functionality
- Minimal overhead (~1-2 seconds per page)
- Clean failure handling
USAGE:
- Automatic during OCR processing
- No configuration needed
- Markers appear in text: [📸 Dibujo] original text...
TECHNICAL:
- Uses existing Qwen2.5-VL model with new prompt
- 10 token limit, 30s timeout for speed
- Keyword detection: yes/drawing/diagram/sketch/image/picture/chart/graph
- Silent fallback if detection fails
DESIGN PHILOSOPHY:
- Maximum value, minimum complexity
- Does not affect core OCR workflow
- Simple and reliable implementation
- Product manager approved feature scope
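A minimal sketch of the keyword check and silent fallback described above. The keyword list and the marker are from the commit message; the helper names `response_indicates_drawing` and `mark_if_drawing`, and the idea of passing the model call in as a `detect` callable, are assumptions made to keep the example self-contained.

```python
DRAWING_KEYWORDS = ("yes", "drawing", "diagram", "sketch", "image", "picture", "chart", "graph")
MARKER = "[📸 Dibujo]"


def response_indicates_drawing(response):
    """Return True if the short model reply contains any of the listed keywords."""
    lowered = response.lower()
    return any(keyword in lowered for keyword in DRAWING_KEYWORDS)


def mark_if_drawing(ocr_text, detect):
    """Prepend the marker when `detect()` reports a drawing; never raise.

    `detect` is a hypothetical callable standing in for the vision-model request.
    """
    try:
        has_drawing = response_indicates_drawing(detect())
    except Exception:
        # Silent fallback: a failed detection must not affect the OCR text.
        return ocr_text
    return f"{MARKER} {ocr_text}" if has_drawing else ocr_text


if __name__ == "__main__":
    print(mark_if_drawing("original text...", lambda: "Yes"))
    print(mark_if_drawing("original text...", lambda: "No"))
```

A plain substring check like this would also fire on a reply such as "no drawing", so it relies on the prompt really producing a one-word answer.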
PROBLEM: PDFs not being copied to Logseq assets, .note files appearing instead
CHANGES:
- Add detailed logging for PDF source path and asset path
- Log PDF existence and file size
- Fix PDF generation for Logseq when PDF export disabled
- Use temporary directory for PDF generation to avoid conflicts
- Add logging in main.py for pdf_path passing
DIAGNOSTIC INFO:
- Log PDF source path, existence, and size
- Log PDF asset path and copy success
- Log pdf_path being passed to Logseq export
- Better error visibility for troubleshooting
This will help identify where the PDF copying is failing and why .note files might appear in assets instead.
PROBLEM: Debug logs not showing with LOG_LEVEL=DEBUG
CHANGES:
- Change debug logs to INFO level for PDF assets tracking
- Add emoji markers for easy identification in logs
- Fix syntax error with duplicate else block
- Always show PDF path and existence in main.py
DIAGNOSTIC MARKERS:
- 📄 Logseq PDF - Source: /path/to/source.pdf
- 📄 Logseq PDF - Asset: /path/to/asset.pdf
- 📄 Logseq PDF - Source exists: True/False
- 📄 Logseq PDF - Source size: 12345 bytes
- 📄 Logseq PDF - Copied successfully
- 📄 Logseq PDF - Asset exists: True/False
This will help identify exactly where the PDF copying process is failing and why .note files appear instead of PDFs.
PROBLEM IDENTIFIED:
- PDF assets were being saved as .note files instead of .pdf
- Root cause: flat_filename.replace('.md', '.pdf') but filename was .note
SOLUTION:
- Replace .note with .md for markdown files
- Replace .note with .pdf for PDF assets
- Keep flat_filename generation unchanged (it works correctly)
BEFORE:
📄 Logseq PDF - Asset: .../Note_20251230_Comidas semana Navidades.note
AFTER:
📄 Logseq PDF - Asset: .../Note_20251230_Comidas semana Navidades.pdf
This fixes the regression where PDFs weren't being properly
saved in Logseq assets folder.
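The root cause above is easy to reproduce: `str.replace` does nothing when the substring is absent, so a name ending in .note passes through unchanged. The sketch below shows the failing call next to one possible fix. The commit itself replaces .note explicitly; swapping the suffix with `pathlib` is an alternative chosen here for illustration, and both function names are hypothetical.

```python
from pathlib import PurePosixPath


def buggy_pdf_asset_name(flat_filename):
    # No '.md' in a '.note' filename, so replace() returns the input unchanged.
    return flat_filename.replace(".md", ".pdf")


def pdf_asset_name(flat_filename):
    # Swap whatever the final suffix is (.note or .md) for .pdf.
    return str(PurePosixPath(flat_filename).with_suffix(".pdf"))


if __name__ == "__main__":
    name = "Note_20251230_Comidas semana Navidades.note"
    print(buggy_pdf_asset_name(name))  # still ends in .note
    print(pdf_asset_name(name))        # ends in .pdf
```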
MAJOR IMPROVEMENT: Enhanced front matter and intelligent tagging
NEW FEATURES:
- AI-powered content type classification (Meal-Planning, Meeting, Notes, etc.)
- Smart date extraction from filename and content
- Language detection (Spanish/English)
- Enhanced tag generation with hierarchical structure
- Improved front matter with comprehensive metadata
TECHNICAL IMPLEMENTATION:
- MetadataAnalyzer class with rule-based and AI analysis
- Content classification using keyword patterns
- Tag mapping for better discoverability
- Date extraction from YYYYMMDD filenames
- Enhanced front matter builder
FRONT MATTER EXAMPLE:
BENEFITS:
- Better organization through intelligent classification
- Enhanced discoverability with hierarchical tags
- Automatic date extraction from filenames
- Language-aware processing
- Consistent metadata structure
CONTENT TYPES SUPPORTED:
- Meal-Planning, Meeting, Notes, Ideas, Planning, Calendar, Other
TAG HIERARCHY:
- Primary: [[Supernote/ContentType]]
- Related: [[Planning/Food]], [[Work/Meetings]], etc.
- Content-specific: [[food]], [[project]], etc.
This transforms basic OCR export into intelligent, organized knowledge management.
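Date extraction from YYYYMMDD filenames could look roughly like the sketch below. This is not the MetadataAnalyzer code; `extract_date_from_filename` is a hypothetical helper, and rejecting impossible dates by letting `strptime` fail is an assumption about the desired behavior.

```python
import re
from datetime import datetime

# Exactly eight digits that are not part of a longer digit run.
DATE_IN_NAME = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def extract_date_from_filename(filename):
    """Return a datetime.date parsed from a YYYYMMDD run in the name, or None."""
    match = DATE_IN_NAME.search(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date (e.g. month 13).
        return None


if __name__ == "__main__":
    found = extract_date_from_filename("Note_20251230_Comidas semana Navidades.note")
    print(found.isoformat() if found else "no date")
```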
ISSUES FIXED:
1. PDF link was pointing to .note files instead of .pdf files
2. Note titles included 'Note_' prefix and dates
CHANGES:
- Added clean_note_title() function to remove date prefixes and 'Note_' prefix
- Fixed PDF link generation to use correct .pdf extension
- Enhanced PDF link with clean, readable titles
EXAMPLES:
- 'Note_20251230_Comidas semana Navidades.note' → 'Comidas semana Navidades'
- PDF link:
BENEFITS:
- Clean, readable titles in Logseq
- Correct PDF linking functionality
- Better user experience with meaningful titles
FRONT MATTER CLEANUP:
- Removed ocr-confidence from front matter properties
- Removed language from front matter properties
- Kept both fields in Logseq metadata block for visibility
FRONT MATTER NOW CONTAINS:
- source:: Supernote
- path:: Supernote/ContentType
- date:: YYYY-MM-DD (if extracted)
- processed:: YYYY-MM-DD
- type:: [[Supernote/ContentType]]
- tags:: [[Enhanced]], [[Hierarchical]], [[Tags]]
METADATA BLOCK STILL INCLUDES:
- ¦ - **Confianza OCR**: 95.0%
- ¦ - **Idioma**: es
BENEFITS:
- Cleaner, more focused front matter
- Essential metadata still visible in content
- Better separation of concerns (properties vs display info)
CORRECTION: Keep OCR confidence and language in front matter, remove from content block
FRONT MATTER (Properties):
- source:: Supernote
- path:: Supernote/ContentType
- date:: YYYY-MM-DD
- processed:: YYYY-MM-DD
- ocr-confidence:: 95.0% ← MOVED HERE
- language:: es ← MOVED HERE
- type:: [[Supernote/ContentType]]
- tags:: [[Enhanced]], [[Tags]]
CONTENT METADATA BLOCK:
- ¦ - **Fecha procesamiento**: [[January 16, 2026]]
- ¦ - **Páginas**: 1
- ¦ - **Palabras**: 45
- ¦ - **Tipo contenido**: Meal-Planning
- ¦ - **Fecha nota**: [[2025-12-30]]
BENEFITS:
- Front matter contains all searchable properties
- Cleaner content block with essential info only
- Better separation of concerns
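The property list above uses Logseq's `key:: value` form rather than YAML. A minimal sketch of rendering such a block from a dictionary follows; `build_front_matter` is a hypothetical name, and skipping empty values (for example a date that could not be extracted) is an assumption.

```python
def build_front_matter(properties):
    """Render a mapping as Logseq page properties, one 'key:: value' per line."""
    lines = []
    for key, value in properties.items():
        if value is None or value == "":
            # Assumed behavior: omit properties that have no value.
            continue
        lines.append(f"{key}:: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_front_matter({
        "source": "Supernote",
        "path": "Supernote/Meal-Planning",
        "date": None,
        "processed": "2026-01-16",
        "ocr-confidence": "95.0%",
        "language": "es",
        "type": "[[Supernote/Meal-Planning]]",
    }))
```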
MAJOR REDESIGN: Enhanced Logseq export with native format and comprehensive metadata
FRONT MATTER ENHANCEMENTS:
- source:: [[Supernote]] (now a link)
- processed:: [[Jan 16th, 2026]] (Logseq date format)
- Added pages:: and words:: properties
- Complete metadata in front matter
NEW NATIVE OUTLINE STRUCTURE:
- Page headers: "- Página 1/4", "- Página 2/4"
- Proper Logseq hierarchy with indentation
- Content as child elements of page headers
- Clean paragraph separation
CONTENT FORMAT CHANGES:
- Removed metadata block from content
- Added "## Resumen generado" section
- Native Logseq bullet structure
- Page-by-page organization
EXAMPLE OUTPUT:
- ## Resumen generado
- AI-generated summary here...
- ## Contenido
- Página 1/4
- Content from page 1...
- Página 2/4
- Content from page 2...
BENEFITS:
- Perfect Logseq integration with native format
- Enhanced searchability with comprehensive front matter
- Better organization with page structure
- Clean separation of metadata and content
- Improved readability and navigation
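The `processed:: [[Jan 16th, 2026]]` value above needs an English ordinal suffix, which `strftime` does not provide. The sketch below shows one way to build it, assuming the target graph uses the date format shown in the commit; `logseq_journal_link` and `ordinal_suffix` are hypothetical names.

```python
from datetime import date

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def ordinal_suffix(day):
    """Return 'st', 'nd', 'rd' or 'th' for a day of the month."""
    if 11 <= day % 100 <= 13:
        return "th"  # 11th, 12th, 13th are irregular
    return {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")


def logseq_journal_link(d):
    """Format a date as a page link such as [[Jan 16th, 2026]]."""
    return f"[[{MONTHS[d.month - 1]} {d.day}{ordinal_suffix(d.day)}, {d.year}]]"


if __name__ == "__main__":
    print(logseq_journal_link(date(2026, 1, 16)))
```

The month names are hard-coded instead of using `%b` so the output does not change with the process locale.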
PAGE TITLE CLEANUP:
- Added clean_page_title() function to remove 'Note_' and date prefixes
- Enhanced format_content_for_logseq_outline() to accept note title
- Single-page notes now use clean title as header instead of 'Página 1/1'
- Multi-page notes continue using 'Página N/M' format
EXAMPLES:
- 'Note_20251230_Comidas semana Navidades' -> 'Comidas semana Navidades'
- '20251230_Meeting notes' -> 'Meeting notes'
- 'Note_Project ideas' -> 'Project ideas'
BEHAVIOR:
- Single page: Uses clean title as header
- Multi-page: Uses 'Página 1/4', 'Página 2/4', etc.
- All date prefixes removed from headers
- All 'Note_' prefixes removed from headers
BENEFITS:
- Cleaner, more readable page headers
- No redundant date information (already in front matter)
- Better user experience with meaningful titles
- Consistent title cleaning across PDF links and page headers
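The single-page versus multi-page header rule can be sketched as below. The function name appears in the commit message, but this signature, the `page_label` parameter, the two-space indentation, and emitting page headers at the top level of the outline are all assumptions made for illustration.

```python
def format_content_for_logseq_outline(pages, note_title, page_label="Página"):
    """Render page texts as an outline: one header bullet per page, paragraphs as children."""
    total = len(pages)
    lines = []
    for number, text in enumerate(pages, start=1):
        # A single-page note uses its clean title; otherwise 'Página N/M'.
        header = note_title if total == 1 else f"{page_label} {number}/{total}"
        lines.append(f"- {header}")
        for paragraph in text.split("\n\n"):
            flattened = " ".join(paragraph.split())  # keep each paragraph on one bullet
            if flattened:
                lines.append(f"  - {flattened}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_content_for_logseq_outline(["Lunes: sopa\n\nMartes: pasta"], "Comidas semana Navidades"))
    print(format_content_for_logseq_outline(["first page", "second page"], "ignored for multi-page"))
```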
FILENAME CLEANUP:
- Modified flat_filename generation to use clean_page_title()
- Now removes 'Note_' and 'YYYYMMDD_' prefixes from output files
- Updated all references to work with cleaned filenames
EXAMPLES:
- 'Note_20251230_Comidas semana Navidades.md' -> 'Comidas semana Navidades.md'
- 'Note_Libro Anticonformismo snippet.md' -> 'Libro Anticonformismo snippet.md'
- '20251230_Meeting notes.md' -> 'Meeting notes.md'
TECHNICAL CHANGES:
- flat_filename now includes .md extension after cleaning
- Updated md_output_path to use flat_filename directly
- Updated pdf_asset_path to replace .md with .pdf
- Updated PDF link generation to use .md to .pdf replacement
BENEFITS:
- Clean, readable filenames in Logseq pages directory
- No redundant prefixes in file names
- Consistent with clean titles used in content
- Better file organization and readability
ISSUE: Filenames were appearing as '20251230_Comidas semana Navidades.note.md'
ROOT CAUSE: .note extension was not removed before applying clean_page_title()
FIX:
- Remove .note extension first
- Then apply clean_page_title() to remove date and Note_ prefixes
- Finally add .md extension
RESULT:
- '20251230_Comidas semana Navidades.note.md' -> 'Comidas semana Navidades.md'
- 'Libro Anticonformismo snippet.note.md' -> 'Libro Anticonformismo snippet.md'
ISSUE: Files like 'Note_20251230_Comidas...' were becoming '20251230_Comidas...'
ROOT CAUSE: Removing date first left 'Note_' which was then removed, but date remained
FIX:
- Changed order: Remove 'Note_' prefix FIRST, then remove date prefix
- Applied to both clean_note_title() and clean_page_title()
LOGIC:
1. 'Note_20251230_Comidas semana Navidades'
2. Remove 'Note_' -> '20251230_Comidas semana Navidades'
3. Remove date -> 'Comidas semana Navidades'
RESULT:
- 'Note_20251230_Comidas semana Navidades.md' -> 'Comidas semana Navidades.md'
- '20251230_Comidas semana Navidades.md' -> 'Comidas semana Navidades.md'
- 'Note_Libro Anticonformismo snippet.md' -> 'Libro Anticonformismo snippet.md'
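The ordering problem is easiest to see with anchored patterns: a date pattern anchored at the start of the string cannot match while 'Note_' is still in front of it. The sketch below reproduces the wrong order next to the corrected one. The name `clean_page_title` is from the commits, but this body, the regular expressions, and the `clean_wrong_order` helper are assumptions.

```python
import re


def clean_wrong_order(name):
    """Date first, then 'Note_': the date survives when 'Note_' precedes it."""
    name = re.sub(r"^\d{8}_", "", name)
    return re.sub(r"^Note_", "", name)


def clean_page_title(name):
    """Strip a trailing .note, then 'Note_', then a leading YYYYMMDD_ prefix."""
    if name.endswith(".note"):
        name = name[: -len(".note")]
    name = re.sub(r"^Note_", "", name)
    return re.sub(r"^\d{8}_", "", name)


if __name__ == "__main__":
    raw = "Note_20251230_Comidas semana Navidades"
    print(clean_wrong_order(raw))
    print(clean_page_title(raw + ".note") + ".md")
```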
COMPREHENSIVE TRANSLATION:
- Translated all user-facing strings from Spanish to English
- Updated code comments and documentation to English
- Maintained backward compatibility by keeping Spanish keywords in detection patterns
FILES UPDATED:
app/logseq_exporter.py:
- 'Página' → 'Page' (page headers)
- 'Resumen generado' → 'Generated Summary'
- 'Contenido' → 'Content'
app/metadata_analyzer.py:
- Content type keywords: Added English equivalents while keeping Spanish for compatibility
- AI prompts: Fully translated to English
- Pattern matching: Bilingual support (English primary, Spanish secondary)
- Examples in docstrings: Translated to English
BACKWARD COMPATIBILITY:
- All Spanish keywords remain in detection patterns
- Existing Spanish notes will continue to be classified correctly
- Both English and Spanish content supported
EXAMPLES:
Before: '- ## Resumen generado'
After: '- ## Generated Summary'
Before: 'Página 1/4'
After: 'Page 1/4'
Before: 'comidas', 'desayuno', 'almuerzo'
After: 'meals', 'breakfast', 'lunch' (+ Spanish kept)
NO REGRESSIONS:
- All functionality preserved
- No breaking changes
- Tests pass (bilingual support maintained)
ISSUE: UnboundLocalError when extracting tags for non-Meal-Planning types
ROOT CAUSE: Inconsistent variable naming (food_patterns vs all_patterns)
FIX:
- Use consistent 'all_patterns' variable name across all content types
- Removed redundant conditional assignment
VERIFICATION:
- Tested with English content: ✅ Working
- Tested with Spanish content: ✅ Working
- Tested Meal-Planning (EN): ✅ Working
- Tested Meal-Planning (ES): ✅ Working
All bilingual support maintained and working correctly.
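A reconstruction of the kind of bug described above, not the actual metadata_analyzer.py code: when only one branch binds the name that is read later, every other branch raises UnboundLocalError. The meal keywords are taken from the translation commit; `GENERIC_PATTERNS` and both function names are hypothetical.

```python
MEAL_PATTERNS = ["meals", "breakfast", "lunch", "comidas", "desayuno", "almuerzo"]
GENERIC_PATTERNS = ["project", "idea"]  # hypothetical patterns for other content types


def extract_tags_buggy(content_type, text):
    if content_type == "Meal-Planning":
        all_patterns = MEAL_PATTERNS
    else:
        food_patterns = GENERIC_PATTERNS  # wrong name: all_patterns stays unbound here
    return [word for word in all_patterns if word in text.lower()]


def extract_tags(content_type, text):
    # One name, bound on every path.
    all_patterns = MEAL_PATTERNS if content_type == "Meal-Planning" else GENERIC_PATTERNS
    lowered = text.lower()
    return [word for word in all_patterns if word in lowered]


if __name__ == "__main__":
    try:
        extract_tags_buggy("Meeting", "project sync")
    except UnboundLocalError as error:
        print("buggy version failed:", type(error).__name__)
    print(extract_tags("Meeting", "project sync"))
```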
- Ensure repository is safe for public sharing and pull request
🎯 Overview
This PR adds comprehensive Logseq integration to the Supernote OCR Enhancer, enabling seamless export of OCR-processed notes to Logseq with AI-powered metadata extraction, intelligent tagging, and searchable PDF generation.
✨ Key Features
1. Logseq Flat Export Structure
../assets/ paths
2. AI-Powered Metadata Enhancement
#recipe, #meeting, #python)
3. Searchable PDF Export
4. Enhanced OCR Processing
🔧 Technical Improvements
logseq_exports table for tracking export state
📝 Documentation
.env examples for all features
🧪 Testing
📦 What's Included
New Files:
- app/logseq_exporter.py - Core Logseq export logic
- app/metadata_analyzer.py - AI-powered content analysis
- docs/LOGSEQ_INTEGRATION.md - Complete integration guide
- docs/LOGSEQ_FLAT_EXPORT.md - Export structure documentation
- docs/LOGSEQ_PDF_FLOW.md - PDF workflow guide
- test_logseq_flat.py - Export functionality tests
Enhanced Files:
- app/main.py - Integrated Logseq export into main processing loop
- app/text_processor.py - AI text cleanup and summarization
- app/pdf_exporter.py - Searchable PDF generation
- .env.example - Added Logseq configuration options
🎨 Use Cases
This enhancement enables powerful workflows:
🔄 Backward Compatibility
📊 Impact
🙏 Acknowledgments
Built on the excellent foundation of the original Supernote OCR Enhancer. This contribution aims to extend its utility for users who want to integrate their Supernote workflow with Logseq and other knowledge management tools.