@dkiesow dkiesow commented Dec 17, 2025

Summary

This PR adds a new structured_data module that extracts metadata from standardized structured data formats (JSON-LD, OpenGraph, meta tags) before falling back to content-based extraction.

Changes

New Module: mcmetadata/structured_data.py

  • Extracts title, author, publication date, and description from JSON-LD structured data, OpenGraph tags, and canonical link tags
  • Prioritizes article-like JSON-LD nodes over generic webpage entries
  • Normalizes multi-author lists with comma separator
  • Detects wire service signals for syndicated content detection
  • Extracts canonical URLs for cross-domain wire detection
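The behaviors listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: the function name `extract_structured`, the helper names, the returned keys, and the set of "article-like" types are all assumptions made for the example.

```python
import json
from html.parser import HTMLParser

# Assumed set of "article-like" schema.org types; the PR may use a different list.
ARTICLE_TYPES = {"Article", "NewsArticle", "BlogPosting", "ReportageNewsArticle"}


class _StructuredDataParser(HTMLParser):
    """Collects JSON-LD blobs, og:* meta tags, and the canonical link."""

    def __init__(self):
        super().__init__()
        self.jsonld_blobs = []
        self.og = {}
        self.canonical = None
        self._in_jsonld = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self._in_jsonld = True
            self._buf = []
        elif tag == "meta" and (a.get("property") or "").startswith("og:"):
            self.og[a["property"]] = a.get("content") or ""
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.canonical = a.get("href")

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blobs.append("".join(self._buf))
            self._in_jsonld = False


def _iter_nodes(blobs):
    """Yield every dict node, flattening top-level lists and @graph arrays."""
    for blob in blobs:
        try:
            data = json.loads(blob)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD rather than failing the page
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            graph = item.get("@graph")
            if isinstance(graph, list):
                yield from (n for n in graph if isinstance(n, dict))
            else:
                yield item


def _is_article(node):
    t = node.get("@type")
    types = t if isinstance(t, list) else [t]
    return any(isinstance(x, str) and x in ARTICLE_TYPES for x in types)


def _author_names(value):
    """Join one or many authors (strings or {"name": ...} dicts) with commas."""
    names = []
    for item in value if isinstance(value, list) else [value]:
        name = item.get("name") if isinstance(item, dict) else item
        if isinstance(name, str) and name.strip():
            names.append(name.strip())
    return ", ".join(names) or None


def extract_structured(html):
    parser = _StructuredDataParser()
    parser.feed(html)
    nodes = list(_iter_nodes(parser.jsonld_blobs))
    # Stable sort: article-like nodes first, document order otherwise.
    nodes.sort(key=lambda n: not _is_article(n))
    best = nodes[0] if nodes else {}
    return {
        "title": best.get("headline") or best.get("name") or parser.og.get("og:title"),
        "authors": _author_names(best.get("author")),
        "publication_date": best.get("datePublished"),
        "description": best.get("description") or parser.og.get("og:description"),
        "canonical_url": parser.canonical or parser.og.get("og:url"),
    }
```

With a page whose JSON-LD `@graph` holds both a `WebPage` and a `NewsArticle` node, this sketch takes the headline, authors, and date from the `NewsArticle` node and only falls back to OpenGraph values for fields the JSON-LD does not provide.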

Updated: mcmetadata/__init__.py

  • Import and call structured_data.extract_from_html in main extract function
  • Prefer canonical URL from structured data before falling back to content extractor
  • Add structured_data timing to stats accumulator
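As a rough illustration of the integration described above, the following sketch prefers the structured-data canonical URL, falls back to a content-based extractor, and records how long the structured step took. The function name `resolve_canonical_url`, the two extractor callables, and the plain-dict `stats` accumulator are hypothetical stand-ins; the actual code in `mcmetadata` may be organized differently.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Optional


def resolve_canonical_url(
    html: str,
    structured_extractor: Callable[[str], dict],
    content_extractor: Callable[[str], Optional[str]],
    stats: Dict[str, float],
) -> Optional[str]:
    """Prefer the structured-data canonical URL; otherwise use the fallback."""
    start = time.monotonic()
    structured = structured_extractor(html)
    # Accumulate elapsed time under an assumed "structured_data" key.
    stats["structured_data"] += time.monotonic() - start

    url = structured.get("canonical_url")
    if url:
        return url
    return content_extractor(html)


# Usage with stand-in extractors:
stats = defaultdict(float)
url = resolve_canonical_url(
    "<html></html>",
    structured_extractor=lambda html: {"canonical_url": None},
    content_extractor=lambda html: "https://example.com/from-content",
    stats=stats,
)
```

Passing the extractors in as callables keeps the sketch self-contained; in the real module they would presumably be direct function calls.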

Benefits

  • More accurate metadata extraction using standardized formats
  • Better handling of JSON-LD graphs with multiple entries
  • Canonical URL extraction improves downstream wire detection
  • Cleaner author name formatting

Testing

Tested against wire detection integration tests in downstream fork.

- Add new structured_data module for extracting metadata from JSON-LD and OpenGraph tags
- Prioritize article-like JSON-LD nodes over generic webpage entries
- Extract canonical URLs from link tags and use for wire detection
- Normalize multi-author lists with comma separator
- Integrate structured data into main extract() function
- Prefer structured canonical URL before content extractor fallback

pgulley (Member) commented Dec 18, 2025

Hey Damon, thanks for this!
Two big comments. First, our preference would be for the new return types to be toggleable, optional behavior that defaults to off; that way we can integrate the changes and queue up more thorough integration testing down the line. Second, and in the same vein, could you include some unit tests and metrics on the new behaviors?
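
For illustration only, an opt-in toggle of the kind requested here might look like the sketch below, with a unit test covering both the default and the enabled path. The flag name `use_structured_data`, the function `extract_title`, and the stand-in extractors are hypothetical and are not taken from the PR or from `mcmetadata`.

```python
import unittest
from typing import Callable, Optional


def extract_title(
    html: str,
    content_title: Callable[[str], Optional[str]],
    structured_title: Callable[[str], Optional[str]],
    use_structured_data: bool = False,  # new behavior is off unless requested
) -> Optional[str]:
    """Return the structured-data title only when the caller opts in."""
    if use_structured_data:
        title = structured_title(html)
        if title:
            return title
    return content_title(html)


class ToggleTest(unittest.TestCase):
    def setUp(self):
        # Stand-in extractors so the test does not depend on real parsing.
        self.content = lambda html: "Content title"
        self.structured = lambda html: "Structured title"

    def test_default_keeps_existing_behavior(self):
        self.assertEqual(
            extract_title("<html/>", self.content, self.structured),
            "Content title",
        )

    def test_opt_in_prefers_structured_data(self):
        self.assertEqual(
            extract_title("<html/>", self.content, self.structured, use_structured_data=True),
            "Structured title",
        )

    def test_opt_in_falls_back_when_structured_is_empty(self):
        self.assertEqual(
            extract_title("<html/>", self.content, lambda html: None, use_structured_data=True),
            "Content title",
        )
```

Because the flag defaults to `False`, existing callers would see no change until they pass it explicitly.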

dkiesow (Author) commented Dec 18, 2025

Yep, got it. That's one of the complications of how my environment is set up, but let me take a look at that.
