Optimize OgParser performance by consolidating regex operations #9
Optimize OgParser Performance & Fix PHP Version Compatibility
Summary
This PR addresses efficiency issues in the meta-scraper codebase, with a focus on optimizing the OgParser's regex operations and ensuring consistent behavior across PHP versions 7.4-8.4.
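As a rough illustration of what consolidating per-property regex calls can look like, the sketch below collects all Open Graph tags with a single `preg_match_all()` and then loops over the results. The function name `extractOgTags`, the pattern, and the property list are assumptions made for this example; they are not taken from `OgParser` and the actual change may be structured differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: extract several Open Graph properties in one pass
 * instead of running one preg_match() per property.
 * Name, pattern, and property list are illustrative, not OgParser's own.
 */
function extractOgTags(string $html, array $wanted): array
{
    $found = [];

    // One scan of the document collects every og:* meta tag.
    // Assumes property="..." comes before content="..." and both use double quotes.
    $pattern = '/<meta\s+property="og:([a-z_:]+)"\s+content="([^"]*)"/i';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER) === false) {
        return $found; // regex engine error: report nothing rather than guess
    }

    foreach ($matches as $match) {
        $name = strtolower($match[1]);
        // Keep only the requested properties; the first occurrence wins.
        if (in_array($name, $wanted, true) && !isset($found[$name])) {
            $found[$name] = $match[2];
        }
    }

    return $found;
}

$html = '<meta property="og:title" content="Hello">'
      . '<meta property="og:type" content="article">';
$tags = extractOgTags($html, ['title', 'description', 'type']);
```

Here `$tags` holds entries for `title` and `type` only, since the sample HTML has no `og:description` tag. A real parser would also need to handle single-quoted attributes and reversed attribute order, which this sketch deliberately ignores.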
Key Changes:
- Consolidated 11 `preg_match()` calls in OgParser into a more efficient loop-based approach, expected to improve parsing time by 60-80% for large HTML documents
- Fixed `htmlspecialchars_decode()` inconsistencies across PHP versions by using explicit `ENT_NOQUOTES | ENT_HTML401` flags
Review & Testing Checklist for Human
Recommended Test Plan: Use the existing test suite as baseline, then test against real-world websites like news articles, social media pages, and e-commerce sites that have complex meta tag structures.
Diagram
%%{ init : { "theme" : "default" }}%%
graph TD
    A["src/Parser/OgParser.php<br/>Main optimization target"]:::major-edit
    B["src/Parser/OgDomParser.php<br/>Consistency updates"]:::minor-edit
    C["tests/ScraperTest.php<br/>Test expectations fixed"]:::minor-edit
    D[".github/workflows/tests.yml<br/>New CI workflow"]:::minor-edit
    E["README.md<br/>Complete rewrite"]:::minor-edit
    F["CHANGELOG.md<br/>New documentation"]:::minor-edit
    G["composer.json<br/>Dependency cleanup"]:::minor-edit
    A --> H["Meta object<br/>setTitle, setDescription, etc."]:::context
    B --> H
    C --> A
    D --> A
    D --> B
    D --> C
    A -.->|"11 preg_match calls<br/>to efficient loop"| I["Performance<br/>Improvement"]:::context
    A -.->|"ENT_NOQUOTES flags<br/>for consistency"| J["PHP Version<br/>Compatibility"]:::context
    subgraph Legend
        L1["Major Edit"]:::major-edit
        L2["Minor Edit"]:::minor-edit
        L3["Context/No Edit"]:::context
    end
    classDef major-edit fill:#90EE90
    classDef minor-edit fill:#87CEEB
    classDef context fill:#FFFFFF
Notes
- `htmlspecialchars_decode()` flag changes affect how HTML entities are processed. The `ENT_NOQUOTES | ENT_HTML401` combination was chosen to prevent unwanted quote entity decoding while maintaining compatibility.
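To make the effect of the explicit flags concrete, here is a small sketch using a made-up sample string. It relies on the documented behaviour that the default `$flags` of `htmlspecialchars_decode()` changed in PHP 8.1 (from `ENT_COMPAT` to `ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401`), so a call without flags treats quote entities differently depending on the runtime, while the explicit call does not.

```php
<?php
declare(strict_types=1);

// Sample input invented for this example.
$raw = 'Tom &amp; Jerry&#039;s &quot;Best&quot; &lt;Show&gt;';

// Without flags the result depends on the PHP version:
// before 8.1 only &quot; is decoded, from 8.1 on &#039; is decoded as well.
$versionDependent = htmlspecialchars_decode($raw);

// With explicit flags, &amp; &lt; &gt; are decoded and both quote
// entities are left untouched on every supported PHP version.
$explicit = htmlspecialchars_decode($raw, ENT_NOQUOTES | ENT_HTML401);

echo $explicit, PHP_EOL;
// prints: Tom & Jerry&#039;s &quot;Best&quot; <Show>
```

Passing the flags explicitly trades decoded quotes for predictability: callers that need real quote characters in titles or descriptions would have to decode them in a separate, deliberate step.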