forked from datalogism/SciLEx
Open
Labels
data-quality (Data quality, deduplication, metadata completeness), enhancement (New feature or request)
Description
Problem
HAL provides 0% DOI coverage: in a recent 133K-paper collection, none of the 5105 HAL papers had a DOI. Papers without a DOI cannot get citation counts and bypass citation filtering entirely, which reduces the quality of the aggregated output.
Proposed Solution
Add a DOI enrichment step for HAL papers during aggregation:
- For each HAL paper missing a DOI, query the CrossRef API with title + first author
- Use fuzzy matching (threshold ~90%) to validate that the returned DOI belongs to the original paper
- Write the recovered DOI back into the aggregated data before citation fetching
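The validation step could be sketched as follows. This is a minimal illustration, not the SciLEx implementation: the helper names (`normalize_title`, `title_similarity`, `pick_doi`) are hypothetical, the comparison is done on titles only, and the standard library's `difflib` stands in for whichever fuzzy-matching library the project would actually use. The candidates are assumed to be CrossRef work items, where `title` is a list of strings and `DOI` is a string.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, strip accents, and collapse punctuation/whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def title_similarity(a, b):
    """Similarity in [0, 1] between two normalized titles."""
    matcher = SequenceMatcher(
        None, normalize_title(a), normalize_title(b), autojunk=False
    )
    return matcher.ratio()


def pick_doi(paper_title, candidates, threshold=0.90):
    """Return the DOI of the best-matching candidate, or None.

    `candidates` is assumed to be the list of work items returned by
    CrossRef. A DOI is accepted only if the best title similarity
    reaches the threshold (the ~90% proposed above).
    """
    best_doi, best_score = None, 0.0
    for item in candidates:
        titles = item.get("title") or []
        doi = item.get("DOI")
        if not doi or not titles:
            continue  # skip incomplete records
        score = title_similarity(paper_title, titles[0])
        if score > best_score:
            best_doi, best_score = doi, score
    return best_doi if best_score >= threshold else None
```

Returning `None` below the threshold leaves the HAL paper without a DOI rather than attaching a wrong one, which matters because the DOI is also the primary deduplication key.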
Expected Impact
- Citation coverage: HAL papers would participate in citation filtering instead of getting a free pass
- Deduplication: More HAL papers would match against papers from other APIs (DOI is the primary dedup key)
- Quality: Better relevance ranking since citation scores would be available
Technical Notes
- CrossRef `/works` endpoint supports `query.title` and `query.author` parameters
- Rate limit: ~3 req/sec without polite pool, ~10 req/sec with `mailto` configured
- Could reuse existing CrossRef infrastructure in `scilex/citations/citations_tools.py`
- Should be optional (config flag) since it adds API calls during aggregation
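A rough sketch of how these notes might fit together is shown below. Everything here is an assumption for illustration: the config keys `enrich_hal_doi` and `mailto`, the `Throttle` class, and the helper names are hypothetical and do not reflect the existing code in `scilex/citations/citations_tools.py`. The request parameters and rate figures are the ones listed above. The sketch only builds the request URL and paces calls; it does not perform the HTTP request.

```python
import time
from urllib.parse import urlencode

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_crossref_query(title, first_author, mailto=None, rows=3):
    """Build a /works URL using the query parameters named in the notes."""
    params = {
        "query.title": title,
        "query.author": first_author,
        "rows": rows,  # a few candidates are enough for fuzzy validation
    }
    if mailto:
        params["mailto"] = mailto  # opts in to the polite pool
    return f"{CROSSREF_WORKS_URL}?{urlencode(params)}"


class Throttle:
    """Enforce a minimum interval between consecutive calls."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def make_throttle(config):
    """Return a Throttle if enrichment is enabled, else None.

    `enrich_hal_doi` and `mailto` are hypothetical config keys.
    """
    if not config.get("enrich_hal_doi", False):
        return None  # feature is off by default: no extra API calls
    rate = 10 if config.get("mailto") else 3
    return Throttle(rate)
```

In use, the aggregation step would call `throttle.wait()` before each CrossRef request and skip enrichment entirely when `make_throttle` returns `None`.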