Enrich HAL papers with DOIs via CrossRef title search #36

@BenjaminNavet

Problem

HAL currently provides 0% DOI coverage: in a recent 133K-paper collection, none of the 5105 HAL papers had a DOI. Papers without a DOI cannot get citation counts, so they bypass citation filtering entirely, which reduces the quality of the aggregated output.

Proposed Solution

Add a DOI enrichment step for HAL papers during aggregation:

  1. For each HAL paper missing a DOI, query the CrossRef API with title + first author
  2. Use fuzzy title matching (threshold ~90%) to validate that the returned CrossRef record, and therefore its DOI, matches the original paper
  3. Write the recovered DOI back into the aggregated data before citation fetching
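
Step 2 above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes standard-library `difflib` for the fuzzy match (the issue does not name a matching library), a hypothetical paper dict with `title` and `first_author_family` keys, and candidates shaped like CrossRef `/works` items (`DOI`, `title` as a list, `author` as a list of dicts with `family`). The function names and the example DOIs are made up.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from typing import List, Optional


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two titles after normalization."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def pick_doi(paper: dict, candidates: List[dict], threshold: float = 0.90) -> Optional[str]:
    """Return the DOI of the best-matching candidate, or None if none qualifies.

    A candidate qualifies when its title similarity reaches the threshold and,
    if the paper has a first author, that family name appears among the
    candidate's authors.
    """
    wanted_author = normalize_title(paper.get("first_author_family", ""))
    best_doi, best_score = None, 0.0
    for item in candidates:
        titles = item.get("title") or []
        if not titles or not item.get("DOI"):
            continue
        families = {normalize_title(a.get("family", "")) for a in item.get("author", [])}
        if wanted_author and wanted_author not in families:
            continue
        score = title_similarity(paper["title"], titles[0])
        if score >= threshold and score > best_score:
            best_doi, best_score = item["DOI"], score
    return best_doi


# Illustrative data only; the DOIs below are invented.
hal_paper = {"title": "Étude des réseaux de neurones profonds", "first_author_family": "Dupont"}
crossref_items = [
    {"DOI": "10.0000/example-wrong", "title": ["A completely different paper"],
     "author": [{"family": "Dupont"}]},
    {"DOI": "10.0000/example-right", "title": ["Etude des reseaux de neurones profonds."],
     "author": [{"family": "Dupont", "given": "A."}]},
]
recovered = pick_doi(hal_paper, crossref_items)
```

Returning `None` when nothing clears the threshold keeps the step conservative: a HAL paper stays without a DOI rather than receiving a wrong one, which would corrupt deduplication.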

Expected Impact

  • Citation coverage: HAL papers would participate in citation filtering instead of getting a free pass
  • Deduplication: More HAL papers would match against papers from other APIs (DOI is the primary dedup key)
  • Quality: Better relevance ranking since citation scores would be available

Technical Notes

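  • CrossRef /works endpoint supports field queries such as query.author; for titles, query.title appears to be deprecated in favor of query.bibliographic, so the exact parameter should be checked against the current CrossRef documentation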
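  • Rate limit: roughly 3 req/sec without the polite pool, roughly 10 req/sec with mailto configured (approximate; CrossRef reports the limits in effect in the X-Rate-Limit-Limit and X-Rate-Limit-Interval response headers)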
  • Could reuse existing CrossRef infrastructure in scilex/citations/citations_tools.py
  • Should be optional (config flag) since it adds API calls during aggregation
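
The notes above could translate into a request builder plus a simple throttle, sketched below under several assumptions. `build_query_url` and `Throttle` are hypothetical names and are not taken from `scilex/citations/citations_tools.py`. The title is sent through `query.bibliographic` by default because `query.title` may no longer be accepted; the field name is a parameter so either can be used. `rows`, `select`, and `mailto` are standard CrossRef query parameters.

```python
import time
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title, first_author, mailto=None, rows=5,
                    title_field="query.bibliographic"):
    """Build a /works search URL for one paper (no request is sent here)."""
    params = {
        title_field: title,
        "query.author": first_author,
        "rows": rows,                    # only a few candidates are needed
        "select": "DOI,title,author",    # keep the response small
    }
    if mailto:
        params["mailto"] = mailto        # opts into the polite pool
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


class Throttle:
    """Enforce a minimum interval between calls to wait()."""

    def __init__(self, max_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_sec
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


url = build_query_url("Deep learning for HAL", "Dupont", mailto="team@example.org")
```

In use, the enrichment loop would call `Throttle.wait()` before each request and skip the whole step when the proposed config flag is off. The clock and sleep functions are injectable so the throttle can be tested without real delays.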

Metadata

Labels: data-quality (Data quality, deduplication, metadata completeness), enhancement (New feature or request)