Bulk import PDFs from a local folder #57

@dakl

Summary

Support importing an entire folder of PDFs into PaperShelf. This is the main onboarding path for users who already have a local library of papers.

Current state

PaperShelf is arXiv-centric: every paper requires an arxivId as its primary identifier. Metadata is fetched from the arXiv API, and citations are resolved via Semantic Scholar using that arXiv ID. There is no support for papers that don't have an arXiv entry.

Questions to resolve

1. Matching local PDFs to arXiv entries

Given a folder of PDFs with arbitrary filenames, how do we find the corresponding arXiv paper?

Possible approaches:

  • Extract text from PDF → search arXiv by title. Use the existing pdf-parse text extraction, pull the title from the first page, and query the arXiv API. This is fragile: extracted titles may not match exactly because of formatting, special characters, and line breaks.
  • Extract DOI from PDF metadata or text → resolve to arXiv ID. Many PDFs embed a DOI in their metadata or first page. Use CrossRef or Semantic Scholar to map DOI → arXiv ID.
  • Use Semantic Scholar title search as a fuzzy fallback. S2 has a paper search endpoint that returns arXiv IDs when available. Could be more tolerant of title variations than the arXiv API.
  • Fingerprint-based matching. Hash the PDF content or extract a unique identifier (e.g., first N characters of abstract) and match against known databases.
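
To make the DOI and title approaches above more concrete, here is a minimal sketch of the extraction helpers they would need. It assumes the text has already been extracted (for example by the existing pdf-parse step). The function names and regular expressions are illustrative assumptions, not existing PaperShelf code, and the DOI pattern is a common heuristic rather than a complete grammar.

```typescript
// Hypothetical helpers (not existing PaperShelf code).

// Heuristic DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
const DOI_PATTERN = /\b10\.\d{4,9}\/[^\s"<>]+/i;

// New-style arXiv IDs such as "arXiv:2301.01234v2" (YYMM.NNNN or YYMM.NNNNN).
const ARXIV_PATTERN = /arXiv:\s*(\d{4}\.\d{4,5})(v\d+)?/i;

function extractDoi(text: string): string | null {
  const match = DOI_PATTERN.exec(text);
  if (!match) return null;
  // Trailing punctuation usually belongs to the sentence, not the DOI.
  // DOIs are case-insensitive, so lowercase for stable comparison.
  return match[0].replace(/[.,;:)\]]+$/, "").toLowerCase();
}

function extractArxivId(text: string): string | null {
  const match = ARXIV_PATTERN.exec(text);
  // Return the ID without the version suffix.
  return match ? match[1] : null;
}

// Reduce a title to lowercase ASCII words so that line breaks, punctuation,
// and casing differences do not defeat a comparison. Non-ASCII letters are
// dropped, which is a known limitation of this sketch.
function normalizeTitle(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}
```

Many arXiv PDFs carry the `arXiv:...` stamp on the first page, so checking for it before any network lookup may resolve a good share of files cheaply.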

We likely need a pipeline that tries multiple strategies in order and presents uncertain matches to the user for confirmation.
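
A rough sketch of such a pipeline, under stated assumptions: each strategy is an async function that returns a candidate with a confidence between 0 and 1, strategies run in the given order, and the first high-confidence hit wins. The type names, `matchPdf`, and the 0.9 / 0.6 thresholds are placeholders for discussion, not a settled design.

```typescript
type MatchStatus = "matched" | "uncertain" | "unmatched";

interface ExtractedPdf {
  path: string;
  text: string;
}

interface Candidate {
  arxivId: string | null; // null when the paper is known but not on arXiv
  title: string;
  confidence: number; // 0..1, assigned by the strategy
}

interface Strategy {
  name: string;
  run: (pdf: ExtractedPdf) => Promise<Candidate | null>;
}

interface MatchResult {
  status: MatchStatus;
  candidate: Candidate | null;
  via: string | null; // name of the strategy that produced the candidate
}

async function matchPdf(
  pdf: ExtractedPdf,
  strategies: Strategy[],
  accept = 0.9, // assumed threshold for "matched"
  review = 0.6, // assumed threshold for "uncertain"
): Promise<MatchResult> {
  let best: Candidate | null = null;
  let bestVia: string | null = null;

  for (const strategy of strategies) {
    let found: Candidate | null = null;
    try {
      found = await strategy.run(pdf);
    } catch {
      // A failed lookup (network error, rate limit) should not abort the import.
      continue;
    }
    if (found === null) continue;
    if (found.confidence >= accept) {
      return { status: "matched", candidate: found, via: strategy.name };
    }
    if (best === null || found.confidence > best.confidence) {
      best = found;
      bestVia = strategy.name;
    }
  }

  if (best !== null && best.confidence >= review) {
    return { status: "uncertain", candidate: best, via: bestVia };
  }
  return { status: "unmatched", candidate: null, via: null };
}
```

Stopping at the first confident match keeps the number of API calls per PDF low, which matters when importing hundreds of files against rate-limited services.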

2. Handling papers with no arXiv entry

Many papers (conference proceedings, journals, theses, internal reports, non-CS fields) are not on arXiv at all. The current data model requires arxivId as a UNIQUE non-null column.

Questions:

  • Schema change: Should arxivId become nullable? Or should we introduce a more general identifier system (DOI, S2 ID, ISBN, or a synthetic internal ID)?
  • Metadata sources for non-arXiv papers: CrossRef (via DOI), Semantic Scholar (via title search or DOI), Google Scholar, or manual entry?
  • Citation graph: Semantic Scholar can resolve citations by DOI or S2 ID, not just arXiv ID. Should we generalize the citation fetching to support multiple identifier types?
  • User experience: What does a "paper without metadata" look like in the UI? Minimal card with just filename + extracted title? A prompt to manually fill in metadata?
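
One possible shape for the "more general identifier system" option, as a sketch for discussion: a synthetic internal ID as the primary key, with arXiv ID, DOI, and S2 ID as optional external identifiers. All type and function names here are hypothetical. The same identity keys could also drive the duplicate check during import.

```typescript
// Hypothetical data model: no external identifier is required.
interface ExternalIds {
  arxivId?: string;
  doi?: string;
  s2Id?: string;
}

interface PaperRecord {
  id: string; // synthetic internal primary key
  title: string;
  externalIds: ExternalIds;
  sourceFile?: string; // original filename, useful for papers with no metadata
}

// One comparable key per known identifier.
function identityKeys(ids: ExternalIds): string[] {
  const keys: string[] = [];
  // Ignore the arXiv version suffix so v1 and v2 count as the same paper.
  if (ids.arxivId) keys.push("arxiv:" + ids.arxivId.replace(/v\d+$/, ""));
  // DOIs are case-insensitive.
  if (ids.doi) keys.push("doi:" + ids.doi.toLowerCase());
  if (ids.s2Id) keys.push("s2:" + ids.s2Id);
  return keys;
}

// Return the first library entry that shares any identifier with the incoming paper.
function findDuplicate(incoming: ExternalIds, library: PaperRecord[]): PaperRecord | null {
  const wanted = identityKeys(incoming);
  for (const paper of library) {
    const have = identityKeys(paper.externalIds);
    if (wanted.some((key) => have.indexOf(key) !== -1)) return paper;
  }
  return null;
}
```

A paper with no identifiers at all never matches anything here, so it would need a separate fallback (title comparison or a file hash) if duplicates of such papers are a concern.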

3. Import UX

  • Progress feedback: Importing hundreds of PDFs with text extraction + metadata resolution will be slow, so the import needs a progress indicator and the ability to cancel.
  • Conflict resolution: What if a PDF matches a paper already in the library? Skip, overwrite, or ask?
  • Batch review: After matching, show the user a list of matched/unmatched papers so they can confirm or correct before committing to the database.
  • Folder watching (future): Should we support watching a folder for new PDFs and auto-importing? Or is one-time import enough for v1?
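
For the progress and cancel requirement, a minimal sketch of a sequential import loop. `importAll`, `ImportProgress`, and `CancelToken` are hypothetical names; a plain flag object is used for cancellation to keep the example dependency-free, and an `AbortController` would be a reasonable alternative. In the app, `onProgress` would presumably forward each event to the renderer.

```typescript
interface ImportProgress {
  done: number;
  total: number;
  currentFile: string;
}

// Set `cancelled = true` (for example from a Cancel button) to stop the loop.
interface CancelToken {
  cancelled: boolean;
}

async function importAll<T>(
  files: string[],
  processOne: (file: string) => Promise<T>,
  onProgress: (progress: ImportProgress) => void,
  token: CancelToken,
): Promise<{ results: T[]; cancelled: boolean }> {
  const results: T[] = [];
  for (let i = 0; i < files.length; i++) {
    // Check between files so a cancel never interrupts a half-processed PDF.
    if (token.cancelled) return { results, cancelled: true };
    results.push(await processOne(files[i]));
    onProgress({ done: i + 1, total: files.length, currentFile: files[i] });
  }
  return { results, cancelled: false };
}
```

Because nothing is written to the database inside the loop, cancelling simply yields a shorter list for the review step.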

Proposed approach (high-level)

  1. User selects a folder via native file dialog
  2. App scans for *.pdf files
  3. For each PDF:
    a. Extract text and metadata (title, DOI if present)
    b. Try to match: DOI → S2/CrossRef → arXiv ID, then title search as fallback
    c. Classify as: matched (high confidence), uncertain (needs review), unmatched
  4. Show import review screen with results
  5. User confirms → papers are saved to database
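
Step 3c needs some confidence score for title-based matches. One simple option, sketched here as an assumption rather than a decided method, is the Jaccard similarity of the word sets of the extracted title and the candidate title, mapped to the three classes with placeholder thresholds.

```typescript
type Classification = "matched" | "uncertain" | "unmatched";

// Lowercase, split on non-alphanumerics, and de-duplicate.
function titleTokens(title: string): string[] {
  const words = title
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two word sets.
function titleSimilarity(a: string, b: string): number {
  const ta = titleTokens(a);
  const tb = titleTokens(b);
  if (ta.length === 0 || tb.length === 0) return 0;
  const shared = ta.filter((w) => tb.indexOf(w) !== -1).length;
  return shared / (ta.length + tb.length - shared);
}

// Thresholds are illustrative and would need tuning on real data.
function classify(score: number, accept = 0.9, review = 0.6): Classification {
  if (score >= accept) return "matched";
  if (score >= review) return "uncertain";
  return "unmatched";
}
```

Word-set overlap ignores word order and is insensitive to line breaks and punctuation, but it can be fooled by short generic titles, which is one reason to keep the manual review step.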

Related work needed

  • Generalize the data model to support non-arXiv papers (nullable arxivId or a new identifier system)
  • Add DOI resolution (CrossRef API or Semantic Scholar DOI lookup)
  • Add a bulk import IPC channel and progress reporting mechanism

Labels: P2 (Nice to have), enhancement (New feature or request)
