Bulk import PDFs from a local folder #57

## Summary

Support importing an entire folder of PDFs into PaperShelf. This is the main onboarding path for users who already have a local library of papers.

## Current state

PaperShelf is arXiv-centric: every paper requires an `arxivId` as its primary identifier. Metadata is fetched from the arXiv API, and citations are resolved via Semantic Scholar using that arXiv ID. There is no support for papers that don't have an arXiv entry.

## Questions to resolve

### 1. Matching local PDFs to arXiv entries

Given a folder of PDFs with arbitrary filenames, how do we find the corresponding arXiv paper?

Possible approaches:
- **Extract text from PDF → search arXiv by title.** Use the existing `pdf-parse` text extraction, pull the title from the first page, and query the arXiv API. This is fragile: the extracted title may not match the arXiv record exactly because of formatting differences, special characters, or line breaks.
- **Extract DOI from PDF metadata or text → resolve to arXiv ID.** Many PDFs embed a DOI in their metadata or first page. Use CrossRef or Semantic Scholar to map DOI → arXiv ID.
- **Use Semantic Scholar title search as a fuzzy fallback.** S2 has a paper search endpoint that returns arXiv IDs when available. Could be more tolerant of title variations than the arXiv API.
- **Fingerprint-based matching.** Hash the PDF content or extract a unique identifier (e.g., first N characters of abstract) and match against known databases.

We likely need a pipeline that tries multiple strategies in order and presents uncertain matches to the user for confirmation.
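
For the title-based strategies, one way to reduce the fragility noted above is to compare normalized titles rather than raw strings. The sketch below is illustrative only: `normalizeTitle` and `titleSimilarity` are hypothetical helpers, not existing PaperShelf code, and token-set Jaccard overlap is just one simple choice of similarity measure.

```typescript
// Hypothetical helper: reduce a title to lowercase ASCII word tokens so that
// formatting differences (case, punctuation, line breaks) do not block a match.
function normalizeTitle(raw: string): string {
  return raw
    .normalize("NFKD")                // decompose accented characters and ligatures
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining accent marks
    .replace(/-\s*\n\s*/g, "")        // re-join words hyphenated across a line break
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")      // punctuation and whitespace -> single space
    .trim();
}

// Jaccard overlap of the two titles' token sets: 1 means identical token
// sets, 0 means no tokens in common (or an empty title).
function titleSimilarity(a: string, b: string): number {
  const tokensA = Array.from(new Set(normalizeTitle(a).split(" ").filter(Boolean)));
  const tokensB = new Set(normalizeTitle(b).split(" ").filter(Boolean));
  if (tokensA.length === 0 || tokensB.size === 0) return 0;
  const shared = tokensA.filter((t) => tokensB.has(t)).length;
  return shared / (tokensA.length + tokensB.size - shared);
}
```

With these helpers, a title extracted with a stray line break and a search result with different capitalization and a trailing period normalize to the same tokens and score 1. Jaccard overlap ignores word order and is harsh on very short titles, so an edit-distance measure may be worth comparing before settling on one.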

### 2. Handling papers with no arXiv entry

Many papers (conference proceedings, journal articles, theses, internal reports, work from fields that rarely use arXiv) are not on arXiv at all. The current data model cannot represent them because `arxivId` is a `UNIQUE`, non-null column.

Questions:
- **Schema change:** Should `arxivId` become nullable? Or should we introduce a more general identifier system (DOI, S2 ID, ISBN, or a synthetic internal ID)?
- **Metadata sources for non-arXiv papers:** CrossRef (via DOI), Semantic Scholar (via title search or DOI), Google Scholar (which has no official API), or manual entry?
- **Citation graph:** Semantic Scholar can resolve citations by DOI or S2 ID, not just arXiv ID. Should we generalize the citation fetching to support multiple identifier types?
- **User experience:** What does a "paper without metadata" look like in the UI? Minimal card with just filename + extracted title? A prompt to manually fill in metadata?
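
To make the schema question concrete, here is one possible shape for a generalized identifier, written as a TypeScript sketch. The `PaperIdentifier` type and both helper functions are hypothetical, not existing PaperShelf code. The `ARXIV:` and `DOI:` prefixes follow the paper-ID formats accepted by the Semantic Scholar Graph API. The `local` kind (for example, a content hash) is an assumption for papers that no external service knows about.

```typescript
// Hypothetical identifier model: every paper carries at least one of these.
type PaperIdentifier =
  | { kind: "arxiv"; value: string }   // e.g. "2106.15928" or "2106.15928v2"
  | { kind: "doi"; value: string }     // e.g. "10.1145/3292500.3330701"
  | { kind: "s2"; value: string }      // Semantic Scholar paper ID
  | { kind: "local"; value: string };  // synthetic ID for papers with no external match

// Canonical string form, usable as a UNIQUE key in place of a bare arxivId.
function canonicalKey(id: PaperIdentifier): string {
  switch (id.kind) {
    case "arxiv":
      return `arxiv:${id.value.replace(/v\d+$/, "")}`; // ignore the version suffix
    case "doi":
      return `doi:${id.value.toLowerCase()}`;          // DOIs are case-insensitive
    case "s2":
      return `s2:${id.value}`;
    case "local":
      return `local:${id.value}`;
  }
}

// Paper ID string for Semantic Scholar lookups, or null when the paper has
// only a local identifier and therefore nothing to look up.
function toS2PaperId(id: PaperIdentifier): string | null {
  switch (id.kind) {
    case "arxiv":
      return `ARXIV:${id.value}`;
    case "doi":
      return `DOI:${id.value}`;
    case "s2":
      return id.value;
    case "local":
      return null;
  }
}
```

A design like this would let the citation fetcher call a single lookup path for any paper where `toS2PaperId` returns a value, and skip citation resolution for local-only papers.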

### 3. Import UX

- **Progress feedback:** Importing hundreds of PDFs with text extraction + metadata resolution will be slow. Need a progress indicator and the ability to cancel.
- **Conflict resolution:** What if a PDF matches a paper already in the library? Skip, overwrite, or ask?
- **Batch review:** After matching, show the user a list of matched/unmatched papers so they can confirm or correct before committing to the database.
- **Folder watching (future):** Should we support watching a folder for new PDFs and auto-importing? Or is one-time import enough for v1?
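
The conflict-resolution and batch-review questions can be explored with a small planning step that runs before anything is written to the database. The sketch below is an assumption about how this might look, not a decided design: `ConflictPolicy`, `ImportCandidate`, and `planImport` are hypothetical names, and `key` stands for whatever unique identifier the data model ends up using. Skipping duplicate PDFs within the same folder is also an assumption.

```typescript
type ConflictPolicy = "skip" | "overwrite" | "ask";
type PlannedAction = "insert" | "skip" | "overwrite" | "ask";

interface ImportCandidate {
  file: string; // path of the local PDF
  key: string;  // unique paper key, e.g. an arXiv ID
}

interface PlannedImport extends ImportCandidate {
  action: PlannedAction;
}

// Decide what would happen to each candidate without touching the database,
// so the result can be shown on a review screen first.
function planImport(
  candidates: ImportCandidate[],
  existingKeys: ReadonlySet<string>,
  policy: ConflictPolicy,
): PlannedImport[] {
  const seen = new Set<string>(); // keys already claimed earlier in this batch
  return candidates.map((c) => {
    let action: PlannedAction;
    if (seen.has(c.key)) {
      action = "skip";            // same paper appears twice in the folder
    } else if (existingKeys.has(c.key)) {
      action = policy;            // paper is already in the library
    } else {
      action = "insert";
    }
    seen.add(c.key);
    return { ...c, action };
  });
}
```

Because the plan is plain data, the review screen could let the user change individual actions before the commit step applies them.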

## Proposed approach (high-level)

1. User selects a folder via native file dialog
2. App scans for `*.pdf` files
3. For each PDF:
   a. Extract text and metadata (title, DOI if present)
   b. Try to match: DOI → S2/CrossRef → arXiv ID, then title search as fallback
   c. Classify as: matched (high confidence), uncertain (needs review), unmatched
4. Show import review screen with results
5. User confirms → papers are saved to database
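
Steps 3a and 3c can be sketched as two small pure functions. This is a minimal illustration under stated assumptions: `extractDoi`, `classifyMatch`, and `MatchCandidate` are hypothetical names, the DOI regular expression is a commonly used heuristic that will not match every DOI ever issued, and the 0.9 and 0.6 thresholds are placeholders that would need tuning on real data.

```typescript
type MatchStatus = "matched" | "uncertain" | "unmatched";

interface MatchCandidate {
  arxivId: string;
  via: "doi" | "title"; // which strategy produced the candidate
  score: number;        // 0..1 title similarity; ignored for DOI matches
}

// Heuristic pattern for modern DOIs: "10.", a 4-9 digit registrant code, "/", suffix.
const DOI_PATTERN = /\b10\.\d{4,9}\/[-._;()\/:a-z0-9]+/i;

// Return the first DOI-like string in the extracted text, or null.
function extractDoi(text: string): string | null {
  const m = DOI_PATTERN.exec(text);
  if (m === null) return null;
  // Trailing punctuation usually belongs to the sentence, not the DOI.
  // This would truncate the rare DOI that really ends in one of these characters.
  return m[0].replace(/[.,;)]+$/, "").toLowerCase();
}

// Map a candidate (or the absence of one) to the three review buckets.
function classifyMatch(c: MatchCandidate | null): MatchStatus {
  if (c === null) return "unmatched";
  if (c.via === "doi") return "matched";  // identifier-based, treated as high confidence
  if (c.score >= 0.9) return "matched";   // assumed threshold
  if (c.score >= 0.6) return "uncertain"; // assumed threshold
  return "unmatched";
}
```

Treating every DOI-derived match as high confidence assumes the DOI found in the PDF belongs to the paper itself and not to a cited reference, which is another reason to restrict the search to PDF metadata and the first page.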

## Related work needed

- Generalize the data model to support non-arXiv papers (nullable `arxivId` or a new identifier system)
- Add DOI resolution (CrossRef API or Semantic Scholar DOI lookup)
- Add a bulk import IPC channel and progress reporting mechanism
