The platform uses a hybrid data model:
- version-controlled repository data for the seed corpus and static research structures;
- runtime JSON data for mutable collaborative state.
This document explains both.
src/data/records.json is the seed corpus used to initialize the runtime document store when runtime-data/documents.json does not yet exist.
A seed record includes fields such as:
- id
- title
- shortTitle
- year
- place
- language
- recordType
- status
- sourceHost
- scope
- researchLenses
- historicalTerms
- description
- howDescribed
- treatmentNote
- actorsNote
- patientNote
- frequencyNote
- economyNote
- reportingNote
- attachments
Attachment entries typically include:
- label
- kind
- url
- remoteUrl
- download
Important distinction:
- records.json is repository-curated data;
- the live application reads it only for first-boot seeding and for OCR file resolution.
This is a prospecting dataset, not the live corpus.
It contains identified candidate sources that are visible in the UI but are not yet part of the mutable document runtime store unless separately uploaded or curated.
Typical fields:
- id
- shortTitle
- title
- year
- creator
- recordType
- sourceHost
- focus
- url
- ocrUrl
Static discovery resources for future corpus expansion.
Static research tracks or thematic acquisition lanes used by the interface.
The runtime store defaults to:
- runtime-data/terms.json
- runtime-data/documents.json
- runtime-data/uploads/
The root can be changed with DATA_DIR.
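
The default layout and the DATA_DIR override can be sketched as a small path resolver. This is only an illustration: `resolveRuntimePaths` is a hypothetical helper, and whether the real server resolves the default root against the working directory or the project root is not stated here.

```javascript
// Hypothetical helper: derive runtime store locations from an optional
// DATA_DIR override. Not the project's actual code.
function resolveRuntimePaths(env = process.env) {
  // Fall back to the documented default root when DATA_DIR is unset or empty,
  // and drop trailing slashes so the joined paths stay clean.
  const root = (env.DATA_DIR || "runtime-data").replace(/\/+$/, "");
  return {
    root,
    terms: `${root}/terms.json`,
    documents: `${root}/documents.json`,
    uploads: `${root}/uploads/`,
  };
}

// Default layout, then a layout under an example override path.
console.log(resolveRuntimePaths({}));
console.log(resolveRuntimePaths({ DATA_DIR: "/srv/corpus-data/" }));
```

Passing the environment in as a parameter keeps the sketch easy to exercise without touching the real process environment.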
Terms are stored as an array in terms.json.
Shape:
```json
{
  "id": "cancer-breast",
  "canonical": "cancer of the breast",
  "variants": ["breast cancer", "cancerous breast"],
  "category": "enfermedad",
  "notes": "Formula canonica amplia para detectar menciones directas."
}
```

Semantics:
- id is the stable key used by mentions and updates.
- canonical is the display and grouping term.
- variants are alternate literal forms used for detection.
- category is a lightweight historical grouping dimension.
- notes captures editorial or methodological rationale.
Seed terms are defined directly in server/store.mjs.
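
To make the role of canonical and variants more concrete, here is a minimal sketch of literal detection. It assumes case-insensitive substring matching over the canonical form and every variant; `findTermMatches` is a hypothetical helper, and the actual detection logic in the server may tokenize or normalize text differently.

```javascript
// Hypothetical helper: find literal occurrences of a term's canonical form
// and variants. Assumes lowercasing preserves string length, which holds for
// typical Latin-script text.
function findTermMatches(term, text) {
  const haystack = text.toLowerCase();
  const forms = [term.canonical, ...(term.variants ?? [])];
  const matches = [];
  for (const form of forms) {
    const needle = form.toLowerCase();
    if (!needle) continue; // skip empty forms to avoid an endless loop
    let index = haystack.indexOf(needle);
    while (index !== -1) {
      matches.push({
        termId: term.id,
        // Slice from the original text so the source casing is preserved.
        matchedText: text.slice(index, index + needle.length),
        index,
      });
      index = haystack.indexOf(needle, index + needle.length);
    }
  }
  // Report matches in reading order regardless of which form produced them.
  return matches.sort((a, b) => a.index - b.index);
}

const exampleTerm = {
  id: "cancer-breast",
  canonical: "cancer of the breast",
  variants: ["breast cancer", "cancerous breast"],
};
console.log(
  findTermMatches(exampleTerm, "A case of Breast Cancer, or cancer of the breast, was reported.")
);
```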
Documents are stored as an array in documents.json.
Shape:
```json
{
  "id": "rowley-1772",
  "title": "A Practical Treatise on Diseases of the Breast of Women",
  "shortTitle": "Rowley, 1772",
  "year": 1772,
  "place": "Londres",
  "language": "inglés",
  "recordType": "manual de partería",
  "sourceHost": "Internet Archive",
  "contributorName": "Seed corpus",
  "contributorRole": "sistema",
  "notes": "Manual sobre padecimientos mamarios...",
  "summary": "Manual sobre padecimientos mamarios...\n\n...",
  "textPath": "/absolute/path/to/public/raw/ocr/rowley-1772.txt",
  "originalFilePath": "",
  "sourceLinks": [],
  "createdAt": "2026-04-13T00:00:00.000Z",
  "reviewStatus": "seed"
}
```

For uploaded documents:
- textPath usually points into runtime-data/uploads/;
- originalFilePath may contain the stored uploaded binary;
- reviewStatus defaults to "nuevo".
When documents are returned through the API, the server adds:
- uploadTextUrl
- uploadFileUrl
These are URL-safe public paths for client access.
When analysis needs a document’s text:
- if textPath exists and the file is present, use that file;
- otherwise use summary;
- otherwise use notes;
- otherwise use an empty string.
This means a document can participate in analysis even when no OCR file exists, as long as at least summary text is available.
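
The fallback order above can be sketched as a single function. This is an illustration under assumptions: `resolveDocumentText` is a hypothetical name, and file access is injected as a `readTextFile` function (returning the contents, or null when the file is missing) so the sketch does not presume how the server actually reads files.

```javascript
// Hypothetical helper mirroring the documented fallback order:
// OCR/text file first, then summary, then notes, then an empty string.
function resolveDocumentText(doc, readTextFile) {
  if (doc.textPath) {
    const fileText = readTextFile(doc.textPath);
    // A present file wins even if it is empty; only a missing file falls through.
    if (fileText !== null && fileText !== undefined) return fileText;
  }
  return doc.summary || doc.notes || "";
}

// Usage with an in-memory stand-in for the file system.
const fakeFiles = new Map([["/ocr/rowley-1772.txt", "Full OCR text..."]]);
const readFromFake = (path) => (fakeFiles.has(path) ? fakeFiles.get(path) : null);

console.log(resolveDocumentText({ textPath: "/ocr/rowley-1772.txt", summary: "s" }, readFromFake));
console.log(resolveDocumentText({ textPath: "/ocr/missing.txt", summary: "Summary only" }, readFromFake));
console.log(resolveDocumentText({ notes: "Notes only" }, readFromFake));
```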
The following structures are not persisted as primary storage. They are derived at runtime:
- analysis.summary
- analysis.timeline
- analysis.topTerms
- analysis.cooccurrences
- analysis.mentions
- analysis.documentsWithMentions
- chunk vectors used for contextual similarity
These are recalculated in-process and cached in memory until invalidated.
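
A compute-once, invalidate-on-write cache of this kind might look like the following sketch. `createAnalysisCache` and its `get`/`invalidate` methods are hypothetical names chosen for illustration; the server's own caching code may be organized differently.

```javascript
// Hypothetical in-memory cache for derived analysis results.
// computeAnalysis is any function that rebuilds the derived structures.
function createAnalysisCache(computeAnalysis) {
  let valid = false;
  let cached;
  return {
    // Recompute only when nothing valid is cached.
    get() {
      if (!valid) {
        cached = computeAnalysis();
        valid = true;
      }
      return cached;
    },
    // Call after any write to terms or documents.
    invalidate() {
      valid = false;
      cached = undefined;
    },
  };
}

// Usage: the counter shows how often the expensive computation runs.
let computeCalls = 0;
const analysisCache = createAnalysisCache(() => ({ run: ++computeCalls }));
analysisCache.get();
analysisCache.get(); // served from memory, no recomputation
analysisCache.invalidate();
analysisCache.get(); // recomputed after invalidation
console.log(computeCalls);
```

Because nothing here is written to disk, a process restart simply starts with an empty cache, which matches the statement that these structures are not primary storage.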
Mentions are generated from chunk-level term matching.
Typical shape:
```json
{
  "id": "rowley-1772:chunk:0:cancer-breast:0",
  "documentId": "rowley-1772",
  "documentTitle": "Rowley, 1772",
  "year": 1772,
  "place": "Londres",
  "recordType": "manual de partería",
  "termId": "cancer-breast",
  "canonicalTerm": "cancer of the breast",
  "matchedText": "cancer",
  "snippet": "....",
  "chunkId": "rowley-1772:chunk:0"
}
```

Chunks are internal analysis structures built by splitting text into overlapping windows.
Current defaults in server/analysis.mjs:
- chunk size: 900 characters
- overlap: 220 characters
These are implementation details but materially affect similarity quality and performance.
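
As a rough illustration of overlapping windows with these defaults, the sketch below splits purely by character offset. `chunkText` is a hypothetical helper; the real splitter in server/analysis.mjs may respect word or sentence boundaries. The chunk id format follows the chunkId shown in the mention example.

```javascript
// Hypothetical helper: split text into fixed-size windows where each window
// starts (size - overlap) characters after the previous one.
function chunkText(documentId, text, size = 900, overlap = 220) {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("size must be positive and overlap must be in [0, size)");
  }
  const step = size - overlap; // 680 characters with the documented defaults
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    const index = chunks.length;
    chunks.push({
      id: `${documentId}:chunk:${index}`,
      start,
      text: text.slice(start, start + size),
    });
    // Stop once a window has reached the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}

// Usage: a 2000-character text yields windows starting at 0, 680, and 1360.
const sampleText = Array.from({ length: 2000 }, (_, i) =>
  String.fromCharCode(97 + (i % 26))
).join("");
console.log(chunkText("rowley-1772", sampleText).map((c) => [c.id, c.start, c.text.length]));
```

A larger overlap produces more chunks for the same text, which is one way these defaults trade similarity quality against processing cost.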
On first boot:
- terms.json is created from seed terms in server/store.mjs;
- documents.json is created from src/data/records.json;
- the uploads/ directory is created.
After those files exist:
- runtime JSON becomes authoritative;
- editing repository seed files does not retroactively rewrite runtime JSON;
- operators must explicitly migrate or regenerate runtime data if they want repository changes applied to an existing deployment.
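
The create-only-when-missing behaviour can be sketched as follows. Everything here is an assumption made for illustration: `seedRuntimeStore`, the `io` object (with `exists`, `write`, and `ensureDir`), and `makeMemoryIo` are hypothetical, and the seed documents are assumed to be already converted from repository records into the runtime document shape.

```javascript
// Hypothetical sketch: seed runtime files on first boot and never overwrite
// files that already exist. `io` abstracts storage so the example does not
// depend on a particular file API.
function seedRuntimeStore(io, paths, seeds) {
  const created = [];
  if (!io.exists(paths.terms)) {
    io.write(paths.terms, JSON.stringify(seeds.terms, null, 2));
    created.push(paths.terms);
  }
  if (!io.exists(paths.documents)) {
    io.write(paths.documents, JSON.stringify(seeds.documents, null, 2));
    created.push(paths.documents);
  }
  io.ensureDir(paths.uploads); // harmless if the directory already exists
  return created;
}

// In-memory stand-in for the file system, used only for this demonstration.
function makeMemoryIo() {
  const files = new Map();
  const dirs = new Set();
  return {
    files,
    dirs,
    exists: (path) => files.has(path),
    write: (path, value) => files.set(path, value),
    ensureDir: (path) => dirs.add(path),
  };
}

const demoIo = makeMemoryIo();
const demoPaths = {
  terms: "runtime-data/terms.json",
  documents: "runtime-data/documents.json",
  uploads: "runtime-data/uploads/",
};
// First boot creates both files; a later boot with changed seeds creates nothing.
console.log(seedRuntimeStore(demoIo, demoPaths, { terms: [{ id: "cancer-breast" }], documents: [] }));
console.log(seedRuntimeStore(demoIo, demoPaths, { terms: [], documents: [] }));
```

The second call leaves the stored terms untouched, which is why repository changes need an explicit migration or regeneration step on existing deployments.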
- runtime-data/ must be backed up.
- Restoring runtime-data/ restores live mutable state.
- Deleting runtime-data/ effectively resets the instance to seed state plus any repository OCR assets.
- Multiple writable instances that do not share storage will diverge immediately.
- formal schema validation for terms and documents;
- explicit migration scripts for runtime state;
- provenance and change history for editorial actions;
- richer review-state and moderation fields;
- import/export tooling for batch curation.