dwrpubs builds a curated inventory of peer-reviewed publications that the California Department of Water Resources supported through funding or authorship. The package pulls together manual DOI curation, Crossref and OpenAlex harvesting, metadata normalization, and discipline tagging so staff can publish consistent peer-reviewed manuscript inventories for reports, dashboards, and other communications. Large language models (LLMs) are used to standardize and classify unstructured text.
- Inputs: DOI lists, Crossref metadata API results, OpenAlex metadata API results, and staff rosters for attribution.
- Processing: normalization pipelines for authors, affiliations, and contribution flags plus user-guided overrides.
- Classification: LLM-powered discipline tagging against a user-supplied taxonomy.
- Outputs: ready-to-use datasets and a human-readable, denormalized inventory CSV file.
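
Combining a manually curated DOI list with Crossref and OpenAlex results requires the same article to carry the same key in every source. OpenAlex reports DOIs as full `https://doi.org/` URLs, while curated lists often hold bare DOIs. A minimal sketch of one way to reconcile them follows; `normalize_doi()` is a hypothetical helper rather than a function exported by this package, and the DOIs are illustrative.

```r
# Hypothetical helper (not part of dwrpubs): reduce DOI strings to a bare,
# lower-case form so the same article matches across sources.
# DOIs are case-insensitive, so lower-casing is safe for matching.
normalize_doi <- function(x) {
  x <- tolower(trimws(x))
  # Drop a resolver prefix such as https://doi.org/ or http://dx.doi.org/
  x <- sub("^https?://(dx\\.)?doi\\.org/", "", x)
  # Drop a leading "doi:" label if present
  sub("^doi:[[:space:]]*", "", x)
}

# Illustrative inputs: the first two refer to the same article.
dois <- c(
  "https://doi.org/10.1000/ABC123",
  "10.1000/abc123",
  "doi:10.1000/XYZ"
)

deduplicated <- unique(normalize_doi(dois))
```

Normalizing before deduplication keeps a manually curated entry and a harvested entry from being counted as two publications.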
Forthcoming.
Forthcoming.
- Affiliation cleanup: `data-raw/all_metadata.R` optionally calls LLMs to canonicalize institution names pulled from Crossref and OpenAlex, falling back to cached lookups when API credentials are unavailable or the user does not wish to generate a new lookup object.
- Discipline tagging: `data-raw/classified_inventory.R` batches article titles and abstracts through Gemini, seeding the prompt with the user-maintained taxonomy (`data/disciplines_taxonomy.rda`) so that classifications adhere to that controlled vocabulary, a lightweight RAG pattern.
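
The two behaviors above, falling back to a cached lookup and constraining classifications to a taxonomy, can be sketched in base R. This is an illustration under stated assumptions, not the code in the `data-raw/` scripts: the function names, the `GEMINI_API_KEY` environment variable name, and the four taxonomy labels are all hypothetical, and no model is actually called.

```r
# Hypothetical helper: use the cached lookup unless the user asks for a
# refresh AND an API key is available. The variable name is an assumption.
use_cached_lookup <- function(refresh = FALSE,
                              api_key = Sys.getenv("GEMINI_API_KEY")) {
  !refresh || !nzchar(api_key)
}

# Hypothetical labels; the real taxonomy is stored in
# data/disciplines_taxonomy.rda and its structure is not shown here.
taxonomy <- c("Hydrology", "Water Quality", "Ecology", "Climate Science")

# Build a prompt that lists the allowed labels so the model picks from them.
build_prompt <- function(title, abstract, taxonomy) {
  paste0(
    "Classify the article into exactly one of these disciplines:\n",
    paste0("- ", taxonomy, collapse = "\n"),
    "\n\nTitle: ", title,
    "\nAbstract: ", abstract,
    "\n\nAnswer with the discipline name only."
  )
}

# Map a raw model reply onto the controlled vocabulary. Matching ignores
# case and surrounding whitespace; anything else becomes NA for review.
validate_label <- function(reply, taxonomy) {
  idx <- match(tolower(trimws(reply)), tolower(taxonomy))
  taxonomy[idx]
}

prompt <- build_prompt(
  "Delta outflow and salinity",
  "Example abstract text.",
  taxonomy
)

# Simulated replies stand in for model output.
accepted <- validate_label(" water quality\n", taxonomy)
rejected <- validate_label("Astrophysics", taxonomy)
```

Validating replies after the call matters because LLM output is not guaranteed to stay within the list supplied in the prompt; returning `NA` for unmatched replies leaves those records for a manual override.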
- Add quick-start instructions to this README.
- Document the use and expansion of manual overrides in automated processing scripts.
- Flesh out `vignettes/generate_inventory.qmd` with an end-to-end workflow walkthrough.
- Add package-level documentation that introduces the data objects and exported helpers.
- Create automated tests for package functions.
- Wrap the `data-raw/` scripts in a reproducible refresh pipeline that handles API credentials.
- Configure CI to run `R CMD check` and guard the data and package refresh process.
- Add README on LLMs for unstructured text parsing and classification, including commentary on non-deterministic behavior.
- Create a visualization dashboard, static figures, and other communications products.
Released under the MIT License.
