Merged
2 changes: 2 additions & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -33,6 +33,8 @@ Suggests:
RSQLite (>= 1.0.0),
scales,
testthat (>= 3.0.0),
textrecipes,
tidymodels,
tidyverse
VignetteBuilder:
knitr
2 changes: 2 additions & 0 deletions NEWS.md
@@ -1,5 +1,7 @@
# nihexporter (development version)

* New `abstract_words` table containing tokenized words for project abstracts.

* New `projects_min` table, which contains a minimal subset of projects data from 2006-2024,
with both direct and indirect costs (2006 was the first year IC amounts were published).

7 changes: 7 additions & 0 deletions R/data.R
@@ -64,3 +64,10 @@
#'
#' @source Computed from \link{projects} table.
"project_io"

#' Tokenized words from abstracts.
#'
#' @format A tibble with five variables: `activity`, `fiscal_year`, `institute`, `word`, `n`.
#'
#' @source \url{https://reporter.nih.gov/exporter/abstracts}
"abstract_words"
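
Reader's note on the documented format: a minimal sketch of how a table with these five columns might be queried, here to find the most frequent word per fiscal year and institute. The tibble `abstract_words_toy` and all of its values are invented for illustration; they are not taken from the packaged data.

```r
library(dplyr)

# Hypothetical stand-in with the five documented columns.
# All values below are made up for this example.
abstract_words_toy <- tibble::tibble(
  activity    = c("R01", "R01", "R21", "R01"),
  fiscal_year = c(2020, 2020, 2020, 2021),
  institute   = c("CA", "CA", "GM", "CA"),
  word        = c("cancer", "cell", "protein", "cancer"),
  n           = c(120L, 95L, 40L, 150L)
)

# Keep the single highest-count word within each year/institute group.
top_words <- abstract_words_toy |>
  group_by(fiscal_year, institute) |>
  slice_max(order_by = n, n = 1, with_ties = FALSE) |>
  ungroup()

print(top_words)
```

With the package installed, the same pipeline should apply to `abstract_words` directly, assuming the column names match the `@format` line above.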
2 changes: 1 addition & 1 deletion README.Rmd
@@ -49,7 +49,7 @@ pak::pak("rnabioco/nihexporter")

* `project_io`: pre-computed `n.pubs`, `n.patents` and `project.cost` for each `project.num`.

-**Note:** [Abstracts](https://reporter.nih.gov/exporter/abstracts) from NIH EXPORTER are not provided as they significantly increase the size of the package.
+* `abstract_words`: tokenized words from [grant abstracts](https://reporter.nih.gov/exporter/abstracts).

## Functions

5 changes: 2 additions & 3 deletions README.md
@@ -52,9 +52,8 @@ time to download and install. ⚠️
 - `project_io`: pre-computed `n.pubs`, `n.patents` and `project.cost`
   for each `project.num`.

-**Note:** [Abstracts](https://reporter.nih.gov/exporter/abstracts) from
-NIH EXPORTER are not provided as they significantly increase the size of
-the package.
+- `abstract_words`: tokenized words from [grant
+  abstracts](https://reporter.nih.gov/exporter/abstracts).

## Functions

1 change: 1 addition & 0 deletions _pkgdown.yml
@@ -31,3 +31,4 @@ reference:
- publinks
- patents
- clinical_studies
- abstract_words
69 changes: 69 additions & 0 deletions data-raw/abstracts.R
@@ -0,0 +1,69 @@
# parse and tokenize abstracts

library(tidyverse)
library(tidytext)
library(here)

source("data-raw/common.R")

path <- here("data-raw/downloads/abstracts")

col_types <- cols_only(
  APPLICATION_ID = col_double(),
  ABSTRACT_TEXT = col_character(),
)

abstracts_raw_tbl <-
  load_tbl(path, col_types) |>
  left_join(projects, by = "application_id") |>
  select(activity, fiscal_year, institute, abstract_text) |>
  # extramural only
  filter(!str_detect(activity, "^Z")) |>
  na.omit() |>
  unique()

data(stop_words)
custom_stop_words <- tibble(
  word = c(
    # generic to abstracts
    "research",
    "specific",
    "studies",
    "aim",
    # meaningless annotations
    "description",
    "unreadable"
  )
)

tokenize_words <- function(df) {
  unnest_tokens(df, input = abstract_text, output = word) |>
    # remove words that are numbers
    filter(!str_detect(word, "^[0-9]*$")) |>
    anti_join(stop_words) |>
    anti_join(custom_stop_words) |>
    count(activity, fiscal_year, institute, word, sort = TRUE) |>
    filter(n >= 10)
}

df_splits <- group_by(abstracts_raw_tbl, fiscal_year, institute) |>
  group_split()

# df_splits <- df_splits[1:10]

library(furrr)
library(progressr)
plan(multisession, workers = 12)
with_progress({
  p <- progressor(steps = length(df_splits))

  abstract_words <- future_map_dfr(
    df_splits,
    ~{
      p()
      tokenize_words(.x)
    }
  )
})

usethis::use_data(abstract_words, compress = "xz", overwrite = TRUE)
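
Reader's note on the script above: a minimal sketch of the tokenize, filter, and count steps run on a two-row toy tibble, so the effect of each stage can be checked by hand. The names `toy` and `toy_words` and the abstract text are invented for illustration. The sketch also deviates from the script in three ways: it skips the custom stop-word list, it omits the `n >= 10` cutoff (which would empty a table this small), and it passes `by = "word"` to `anti_join()` only to silence the join message.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical input with the same columns the script selects.
toy <- tibble::tibble(
  activity = "R01",
  fiscal_year = 2020,
  institute = "GM",
  abstract_text = c(
    "RNA editing in 2020 regulates the splicing of RNA.",
    "The splicing of RNA is the aim of this research."
  )
)

toy_words <- toy |>
  # one row per lowercased word
  unnest_tokens(input = abstract_text, output = word) |>
  # drop tokens made up only of digits
  filter(!str_detect(word, "^[0-9]+$")) |>
  # drop common English stop words shipped with tidytext
  anti_join(stop_words, by = "word") |>
  count(activity, fiscal_year, institute, word, sort = TRUE)

print(toy_words)
```

Because `count()` groups by `activity`, `fiscal_year`, and `institute` as well as `word`, running it once per split, as the script does, should give the same counts as running it on the full table; the split appears to exist for parallelism rather than to change the result.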
Binary file added data/abstract_words.rda
Binary file not shown.
19 changes: 19 additions & 0 deletions man/abstract_words.Rd