---
output: github_document
editor_options:
  chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  message = TRUE
)
library(gitpins)
stopifnot(!file.exists(here::here("gitpins")))
```
# gitpins
<!-- badges: start -->
[![R-CMD-check](https://github.com/torfason/gitpins/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/torfason/gitpins/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
`r desc::desc_get("Description")`
## The Problem
You want to quickly and easily process an online resource using R functions,
some of which only accept local files. Thus you would like the following
properties for your workflow:
* Download to a local file
* But avoid downloading on every single run
* Refresh your data regularly from the online source
* Use a local copy if the online resource is not accessible
* Have the local copy be easily accessible in a predictable location
* Not ruin your local copy if the online version should change in a "bad" way
## The Solution
The `gitpins` package downloads a URL to a local file in the `gitpins` folder
(by default `here::here("gitpins")`, configurable with `gp_options()`) and
returns the full path of the local file, which can be passed as an argument
to any function that expects to read such a file.
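
As a rough illustration of the fallback property listed under "The Problem",
the following base R sketch keeps an existing local copy whenever a download
fails. It is not the `gitpins` implementation: `fetch_with_fallback()` and its
`downloader` argument are hypothetical names introduced here, and
`utils::download.file()` is only a stand-in default.

```r
# Hypothetical helper, not part of gitpins: download `url` to `dest`,
# but keep the existing local copy if the download fails.
fetch_with_fallback <- function(url, dest,
                                downloader = function(url, destfile) {
                                  utils::download.file(url, destfile,
                                                       quiet = TRUE, mode = "wb")
                                }) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)

  # Download to a temporary file first, so a failed or partial download
  # never overwrites the local copy. Any error or warning counts as failure.
  ok <- tryCatch({
    downloader(url, tmp)
    file.exists(tmp)
  }, error = function(e) FALSE, warning = function(w) FALSE)

  if (ok) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    file.copy(tmp, dest, overwrite = TRUE)
  } else if (!file.exists(dest)) {
    stop("Download failed and no local copy exists: ", url)
  }
  dest
}
```

Downloading to a temporary file and copying only on success is one simple way
to make sure the local copy is never left half-written.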
## Installation
Install `gitpins` using `pak`:
```r
pak::pak("torfason/gitpins")
```
## Usage
```{r}
# Downloads on first try
pin("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/country_iso.csv") |>
  read.csv() |> head()
```
You can maintain as many resources as you need:
```{r}
# Another resource
pin("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.month.csv") |>
  read.csv() |> head()
```
The file is downloaded the first time you run `pin()` on a given URL (the
actual download is done with `curl::curl_download()`). On later calls, `pin()`
checks the age of the local file and re-downloads it if it is too old. The
default refresh interval is 12 hours, and it can be changed with the
`refresh_hours` parameter.

Note that the return value of the `pin()` function is simply the full path to
the local copy of the file. You can therefore use `pin()` with the original
URL wherever you would have used the local path of the resource. The exact name
of the file is constructed in a deterministic way based on the URL
(specifically, the base name is the `digest()` of the URL).
```{r}
# Uses a cached copy if a recent one is available (start of the url changed for privacy)
pin("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/country_iso.csv") |>
  gsub(pattern = ".*/(gitpins/.*)", replacement = "/home/user/project/\\1")
```
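
To illustrate what a deterministic, URL-derived file name means in practice,
here is a minimal base R sketch. It makes no claim about the exact hash
`gitpins` uses: `url_to_basename()` is a hypothetical helper, and it swaps in
an MD5 computed with `tools::md5sum()` for the `digest()` call mentioned above,
so the names it produces will not match the ones `gitpins` generates.

```r
# Hypothetical helper, not part of gitpins: map a URL to a stable base name.
url_to_basename <- function(url) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeBin(charToRaw(url), tmp)  # tools::md5sum() hashes files, not strings
  unname(tools::md5sum(tmp))     # 32-character hexadecimal string
}

a <- url_to_basename("https://example.com/data.csv")
b <- url_to_basename("https://example.com/data.csv")
identical(a, b)  # the same URL maps to the same name on every run
```

Because the name depends only on the URL, repeated runs (and different
scripts in the same project) resolve to the same local file.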
The refresh interval is configured with the `refresh_hours` parameter. Use
`refresh_hours = 0` to force a download on every call, and `refresh_hours = Inf`
to always use the local copy (after the first download). A helper function,
`gp_dropper()`, is provided for the case where a new version of the resource
"drops" at the same time every day. It lets you set a shorter interval for a
given time window after the expected drop time, to maximize the probability
that an updated version gets downloaded quickly.
```{r}
# Force a reload by specifying zero refresh time
pin("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/country_iso.csv",
    refresh_hours = 0) |>
  gsub(pattern = ".*/(gitpins/.*)", replacement = "/home/user/project/\\1")

# Always use local copy by specifying Inf refresh time
pin("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/country_iso.csv",
    refresh_hours = Inf) |>
  gsub(pattern = ".*/(gitpins/.*)", replacement = "/home/user/project/\\1")

# Set a lower interval for a given time window after a resource update "drops"
pin("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/country_iso.csv",
    refresh_hours = gp_dropper(drop_hour = 12, drop_tz = "US/Eastern")) |>
  gsub(pattern = ".*/(gitpins/.*)", replacement = "/home/user/project/\\1")
```
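
The refresh rule described above can be sketched in a few lines of base R.
This is only an illustration under stated assumptions, not the package's code:
`needs_refresh()` and `dropper_interval()` are hypothetical helpers, the sketch
assumes that age is measured from the file's modification time, and the
`window_hours`, `short_hours`, and `long_hours` arguments are invented for the
example and are not the arguments of `gp_dropper()`.

```r
# Hypothetical helper, not part of gitpins: is the cached file stale?
needs_refresh <- function(path, refresh_hours = 12, now = Sys.time()) {
  if (!file.exists(path)) return(TRUE)  # nothing cached yet
  age_hours <- as.numeric(difftime(now, file.mtime(path), units = "hours"))
  # refresh_hours = 0   -> always TRUE  (age is never negative)
  # refresh_hours = Inf -> always FALSE (no finite age reaches Inf)
  age_hours >= refresh_hours
}

# Hypothetical sketch of a "dropper"-style rule: use a short interval
# during a window after the daily drop hour, and a long one otherwise.
dropper_interval <- function(drop_hour, drop_tz, window_hours = 2,
                             short_hours = 0.25, long_hours = 12,
                             now = Sys.time()) {
  lt <- as.POSIXlt(now, tz = drop_tz)   # wall-clock time in the drop time zone
  hour_now <- lt$hour + lt$min / 60
  since_drop <- (hour_now - drop_hour) %% 24  # hours since the latest drop
  if (since_drop < window_hours) short_hours else long_hours
}
```

Treating `0` and `Inf` as ordinary numbers keeps the comparison free of
special cases, which is one reason a numeric interval is a convenient
interface for this kind of setting.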
The `gitpins` directory is actually a local `git` repository, and each new
version is committed to the repository. That way, a complete history of the
downloads is kept, but if the resource is not changing a lot, this history will
not take up an inordinate amount of space (because of the deduplication
properties of `git`).

If the resource gets borked, you can retrieve older versions using `git`. A
function, `gp_list()`, is provided to list available pins (with or without
history), but beyond that, the user is expected to use `git` directly for more
complex retrieval operations.
```{r}
withr::with_options(list(width = 130), {
  gp_list()
  gp_list(history = TRUE)
})
```
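
For retrieval with `git` itself, the sketch below wraps two standard commands,
`git log` and `git show`, in small R functions. The helpers `pin_history()` and
`pin_restore()` are hypothetical and not part of `gitpins`. The sketch also
assumes that the `git` command-line tool is installed and that the pinned file
sits at the root of the repository under the base name returned by `pin()`,
which may not match the actual layout.

```r
# Hypothetical helpers, not part of gitpins; both shell out to the git CLI.

# List the commits that touched `file` (newest first), one line per commit.
pin_history <- function(repo, file) {
  system2("git",
          c("-C", shQuote(repo), "log", "--oneline", "--", shQuote(file)),
          stdout = TRUE)
}

# Write the version of `file` stored in `commit` to `dest`,
# leaving the repository itself untouched.
pin_restore <- function(repo, commit, file, dest) {
  system2("git",
          c("-C", shQuote(repo), "show", shQuote(paste0(commit, ":", file))),
          stdout = dest)
  invisible(dest)
}
```

Writing the old version to a separate `dest` file, rather than checking it out
in place, avoids disturbing the state that `pin()` expects to find.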
## Function Name Conflicts
When using `gitpins` together with another package that also defines a `pin()`
function (such as the `pins` package), the `conflicted` package comes highly
recommended, but the `exclude` option of the `library()` function is also a
valid approach. In either case, the `gp_pin()` function is provided as an alias
for `pin()`, so you don't need to specify the full package name on each call:
### Using `conflicted`
```r
library(conflicted)
conflicts_prefer(pins::pin())
library(pins)
library(gitpins)
gp_pin(URL)
```
### Using `exclude`
```r
library(pins)
library(gitpins, exclude="pin")
gp_pin(URL)
```
## Related Packages, System Requirements, and Feedback
This package was inspired by the `pins` package, and in particular the
`pins::pin()` function. However, that function stores the actual local file in a
system location rather than inside the project, so using it did not prove
reliable. Furthermore, it did not have the desired versioning properties, and
finally, it is now defined as a legacy function and is not part of the new API
for that package. As a result, `gitpins` was born.

Note that `gitpins` uses the native pipe operator (`|>`) and so depends on
`R (>= 4.1.0)`.

For feature requests, bugs, or other feedback, feel free to file an issue.