Skip to content

Conversation

@JedGrabman
Copy link
Contributor

@JedGrabman JedGrabman commented Nov 12, 2020

Description:
Improve efficiency of covidcast_days by requesting multiple days of data when it can be determined that it is safe to do so.

Change Log:

  • covidcast.R
    • Update covidcast_days to query multiple days at once
    • Create max_geo_values to find upper bounds on number of results returned
    • Create warning messages for missing dates or missing geo_values
  • covidcast.R
    • Add tests for added batching functionality
    • Add basic documention for the use of mock and stub

max_locations <- meta_info[meta_info$data_source == data_source &
meta_info$signal == signal &
meta_info$geo_type == geo_type, ]$num_locations
if (length(max_locations) == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment about

  • why the max locations would be empty and
  • why we default to number of geo values

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's a pre-emptive fix for a similar effect of #270, but yeah, a comment would be good here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, it's a defensive measure against signals that aren't documented in covidcast_meta and using the max as an upper bound when it can't be found. Added a comment.

} else {
nissues <- 1
}
max_days_at_time <- floor(3649 / (ngeos * nissues))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This number would be better refactored into a file-level constant who variable name is in all caps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (max_days_at_time == 0) {
max_days_at_time <- 1
}
batch_days <- ceiling(ndays / max_days_at_time)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually just the number of batches? If so, would prefer to see this variable name changed to reflect that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

geo_value = "*",
as_of = NULL,
issues = NULL,
lag = NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be indented more

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@sgsmob
Copy link
Contributor

sgsmob commented Nov 12, 2020

Please add a description of this PR.

@capnrefsmmat
Copy link
Contributor

This will also run into conflicts with #271, mainly just in the unit tests (should be easy to deal with). We should plan the order of operations here when we merge.

(but thanks -- this should be a great benefit to download speed)

@JedGrabman JedGrabman requested a review from sgsmob November 13, 2020 19:22
Copy link
Contributor

@sgsmob sgsmob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the minor newline nit, but otherwise this is good to go.

),
regexp = NA)
expect_called(m, 2)
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a newline here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@capnrefsmmat capnrefsmmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thanks. Let's not merge yet -- I'd like to merge into r-pkg-devel instead of main, so let me adjust and get that working, and in the mean time you can fix a couple small nits

api_msg = dat[[i]]$message,
class = "covidcast_missing_geo_values"
)
msg = dat[[i]]$message,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to go back to being called api_msg; maybe this changed back when you rebased on #252?

Suggested change
msg = dat[[i]]$message,
api_msg = dat[[i]]$message,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some merge conflicts early on, seems likely I grabbed the wrong line there. Fixed.

grDevices,
httr,
jsonlite,
lubridate,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lubridate is only used in the tests, right? I think that means it can go in Suggests instead of Imports

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ymd is a lubricate function which I used a few times in covidcast.R

Various small code fixes

- Add comments
- Fix indenting
- Factor out constant (MAX_RESULTS)
- covidcast_days now takes max_geos as an optional parameter instead of determining it internally
- max_geo_values is now specific_meta, for retrieving more than just num_locations.
@JedGrabman JedGrabman force-pushed the dev-covidcast-efficiency branch from b1e64d3 to 2eab488 Compare November 20, 2020 19:05
@capnrefsmmat capnrefsmmat changed the base branch from main to r-pkg-devel November 23, 2020 22:04
@capnrefsmmat capnrefsmmat merged commit 3de3b81 into cmu-delphi:r-pkg-devel Nov 23, 2020
@capnrefsmmat
Copy link
Contributor

Well, that was satisfying to test out:

> foo = covidcast_signal("fb-survey", "smoothed_whh_cmnty_cli", start_day="2020-10-01", end_day="2020-11-20", geo_type="state")
Fetched day 2020-10-01 to 2020-11-20: 1, success, num_entries = 2601

That would have been 50 separate API calls before, so this is very nice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

covidcast::covidcast_signal - substantial speed improvement possible

5 participants