Skip to content

Conversation

@kairstenfay
Copy link
Contributor

@kairstenfay kairstenfay commented Sep 17, 2020

Lock the cache file in python rather than relying on the user knowing to lock at runtime with shell's flock, etc.

  • Update backoffice cron jobs accordingly

Open questions:
How do we want to handle cache collision?
Right now if I run two REDCap DET ETL processes operating on the same cache, I sometimes get

_pickle.UnpicklingError: pickle data was truncated

Perhaps this means my implementation isn't working as expected?
Which command(s) do we want to use when creating a lock?

@kairstenfay kairstenfay requested a review from tsibley September 17, 2020 20:58
from smartystreets_python_sdk.us_street import Lookup
from smartystreets_python_sdk.us_extract import Lookup as ExtractLookup
from id3c.cli import cli
from id3c.cli.command import pickled_cache
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This edit can be dropped from this commit.

@kairstenfay
Copy link
Contributor Author

@kairstenfay kairstenfay changed the title Redcap cache Lock cache file with python Nov 17, 2020
LOG.info(f"Loading cache from «{filename}»")
try:
with open(filename, "rb") as file:
lock_file(file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment on the implementation of lock_file() yet, but I think the usage here isn't doing what you expect because it's not being called as a context manager. It should be called as:

with lock_file(file):
    …

@tsibley
Copy link
Member

tsibley commented Dec 21, 2020

I think ultimately we may want to switch the geocoding caching from our in-house solution to diskcache, which I used for the Husky Musher caching. When we first added the caching, I didn't know diskcache existed and, unlike cachetools, didn't uncover it during due diligence.

@joverlee521 asked in Slack:

What's the difference between the FanoutCache used for Musher and the TTLCache used for geocoding?

and I thought it'd be useful to copy my answer here as well:

There's a couple differences:

  1. TTLCache comes from cachetools, which implements lower-level cache primitives than FanoutCache from diskcache.

  2. The TTLCache we use is entirely in-memory during execution. On top of that, we layer our own serialization/deserialization to load/save the cache. In contrast, the FanoutCache (diskcache.Cache subclass) is transactionally-safe and is loaded/saved per access. FanoutCache is a further-specialized subclass of diskcache.Cache to avoid waiting on blocking writes.

  3. TTLCache requires a TTL (expiry). Expirations are optional with FanoutCache, and we don't use them.

I found diskcache last night when looking not to reinvent the wheel. It seems like a very nice package that's robust and well-thought out. We could replace our TTLCache usage with it to make the geocoding caching a lot simpler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants