Use lib360dataquality for validation
#65
|
This is great, thanks! No problems with the code. It does raise one issue: each invocation of the validation will download the schema and codelists from GitHub instead of using the locally held ones. I think there is a way around this, but it needs a bit of thinking. |
|
So I'm thinking we could modify lib360dataquality to take the file paths that are cached in the datagetter (https://github.com/ThreeSixtyGiving/datagetter/blob/tw/lib360dataquality/getter/get.py#L394) and pass them to `__init__`, e.g.:

diff --git a/lib360dataquality/cove/schema.py b/lib360dataquality/cove/schema.py
index e560618..5ea54c8 100644
--- a/lib360dataquality/cove/schema.py
+++ b/lib360dataquality/cove/schema.py
@@ -28,7 +28,11 @@ class Schema360(SchemaJsonMixin):
     _pkg_schema_obj = {}
     _schema_obj = {}
-    def __init__(self, data_dir) -> None:
+    def __init__(self, data_dir, local_pkg_schema_path=None, local_grant_schema_path=None) -> None:
+        """ data_dir: The directory with the data
+        local_pkg_schema_path: If specified will use the path provided instead of fetching from remote
+        local_grant_schema_path: If specified will use the path provided instead of fetching from remote
+        """
         # Create dedicated location for schema work
         self.working_dir = os.path.join(data_dir, "schema")
         try:
@@ -36,6 +40,17 @@ class Schema360(SchemaJsonMixin):
         except FileExistsError:
             pass
+        if local_pkg_schema_path:
+            # copy schema to working dir
+            # setup self._pkg_schema_obj
+
+        if local_grant_schema_path:
+            # copy schema to working dir
+            # setup self._schema_obj
+
+        # Don't do the remote fetching stuff if local^
+
+
         # required by lib-cove for CustomRefResolver the trailing / is needed to make sure
         # urljoin does not discard the final part of the location.
         self.schema_host = f"{self.working_dir}/"

I'm also wondering if there is more scope to de-duplicate some code, as lib360dataquality also has the convert/unflatten functions (e.g. from xlsx -> json). That's not a priority here, but it might be useful context. |
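The proposal above (prefer a cached local schema file, fall back to a remote fetch only when no path is given) could be sketched roughly as follows. This is an illustration only: `LocalFirstSchema`, `_load`, `_fetch_remote` and `fetched_remotely` are hypothetical names, the remote fetch is a stub, and none of this is the actual `Schema360` implementation.

```python
import json
import os
import shutil
import tempfile


class LocalFirstSchema:
    """Hypothetical stand-in for Schema360, for illustration only."""

    def __init__(self, data_dir, local_pkg_schema_path=None, local_grant_schema_path=None):
        # Dedicated location for schema work, as in the diff above.
        self.working_dir = os.path.join(data_dir, "schema")
        os.makedirs(self.working_dir, exist_ok=True)
        # Trailing slash kept so urljoin does not drop the last path segment.
        self.schema_host = f"{self.working_dir}/"
        # Records which schemas needed a remote fetch (assumption, for the demo).
        self.fetched_remotely = []

        self._pkg_schema_obj = self._load("package", local_pkg_schema_path)
        self._schema_obj = self._load("grant", local_grant_schema_path)

    def _load(self, name, local_path):
        if local_path:
            # Copy the cached file into the working dir and parse it.
            dest = os.path.join(self.working_dir, os.path.basename(local_path))
            shutil.copyfile(local_path, dest)
            with open(dest, encoding="utf-8") as f:
                return json.load(f)
        # No local copy supplied: fall back to the remote fetch.
        self.fetched_remotely.append(name)
        return self._fetch_remote(name)

    def _fetch_remote(self, name):
        # Stub: the real code would download the schema here.
        return {}


# Usage: only the grant schema is cached locally, so only "package" is fetched.
with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "cached-grant-schema.json")
    with open(cached, "w", encoding="utf-8") as f:
        json.dump({"title": "grant"}, f)
    schema = LocalFirstSchema(os.path.join(tmp, "data"), local_grant_schema_path=cached)
    print(schema.fetched_remotely)
```

Keeping the local/remote decision inside one helper means the two schemas cannot drift apart in behaviour, which is the same de-duplication concern raised in the comment.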
|
For testing and debugging the requests, I was doing:

diff --git a/getter/get.py b/getter/get.py
index b8c3560..87b815e 100644
--- a/getter/get.py
+++ b/getter/get.py
@@ -18,6 +18,26 @@ from lib360dataquality.cove.schema import Schema360
from lib360dataquality.cove.threesixtygiving import common_checks_360
import getter.cache as cache
+import logging
+
+# These two lines enable debugging at httplib level (requests->urllib3->http.client)
+# You will see the REQUEST, including HEADERS and DATA, and RESPONSE with HEADERS but without DATA.
+# The only thing missing will be the response.body which is not logged.
+try:
+    import http.client as http_client
+except ImportError:
+    # Python 2
+    import httplib as http_client
+http_client.HTTPConnection.debuglevel = 1
+
+# You must initialize logging, otherwise you'll not see debug output.
+logging.basicConfig()
+logging.getLogger().setLevel(logging.DEBUG)
+requests_log = logging.getLogger("requests.packages.urllib3")
+requests_log.setLevel(logging.DEBUG)
+requests_log.propagate = True
+
+
acceptable_licenses = [
"http://www.opendefinition.org/licenses/odc-pddl",
"https://creativecommons.org/publicdomain/zero/1.0/", |
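A Python 3-only variant of the debugging snippet above, wrapped in functions so it can be switched on and off around a single run, might look like the sketch below. The function names are hypothetical. It also assumes a `requests` release that no longer vendors urllib3, in which case the logger is named `urllib3` rather than `requests.packages.urllib3`; check this against the installed version.

```python
import http.client
import logging


def enable_http_debug_logging():
    """Turn on wire-level debugging for requests -> urllib3 -> http.client."""
    # http.client prints request/response headers to stdout when debuglevel > 0.
    http.client.HTTPConnection.debuglevel = 1

    # Logging must be initialised, otherwise urllib3's debug records are dropped.
    logging.basicConfig()
    logging.getLogger().setLevel(logging.DEBUG)

    # Assumption: urllib3 is not vendored, so its logger is simply "urllib3".
    urllib3_log = logging.getLogger("urllib3")
    urllib3_log.setLevel(logging.DEBUG)
    urllib3_log.propagate = True
    return urllib3_log


def disable_http_debug_logging():
    """Undo enable_http_debug_logging so normal runs stay quiet."""
    http.client.HTTPConnection.debuglevel = 0
    logging.getLogger().setLevel(logging.WARNING)
    logging.getLogger("urllib3").setLevel(logging.NOTSET)
```

Calling `enable_http_debug_logging()` before a validation run makes it easy to count how many times the schema and codelists are requested from the remote host.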
michaelwood
left a comment
We need to sort out the issue in lib360dataquality
|
@michaelwood just to check whether I understand, is the problem that on initialising |
|
@tillywoodfield Yes that's correct. It would be ideal to only initialise |
|
Just having a go at also using convert_spreadsheet from lib360dataquality to further reduce the duplicated code. Also updated the implementation for validate to use the work in ThreeSixtyGiving/dataquality#148 |
|
Note to self: the path isn't being set correctly: Also need to make sure that tests do not use caching |
This simplifies the flow of `get` by removing options branches that cause the code to be difficult to read.
This will be used when testing to make sure any caches are removed between test runs.
This reduces any code duplication between the DQT and the datagetter which may result in differing behaviours. This also adds support for extensions.
073b3ec to
44ff207
Compare
R2ZER0
left a comment
I'm not especially familiar with this code yet, but overall it looks good 👍
|
I'm going to close this PR as part of my wrapping up but the branch is still there if you want to open another one :) |
Fixes #57