Skip to content

Conversation

@tillywoodfield
Copy link
Contributor

Fixes #57

@michaelwood
Copy link
Member

This is great! Thanks, No problems with the code.. There is one issue that it's raised which is that each invocation of the validation will download the schema and codelists from github instead of the locally held ones. I think there is a way around this but needs a bit of thinking

@michaelwood
Copy link
Member

So I'm thinking in lib360dataquality we could modify it to take the file paths that are cached in the datagetter https://github.com/ThreeSixtyGiving/datagetter/blob/tw/lib360dataquality/getter/get.py#L394

and pass them to init of Schema360

e.g.

diff --git a/lib360dataquality/cove/schema.py b/lib360dataquality/cove/schema.py
index e560618..5ea54c8 100644
--- a/lib360dataquality/cove/schema.py
+++ b/lib360dataquality/cove/schema.py
@@ -28,7 +28,11 @@ class Schema360(SchemaJsonMixin):
     _pkg_schema_obj = {}
     _schema_obj = {}
 
-    def __init__(self, data_dir) -> None:
+    def __init__(self, data_dir, local_pkg_schema_path=None, local_grant_schema_path=None) -> None:
+        """ data_dir: The directory with the data
+            local_pkg_schema: If specified will use the path provided instead of fetching from remote
+            local_grant_schema_path: If specified will use the path provided instead of fetching from remote
+        """
         # Create dedicated location for schema work
         self.working_dir = os.path.join(data_dir, "schema")
         try:
@@ -36,6 +40,17 @@ class Schema360(SchemaJsonMixin):
         except FileExistsError:
             pass
 
+        if local_pkg_schema_path:
+            # copy schema to working dir
+            # setup self._pkg_schema_obj
+
+        if local_grant_schema_path:
+            # copy schema to woring dir
+            # setup self._schema_obj 
+
+        # Don't do the remote fetching stuff if local^
+
+
         # required by lib-cove for CustomRefResolver the trailing / is needed to make sure
         # urljoin does not discard the final part of the location.
         self.schema_host = f"{self.working_dir}/"

I'm also wondering if there is more scope to de-duplicate some code as lib360dataquality also has the convert/unflatten functions (e.g. from xlsx -> json), though that's not a priority here it might be useful context.

@michaelwood
Copy link
Member

For testing debugging of the requests I was doing:

diff --git a/getter/get.py b/getter/get.py
index b8c3560..87b815e 100644
--- a/getter/get.py
+++ b/getter/get.py
@@ -18,6 +18,26 @@ from lib360dataquality.cove.schema import Schema360
 from lib360dataquality.cove.threesixtygiving import common_checks_360
 import getter.cache as cache
 
+import logging
+
+# These two lines enable debugging at httplib level (requests->urllib3->http.client)
+# You will see the REQUEST, including HEADERS and DATA, and RESPONSE with HEADERS but without DATA.
+# The only thing missing will be the response.body which is not logged.
+try:
+    import http.client as http_client
+except ImportError:
+    # Python 2
+    import httplib as http_client
+http_client.HTTPConnection.debuglevel = 1
+
+# You must initialize logging, otherwise you'll not see debug output.
+logging.basicConfig()
+logging.getLogger().setLevel(logging.DEBUG)
+requests_log = logging.getLogger("requests.packages.urllib3")
+requests_log.setLevel(logging.DEBUG)
+requests_log.propagate = True
+
+
 acceptable_licenses = [
     "http://www.opendefinition.org/licenses/odc-pddl",
     "https://creativecommons.org/publicdomain/zero/1.0/",

Copy link
Member

@michaelwood michaelwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to sort out the issue in lib360dataquality

@tillywoodfield
Copy link
Contributor Author

@michaelwood just to check whether I understand, is the problem that on initialising Schema360 it will download the schema, and therefore will do this on every call of validate()? In which case, could this also be solved by refactoring the code in datagetter to only instantiate Schema360 once?

@michaelwood
Copy link
Member

@tillywoodfield Yes that's correct. It would be ideal to only initialise Schema360 once however each one has a working directory, if any schema extensions are present the working dir is the place the schema extensions are "resolved" or "patched" so there is a need for each file to have it's own working directory.

@michaelwood
Copy link
Member

Just having a go at also using convert_spreadsheet from lib360dataquality to further reduce the duplicated code. Also updated the implementation for validate to use the work in ThreeSixtyGiving/dataquality#148

@michaelwood
Copy link
Member

note for self the path isn't being set correctly:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/jsonref.py", line 130, in callback
    base_doc = self.loader(uri)
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/jsonref.py", line 255, in jsonloader
    with urlopen(uri) as content:
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1489, in file_open
    return self.open_local_file(req)
  File "/usr/lib/python3.8/urllib/request.py", line 1528, in open_local_file
    raise URLError(exp)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory: '/home/michael/dev/ods/threesixtygiving/datagetter/data/validation/a00P400000IplddIAB/schema/data/validation/a00P400000IplddIAB/schema/360-giving-schema.json'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/libcove/lib/exceptions.py", line 28, in wrapper
    return func(request, *args, **kwargs)
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/libcove/lib/converters.py", line 126, in convert_spreadsheet
    flattentool.unflatten(input_name, **flattentool_options)
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/flattentool/__init__.py", line 288, in unflatten
    parser.parse()
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/flattentool/schema.py", line 192, in parse
    for field, title in fields:
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/flattentool/schema.py", line 309, in parse_schema_dict
    type_set = get_property_type_set(property_schema_dict["items"])
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/flattentool/schema.py", line 30, in get_property_type_set
    property_type = property_schema_dict.get("type", [])
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/proxytypes.py", line 162, in __getattribute__
    return getattr(self.__subject__, attr)
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/proxytypes.py", line 163, in __getattribute__
    return _oga(self, attr)
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/proxytypes.py", line 121, in wrapper
    return method(self, *args, **kwargs)
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/proxytypes.py", line 243, in __subject__
    self.cache = super(LazyProxy, self).__subject__
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/proxytypes.py", line 121, in wrapper
    return method(self, *args, **kwargs)
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/proxytypes.py", line 227, in __subject__
    return self.callback()
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/proxytypes.py", line 121, in wrapper
    return method(self, *args, **kwargs)
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/jsonref.py", line 132, in callback
    raise self._error(
jsonref.JsonRefError: Error while resolving `file:///home/michael/dev/ods/threesixtygiving/datagetter/data/validation/a00P400000IplddIAB/schema/data/validation/a00P400000IplddIAB/schema/360-giving-schema.json`: URLError: <urlopen error [Errno 2] No such file or directory: '/home/michael/dev/ods/threesixtygiving/datagetter/data/validation/a00P400000IplddIAB/schema/data/validation/a00P400000IplddIAB/schema/360-giving-schema.json'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/michael/dev/ods/threesixtygiving/datagetter/getter/get.py", line 303, in fetch_and_convert
    convert_spreadsheet_file(
  File "/home/michael/dev/ods/threesixtygiving/datagetter/getter/get.py", line 133, in convert_spreadsheet_file
    convert_spreadsheet(
  File "/home/michael/dev/ods/threesixtygiving/datagetter/.ve/lib/python3.8/site-packages/libcove/lib/exceptions.py", line 31, in wrapper
    raise CoveInputDataError(wrapped_err=err)
libcove.lib.exceptions.CoveInputDataError

Also need to make sure that tests do not use caching

This simplifies the flow of `get` by removing options branches that cause
the code to be difficult to read.
This will be used when testing to make sure any caches are removed
between test runs.
This reduces any code duplication between the DQT and the datagetter
which may result in differing behaviours.
This also adds support for extensions.
@michaelwood michaelwood force-pushed the tw/lib360dataquality branch from 073b3ec to 44ff207 Compare April 17, 2025 15:01
@michaelwood michaelwood marked this pull request as ready for review April 17, 2025 15:09
@michaelwood michaelwood requested review from R2ZER0 and michaelwood and removed request for michaelwood April 17, 2025 15:09
Copy link
Contributor

@R2ZER0 R2ZER0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not especially familiar this code yet but overall looks good 👍

@tillywoodfield
Copy link
Contributor Author

I'm going to close this PR as part of my wrapping up but the branch is still there if you want to open another one :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Swap out basic json schema validation with lib360dataquality

4 participants