Skip to content

Conversation

@SophiaPerzan-DG
Copy link
Collaborator

@SophiaPerzan-DG SophiaPerzan-DG commented Jun 23, 2025

Fixes #171

If the datamanager failed when trying to resolve the path, the stack trace would get swallowed and the code would attempt to load the non-existent .cfg in order to parse it. This would then fail since the file doesn't exist, and print the misleading error message.

Created a new Error class for cases when data manager tries to find a file and the file does not exist. C++ backend uses this error type to determine which error message to print (and stack trace in the case of data manager failure)

Old behaviour:

Invalid genome

gk.Genome("hg38999")
Traceback (most recent call last):
File "", line 1, in
File "/Users/sophiaperzan/Desktop/dev/GenomeKit/genome_kit/genome.py", line 117, in new
genome = _cxx.Genome.new(cls, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: src/refg.cpp:56: Assembly is missing its chrom.sizes or annotation is missing its cfg file: hg38999
src/file.cpp:147: Could not open hg38999.cfg for reading (No such file or directory).

Valid genome, but error occurs within data manager when trying to get the file This example shows what might happen if you have no GCS credentials for the GCSDataManager

gk.Genome("hg38")
Traceback (most recent call last):
File "", line 1, in
File "/Users/sophiaperzan/Desktop/dev/GenomeKit/genome_kit/genome.py", line 117, in new
genome = _cxx.Genome.new(cls, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: src/refg.cpp:56: Assembly is missing its chrom.sizes or annotation is missing its cfg file: hg38
src/file.cpp:147: Could not open hg38.cfg for reading (No such file or directory).

New behaviour:

Invalid genome

gk.Genome("hg3999")
Traceback (most recent call last):
File "", line 1, in
File "/Users/sophiaperzan/Desktop/dev/GenomeKit/genome_kit/genome.py", line 113, in new
genome = _cxx.Genome.new(cls, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: src/refg.cpp:45: Assembly is missing its chrom.sizes or annotation is missing its cfg file: hg3999

Valid genome, but error occurs within data manager when trying to get the file This example shows what might happen if you have no GCS credentials for the GCSDataManager

gk.Genome("hg38")
Traceback (most recent call last):
File "", line 1, in
File "/Users/sophiaperzan/Desktop/dev/GenomeKit/genome_kit/genome.py", line 113, in new
genome = _cxx.Genome.new(cls, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: src/py_init.cpp:78: Traceback (most recent call last):
File "/Users/sophiaperzan/Desktop/dev/GenomeKit/genome_kit/gk_data.py", line 101, in resolve_datafile_path
path = get_file(rel_path)
^^^^^^^^^^^^^^^^^^
File "/Users/sophiaperzan/Desktop/dev/GenomeKit/genome_kit/gk_data.py", line 67, in get_file
return data_manager.get_file(filename)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sophiaperzan/Desktop/dev/GenomeKit/genome_kit/data_manager.py", line 212, in get_file
blob = self.bucket.blob(filename)
^^^^^^^^^^^
File "/Users/sophiaperzan/Desktop/dev/GenomeKit/genome_kit/data_manager.py", line 195, in bucket
gcloud_client = storage.Client()
^^^^^^^^^^^^^^^^
File "/Users/sophiaperzan/conda/envs/genomekit_test/lib/python3.12/site-packages/google/cloud/storage/client.py", line 247, in init
super(Client, self).init(
File "/Users/sophiaperzan/conda/envs/genomekit_test/lib/python3.12/site-packages/google/cloud/client/init.py", line 338, in init
_ClientProjectMixin.init(self, project=project, credentials=credentials)
File "/Users/sophiaperzan/conda/envs/genomekit_test/lib/python3.12/site-packages/google/cloud/client/init.py", line 286, in init
project = self._determine_default(project)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sophiaperzan/conda/envs/genomekit_test/lib/python3.12/site-packages/google/cloud/client/init.py", line 305, in _determine_default
return _determine_default_project(project)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sophiaperzan/conda/envs/genomekit_test/lib/python3.12/site-packages/google/cloud/_helpers/init.py", line 152, in _determine_default_project
_, project = google.auth.default()
^^^^^^^^^^^^^^^^^^^^^
File "/Users/sophiaperzan/conda/envs/genomekit_test/lib/python3.12/site-packages/google/auth/_default.py", line 685, in default
raise exceptions.DefaultCredentialsError(_CLOUD_SDK_MISSING_CREDENTIALS)
google.auth.exceptions.DefaultCredentialsError: Your default credentials were not found. To set up Application Default Credentials, see https://cloud.google.com/docs/authentication/external/set-up-adc for more information.

@SophiaPerzan-DG SophiaPerzan-DG requested review from Copilot and ovesh June 23, 2025 20:10
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

The PR updates error handling in refg_registry_t::as_refg to ensure resolution failures in the datamanager surface instead of being silently swallowed.

  • Replace empty catch with GK_RETHROW to propagate path-resolution errors.
  • Enhance the thrown error message with context about the failed annotation lookup.

@SophiaPerzan-DG
Copy link
Collaborator Author

Closing temporarily after discussion with @ovesh. Will close the issue via a different approach.

"Please upload it using `genome_kit.gk_data.upload_file`."
)
self.filename = filename
super().__init__(self.message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary?

Copy link
Collaborator Author

@SophiaPerzan-DG SophiaPerzan-DG Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary, but it alters the exception message printing slightly.
With call to super().__init__

genome_kit.data_manager.GKDataFileNotFoundError: hg3814123213.cfg not found!

Without

genome_kit.data_manager.GKDataFileNotFoundError: ('hg3814123213.cfg', 'hg3814123213.cfg not found!')

I'd say with is slightly more in-line with python exception printing, but it doesn't ultimately matter IMO. I can remove if you'd like.

Copy link
Collaborator

@s22chan s22chan Sep 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the question is why don't we just throw the IOError?

ex:

GK_CHECK(result == 0, file, "Error seeking to position {} in file ({}).", offset, strerror(errno));

if we do want to be more fine-grained, it'd be better to update the mapping and use the existing mechanisms:

GenomeKit/src/py_util.h

Lines 98 to 108 in c073879

GKPY_CATCH_CASE(gk::assertion_error, PyExc_AssertionError, return_stmt) \
GKPY_CATCH_CASE(gk::file_error, PyExc_OSError, return_stmt) \
GKPY_CATCH_CASE(gk::type_error, PyExc_TypeError, return_stmt) \
GKPY_CATCH_CASE(gk::value_error, PyExc_ValueError, return_stmt) \
GKPY_CATCH_CASE(gk::index_error, PyExc_IndexError, return_stmt) \
GKPY_CATCH_CASE(gk::key_error, PyExc_KeyError, return_stmt) \
GKPY_CATCH_CASE(gk::memory_error, PyExc_MemoryError, return_stmt) \
GKPY_CATCH_CASE(gk::not_implemented_error, PyExc_NotImplementedError, return_stmt) \
GKPY_CATCH_CASE(gk::unreachable_code_error, PyExc_RuntimeError, return_stmt) \
GKPY_CATCH_CASE(gk::runtime_error, PyExc_RuntimeError, return_stmt) \
GKPY_CATCH_CASE(std::exception, PyExc_RuntimeError, return_stmt) \


Actually I'm having trouble understanding what's wrong with just raising FileNotFoundError and doing a rethrow for whatever we get

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this response meant for this thread? I'm not seeing the connection to calling super().__init__ for the exception. If you meant it in the sense of 'why are we defining a new exception for the data manager to throw, rather than just have it throw FileNotFound?', then I suppose that could be done. I chose to define a new exception type to differentiate it from FileNotFound in that GKDataFileNotFound signifies the data manager determined the requested file does not exist. The thinking here is if the data manager raises FileNotFound due to a different error somewhere, the c++ extension module can then tell an error occurred (and not just that a file doesn't exist). I guess it's unlikely that that might happen, but I felt it was more explicit to define a type.

Let me know if I misinterpreted your response.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to figure out in what case would you call the datamanager and you need to differentiate a filenotfound via the datamanager vs a filenotfound at levels lower than the datamanager?

Copy link
Collaborator

@s22chan s22chan Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe can you show the before and the fixed version output in the PR description?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, there is no case where a FileNotFound is raised lower than the data manager. However, since we leave open the possibility of custom data managers, it is possible some other future (3rd-party) implementation might throw a FileNotFound as a result of some internal error.

It would be completely possible to change the code to instead just work with the built-in FileNotFound exception, and as a result some of the code could be slightly simplified (mainly removing the few lines that store the pointer to the GKDataFileNotFoundException type). However, we would lose that slight granularity in determining whether a raised FileNotFound exception is a result of a (possible) data manager error, or a true file not found (data manager determined the file does not exist).

The consequence for losing that granularity is when a FileNotFound would be thrown, we are falling back to the old behaviour for this specific exception type (behaviour has been updated in PR description).

Do you feel one approach is preferable here?

// Check for error
if (!new_path_obj) {
GK_THROW(value, "{}", traceback_format_exc_and_clear());
if (_gk_data_file_not_found_error && PyErr_ExceptionMatches(_gk_data_file_not_found_error)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a tests for both S3 and GCS: genome that doesn't exist, missing auth (specify a non-existing bucket name).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can probably replace this with a GK_RETHROW but not 100% sure anymore

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't GKRETHROW here because the exception here originates from within python, which wont be caught by the rethrow catch blocks (since they're catching c++ exceptions).

Comment on lines +55 to +57
if len(eps) > 0:
logging.warning("Failed to load the data manager plugin. "
"Falling back to the default data manager.")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously if a custom data manager threw an error when attempting to load, GK would silently fallback to the default DM. Instead we warn the user.

Comment on lines +48 to +50
if len_eps > 1:
logging.info("Multiple data manager plugins found. "
f"Using the first one: {list_eps[0].name}")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of multiple data managers, only the first one is used. This behaviour would occur silently. Now we inform the user.

@ovesh ovesh requested a review from s22chan July 14, 2025 19:20
@ovesh
Copy link
Contributor

ovesh commented Jul 15, 2025

Please update the PR description with terminal output of the old and new behavior.

Copy link
Contributor

@ovesh ovesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking comment, preemptively approving.

@ovesh ovesh merged commit 00d61c7 into main Sep 23, 2025
11 checks passed
@ovesh ovesh deleted the 171-datamanager-error-propagation branch September 23, 2025 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Misleading error message when datamanager fails to load refg

4 participants