-
Notifications
You must be signed in to change notification settings - Fork 28
Bug: Don't parse annotations as documents, even if the tag is set #636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a bug where metadata fields tagged as both annotations and documents would fail to load correctly. The fix ensures that annotation fields take precedence over document fields during data loading by excluding annotation fields from document field processing.
Key changes:
- Modified query result processing to filter out annotation fields from document field conversion
- Added test coverage for the scenario where a field has both annotation and document tags
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
tests/data_engine/conftest.py |
Added PreprocessingStatus import and initialization in mock datasource setup |
tests/data_engine/annotation_import/test_annotation_parsing.py |
Added comprehensive test for annotation fields that also have document tags |
tests/data_engine/annotation_import/res/annotation1.json |
Added test fixture containing segmentation annotation data |
dagshub/data_engine/model/query_result.py |
Fixed bug by filtering annotation fields from document field processing |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
simonlsk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to solve the issue thank you.
I have a suggestion to make the flow more solid, but otherwise LGTM.
| # Convert any downloaded document fields | ||
| document_fields = [f for f in fields if f in self.document_fields] | ||
| # Exclude any annotation fields, because they are already converted above | ||
| document_fields = [f for f in document_fields if f not in self.annotation_fields] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would make more sense to me for every processed field to either be loaded as document or annotation here in this function, and not called twice from outside and then filter out annotations when it's called for documents.
In the current flow if we were to add a new tag that conflicts with the existing one, you would need to double exclude every field from the ones they don't belong to.
Let me know if this fix is easy to make.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would require a refactor, due to the way annotation errors are handled and printed out
Can totally be refactored though, just not a 5 minute refactor
In Data Engine, if you have a metadata field that is set both as an annotation and a document, loading the annotation field fails.
Example of setting an annotation AND a document (might happen automatically on the backend):
This PR fixes it, now if an annotation field also has the document field set, the annotation tag takes precedence, and it's ignored for loading documents.