Conversation


@musaqlain musaqlain commented Jan 25, 2026

Achieved an ~8.6x speedup (9.5 min -> 1.1 min) in dataset validation by reducing disk I/O and vectorizing coordinate checks.

Changes

  • Optimized I/O: Caches image dimensions to avoid repeated file opens.
  • Vectorization: Replaced iterrows with Pandas boolean masking.
  • Unified Error Reporting: Merged checks for negative/OOB coordinates into a single report that lists all invalid boxes to aid debugging.
  • Polygon Support: Automatically converts non-rectangular geometries to valid bounding boxes.
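The changes above might look roughly like the following sketch. Function and column names (`validate_boxes`, `image_path`, `xmin`/`ymin`/`xmax`/`ymax`) are illustrative assumptions for this sketch, not the actual DeepForest API:

```python
import pandas as pd

def validate_boxes(df: pd.DataFrame, image_sizes: dict) -> pd.DataFrame:
    """Return the subset of annotation rows with invalid boxes.

    image_sizes maps image_path -> (width, height). In practice it would
    be populated by opening each unique image once (e.g. with PIL) and
    caching the result, instead of reopening the file for every row.
    """
    width = df["image_path"].map(lambda p: image_sizes[p][0])
    height = df["image_path"].map(lambda p: image_sizes[p][1])

    # Vectorized boolean masks replace the old per-row iterrows() loop.
    negative = (df[["xmin", "ymin", "xmax", "ymax"]] < 0).any(axis=1)
    out_of_bounds = (df["xmax"] > width) | (df["ymax"] > height)
    return df[negative | out_of_bounds]
```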

Fixes #1244

AI-Assisted Development

  • I used AI tools (e.g., GitHub Copilot, ChatGPT) in developing this PR
  • I understand all the code I'm submitting

@musaqlain musaqlain changed the title Vectorize coordinate validation for speedup on large datasets [WIP]: Vectorize coordinate validation for speedup on large datasets Jan 25, 2026
@musaqlain musaqlain changed the title [WIP]: Vectorize coordinate validation for speedup on large datasets Vectorize coordinate validation for speedup on large datasets Jan 25, 2026

bw4sz commented Jan 28, 2026

Thanks for your contribution. Let's wait for the tests to pass.


bw4sz commented Jan 28, 2026

@jveitchmichaelis, my general philosophy has always been to focus DeepForest on novice users and help them avoid errors before they happen. So in this case there is a raise if there are bad coordinates in the annotations. I think that is still right and not too annoying, instead of just printing or warning. I think we want to prioritize "hey, there is a problem here" over a streamlined flow that bets the user understands what they are doing. Agreed?

@bw4sz bw4sz self-requested a review January 28, 2026 18:00

@bw4sz bw4sz left a comment


Agreed, thanks.

@bw4sz bw4sz self-requested a review January 28, 2026 18:36

@bw4sz bw4sz left a comment


Have a look at the test failures.

@musaqlain musaqlain requested a review from bw4sz January 28, 2026 19:49

@jveitchmichaelis jveitchmichaelis left a comment


Requested some of the LLM-style comments be cleaned up.

It would be interesting to run this with a profiler. I assume 90+% of the time is spent opening images (and so caching is the real win to avoid reloading images). Hence why I don't think vectorizing is the most important aspect here, even though it's definitely an improvement.

I tested filtering a 100M-row numpy array on Colab; it takes a couple of seconds. So the minutes of runtime are likely I/O.
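For reference, that filtering test is easy to reproduce. A minimal sketch, scaled down from the 100M-row Colab run (array size and values are illustrative):

```python
import time
import numpy as np

n = 1_000_000  # scaled down from the 100M-row Colab test
rng = np.random.default_rng(0)
coords = rng.uniform(-10, 1000, size=(n, 4))  # fake xmin/ymin/xmax/ymax

start = time.perf_counter()
invalid = (coords < 0).any(axis=1)  # vectorized negative-coordinate mask
bad = coords[invalid]
elapsed = time.perf_counter() - start
print(f"filtered {n:,} rows in {elapsed:.3f}s ({invalid.sum():,} invalid)")
```

Even at 100x this size, the mask runs in seconds, which supports the point that the original minutes-long runtime was dominated by image I/O rather than the per-row checks.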

@musaqlain musaqlain changed the title Vectorize coordinate validation for speedup on large datasets Optimize coordinate validation (I/O Caching + Vectorization) Jan 28, 2026

jveitchmichaelis commented Jan 28, 2026

@bw4sz sorry to belabor this review. If the goal is to show users where issues are, I think it's important that the output is actually useful for debugging. Seeing that some number of boxes are malformed, or only their coordinates, doesn't help me fix what's wrong with my dataset. What do you think about reporting a complete list of image name / box issues? e.g.

Errors:
image_name, box coords, error

In the existing code we print an error string for every invalid box, while this PR reports only a count for negative coordinates:

errors.append(f"Found {bad_count} annotations with negative coordinates.")

@musaqlain If you move the logic for negative coordinates into the oob_mask I think that would be better.
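A rough sketch of what that unified report could look like. Names are illustrative, and the scalar width/height stand in for the cached per-image sizes:

```python
import numpy as np
import pandas as pd

def report_invalid_boxes(df: pd.DataFrame, width: int, height: int) -> pd.DataFrame:
    cols = ["xmin", "ymin", "xmax", "ymax"]
    negative = (df[cols] < 0).any(axis=1)
    oob = (df["xmax"] > width) | (df["ymax"] > height)

    # One report listing every offending box: image name, coords, reason.
    report = df.loc[negative | oob, ["image_path"] + cols].copy()
    # A box that is both negative and out of bounds is labeled once,
    # with the negative-coordinate reason taking precedence.
    report["error"] = np.where(
        negative[report.index], "negative coordinate", "out of bounds"
    )
    return report
```

This gives the user the image name, the box coordinates, and the error for each invalid annotation, rather than a bare count.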

@musaqlain (Author)

Completely makes sense. I have updated the logic to catch both negative and out-of-bounds coordinates and give users a detailed report. Thanks, cc @jveitchmichaelis

@musaqlain musaqlain changed the title Optimize coordinate validation (I/O Caching + Vectorization) Optimize coordinate validation (I/O Caching + Vectorization) & Unified Error Reporting Jan 28, 2026
@musaqlain musaqlain changed the title Optimize coordinate validation (I/O Caching + Vectorization) & Unified Error Reporting optimize coordinate validation, handling both -ve and OOB error Jan 28, 2026

codecov bot commented Jan 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.21%. Comparing base (3146b96) to head (7f018f5).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1285      +/-   ##
==========================================
- Coverage   87.89%   87.21%   -0.69%     
==========================================
  Files          20       20              
  Lines        2776     2784       +8     
==========================================
- Hits         2440     2428      -12     
- Misses        336      356      +20     
Flag Coverage Δ
unittests 87.21% <100.00%> (-0.69%) ⬇️




musaqlain commented Feb 2, 2026

PTAL cc @jveitchmichaelis

@bw4sz bw4sz added the API This tag is used for small improvements to the readability and usability of the python API. label Feb 2, 2026

Development

Successfully merging this pull request may close these issues.

Validate annotations takes a very long time for large datasets.
