-
Notifications
You must be signed in to change notification settings - Fork 88
feat: set divisions to unknown if we expect a read error report in uproot.dask
#1543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: set divisions to unknown if we expect a read error report in uproot.dask
#1543
Conversation
ariostas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @ikrommyd! That makes sense and looks good to me. Let's see if @martindurant has any comments.
|
Yes, we discussed this at some point. It smells a bit wrong to throw away the information, but I suppose it's all you can do when the number of rows you actually end up processing isn't what you would have expected. |
|
Note: there are some methods to discover and set the divisions attribute again, so user beware. I don't think they would be called in normal uproot operations. |
|
Re discvoer in this case means doing computation right? Like |
|
If you mean this, I think this is fine. It's up to the user if they want to run this and I actually don't think any user thinks or wants to think about divisions in general. @martindurant do you know if this involves opening the files? If it does, it's perfect because a bad file would error in this case. |
|
Oh this is good, if we want a report, it skips the divisions it could not open. So if I delete the file, it gives back the proper length If I delete the file do not have the report, it errors with |
|
Yes, agree with all of that. .num uses the divisions only if they are known; and essentially it's the same call to update the divisions if the user requests. I agree that probably none of that matters to coffea users expecting unreadable files. The only wrinkle might be if files are intermitterntly unavailable, in which case the result of num and real analysis will only be guaranteed to agree if they are computed in the same graph. |
There are intermittently unavailable files (a site not responding now may get fixed a few seconds/minutes later). Which is why I also support merging dask-contrib/dask-awkward#592 on top of this too so that the |
|
I assume this should be good to go? |
|
Yeah, you can go ahead an merge it if you're done! |
Following Martin's suggestion here: dask-contrib/dask-awkward#592 (comment), I am setting the divisions to unknown if we have an uproot read error report.
The reason is that when we have such a report object, we do not actually know if we'll be able to read all of our partitions. One may have 2 partitions [[0,20], [20,40]] and manage to read only one of them due to some data access error that the report is meant to catch and tell you about it later.
However, if we have known divisions, dask-awkward can do certain optimizations that will give back wrong results if reading a partition fails. it will tell you for example that
dak.num(events, axis=0)is 40 just because it knows the divisions without calculating it during computation. If the user does an operation likesome average = some quantity / dak.num(events, axis=0), then that is wrong because the denominator will be 40 but the user actually managed to read 20 events, not 40 and the numerator is calculated from 20 events only.Before:
After:
If we don't want the report, nothing changes with this PR:
Let me know if this is not the right way to implement this in
uproot.daskand there is a better way.