@ikrommyd
Collaborator

@ikrommyd ikrommyd commented Dec 21, 2025

Following Martin's suggestion here: dask-contrib/dask-awkward#592 (comment), I am setting the divisions to unknown if we have an uproot read error report.
The reason is that when we have such a report object, we do not actually know whether we'll be able to read all of our partitions. One may have 2 partitions [[0,20], [20,40]] and manage to read only one of them because of a data access error, which the report is meant to catch and tell you about later.
However, if we have known divisions, dask-awkward can apply certain optimizations that give back wrong results if reading a partition fails. For example, it will tell you that dak.num(events, axis=0) is 40 just because it knows the divisions, without calculating it during computation. If the user does an operation like some average = some quantity / dak.num(events, axis=0), the result is wrong: the denominator is 40, but the user actually managed to read only 20 events, and the numerator is calculated from those 20 events only.
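
To make the failure mode concrete, here is a minimal pure-Python sketch that uses neither uproot nor dask-awkward. The `read_partition` helper and the `FAILING` set are hypothetical stand-ins for a partition whose read error is recorded in the report instead of being raised; only the arithmetic is meant to carry over.

```python
# Hypothetical stand-in for two planned partitions, [0, 20) and [20, 40).
STEPS = [(0, 20), (20, 40)]
FAILING = {(20, 40)}  # assumption: this partition hits a data access error


def read_partition(step):
    """Return the event values for a step, or an empty list if the read fails."""
    if step in FAILING:
        return []  # with a report, the failure is recorded instead of raised
    start, stop = step
    return [1.0] * (stop - start)  # one unit of "some quantity" per event


values = [v for step in STEPS for v in read_partition(step)]
numerator = sum(values)

# Length taken from the planned divisions, without reading anything.
n_from_divisions = STEPS[-1][1] - STEPS[0][0]
# Length counted from the data that was actually read.
n_counted = len(values)

wrong_average = numerator / n_from_divisions
right_average = numerator / n_counted
```

With every event contributing 1.0, the average should be 1.0. Dividing by the length implied by the divisions halves it, because the numerator only ever saw the 20 events that were read.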

Before:

In [7]: events, report = uproot.dask({"../coffea/tests/samples/nano_dy.root": {"object_path": "Events", "steps": [[0, 20], [20, 40]]}}, allow_read_errors_with_report=True)

In [8]: dak.num(events, axis=0)
Out[8]: dask.awkward<num, type=Scalar, dtype=int64, known_value=40>

In [9]: dak.num(events, axis=0).compute()
Out[9]: 40

After:

In [5]: events, report = uproot.dask({"../coffea/tests/samples/nano_dy.root": {"object_path": "Events", "steps": [[0, 20], [20, 40]]}}, allow_read_errors_with_report=True)

In [6]: dak.num(events, axis=0)
Out[6]: dask.awkward<numaxis0, type=Scalar, dtype=int64>

In [7]: dak.num(events, axis=0).compute()
Out[7]: np.int64(40)

If we don't want the report, nothing changes with this PR:

In [9]: events = uproot.dask({"../coffea/tests/samples/nano_dy.root": {"object_path": "Events", "steps": [[0, 20], [20, 40]]}}, allow_read_errors_with_report=False)

In [10]: dak.num(events, axis=0)
Out[10]: dask.awkward<num, type=Scalar, dtype=int64, known_value=40>

In [11]: dak.num(events, axis=0).compute()
Out[11]: 40

Let me know if this is not the right way to implement this in uproot.dask and there is a better way.

@ikrommyd
Collaborator Author

cc @martindurant @ariostas

Member

@ariostas ariostas left a comment

Thank you, @ikrommyd! That makes sense and looks good to me. Let's see if @martindurant has any comments.

@martindurant

Yes, we discussed this at some point. It smells a bit wrong to throw away the information, but I suppose it's all you can do when the number of rows you actually end up processing isn't what you would have expected.

@martindurant

Note: there are some methods to discover and set the divisions attribute again, so user beware. I don't think they would be called in normal uproot operations.

@ikrommyd
Collaborator Author

Rediscover in this case means doing computation, right? Like eager_compute_divisions (or whatever it's called). As far as I know, uproot just serves this to the user, either directly or through coffea, and doesn't do anything else.

@ikrommyd
Collaborator Author

ikrommyd commented Dec 23, 2025

If you mean this, I think this is fine. It's up to the user if they want to run this and I actually don't think any user thinks or wants to think about divisions in general. @martindurant do you know if this involves opening the files? If it does, it's perfect because a bad file would error in this case.

In [3]: events
Out[3]: dask.awkward<from-uproot, type='## * NanoEvents', npartitions=2>

In [4]: events.divisions
Out[4]: (None, None, None)

In [5]: events.eager_compute_divisions
Out[5]: <bound method Array.eager_compute_divisions of dask.awkward<from-uproot, type='## * NanoEvents', npartitions=2>>

In [6]: events.eager_compute_divisions()

In [7]: events.divisions
Out[7]: (0, np.int64(20), np.int64(40))

In [8]: import dask_awkward as dak

In [9]: dak.num(events, axis=0)
Out[9]: dask.awkward<num, type=Scalar, dtype=int64, known_value=40>

@ikrommyd
Collaborator Author

ikrommyd commented Dec 23, 2025

Oh, this is good: if we want a report, it skips the divisions it could not open. So if I delete the file, it gives back the proper length:

In [4]: events, report = NanoEventsFactory.from_root({"~/nano1.root": {"object_path": "Events", "steps": [[0, 20], [20, 40]]}, "~/nano2.root": {"object_path": "Events", "steps": [[0, 20], [20, 40]]}}, mode="dask", uproot_options={"allow_read_errors_with_report": True}).events()

In [5]: events
Out[5]: dask.awkward<from-uproot, type='## * NanoEvents', npartitions=4>

In [6]: events._divisions
Out[6]: (None, None, None, None, None)

In [7]: !rm /Users/iason/nano1.root

In [8]: events.eager_compute_divisions()

In [9]: events
Out[9]: dask.awkward<from-uproot, type='40 * NanoEvents', npartitions=4>

In [10]: events._divisions
Out[10]: (0, np.int64(0), np.int64(0), np.int64(20), np.int64(40))

If I delete the file and do not have the report, it errors with FileNotFoundError: [Errno 2] No such file or directory: '/Users/iason/nano1.root'
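
The divisions shown above are consistent with a running total of per-partition lengths in which the two partitions of the deleted file contribute zero rows. The sketch below only illustrates that arithmetic; `divisions_from_lengths` is a hypothetical helper and is not claimed to be how dask-awkward computes divisions internally.

```python
from itertools import accumulate


def divisions_from_lengths(lengths):
    """Build division boundaries as the running total of partition lengths."""
    return (0, *accumulate(lengths))


# Four planned partitions of 20 events each. Assumption: the first two belong
# to the file that could not be opened, so they contribute zero rows.
lengths_with_missing_file = [0, 0, 20, 20]
divisions = divisions_from_lengths(lengths_with_missing_file)

# The last boundary is the total number of rows that were actually readable.
total_events = divisions[-1]
```

Here `divisions` has the same shape as the `events._divisions` output above, and `total_events` matches the 40 in the `'40 * NanoEvents'` type.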

@martindurant
Copy link

Yes, agree with all of that. .num uses the divisions only if they are known; and essentially it's the same call to update the divisions if the user requests. I agree that probably none of that matters to coffea users expecting unreadable files. The only wrinkle might be if files are intermittently unavailable, in which case the result of num and real analysis will only be guaranteed to agree if they are computed in the same graph.

@ikrommyd
Collaborator Author

ikrommyd commented Dec 23, 2025

Yes, agree with all of that. .num uses the divisions only if they are known; and essentially it's the same call to update the divisions if the user requests. I agree that probably none of that matters to coffea users expecting unreadable files. The only wrinkle might be if files are intermittently unavailable, in which case the result of num and real analysis will only be guaranteed to agree if they are computed in the same graph.

There are intermittently unavailable files (a site not responding now may get fixed a few seconds or minutes later). That is why I also support merging dask-contrib/dask-awkward#592 on top of this, so that the new_known_scalar optimization can never be used, even if eager_compute_divisions is called manually: the divisions computed from it may no longer be accurate a bit later, when the actual dask.compute is called. People typically do a single dask.compute call, so everything is typically part of the same graph, because graph optimization fuses things that have the same starting node (same files).
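
The intermittent case can be sketched in plain Python as well. `FlakySource` is a hypothetical stand-in for a file that is unavailable during one pass and available during the next; it models neither dask graphs nor uproot, only the point that a count and a sum agree when they come from the same read.

```python
class FlakySource:
    """Hypothetical source whose second partition is missing on the first pass only."""

    def __init__(self):
        self.passes = 0

    def read_all(self):
        self.passes += 1
        partitions = [[1.0] * 20, [1.0] * 20]
        if self.passes == 1:
            partitions[1] = []  # assumption: site not responding during this pass
        return [v for part in partitions for v in part]


# Two separate passes: the count and the sum see different data.
source = FlakySource()
count_first = len(source.read_all())  # second partition missing
total_later = sum(source.read_all())  # the site has recovered
inconsistent_mean = total_later / count_first

# One pass: the count and the sum come from the same read.
source = FlakySource()
data = source.read_all()
consistent_mean = sum(data) / len(data)
```

The separate passes play the role of calling eager_compute_divisions first and dask.compute later; the single pass plays the role of computing everything in one graph.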

@ariostas ariostas added this to the 5.7.0 milestone Jan 5, 2026
@ikrommyd
Collaborator Author

ikrommyd commented Jan 6, 2026

I assume this should be good to go?

@ariostas
Member

ariostas commented Jan 6, 2026

Yeah, you can go ahead and merge it if you're done!

@ikrommyd ikrommyd merged commit ffff317 into scikit-hep:main Jan 6, 2026
29 checks passed
@ikrommyd ikrommyd deleted the unknown-divisions-if-report branch January 6, 2026 14:37