feat: set divisions to unknown if we expect a read error report in `uproot.dask` #1543

ikrommyd · 2025-12-21T14:42:05Z

Following Martin's suggestion here: dask-contrib/dask-awkward#592 (comment), I am setting the divisions to unknown if we have an uproot read error report.
The reason is that when we have such a report object, we do not actually know if we'll be able to read all of our partitions. One may have 2 partitions [[0,20], [20,40]] and manage to read only one of them due to some data access error that the report is meant to catch and tell you about later.
However, if we have known divisions, dask-awkward can do certain optimizations that will give back wrong results if reading a partition fails. For example, it will tell you that dak.num(events, axis=0) is 40 just because it knows the divisions, without calculating it during computation. If the user does an operation like some average = some quantity / dak.num(events, axis=0), then the result is wrong: the denominator will be 40, but the user actually managed to read only 20 events, and the numerator is calculated from those 20 events only.
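To make the failure mode concrete, here is a minimal sketch of the arithmetic in plain NumPy. It does not use the dask-awkward API; the array values and all variable names are made up for illustration, and it simply assumes that the second of the two partitions failed to load.

```python
import numpy as np

# Hypothetical per-event quantity for two partitions of 20 events each.
partition_0 = np.ones(20)  # read successfully
partition_1 = None         # read failed; the report would record this

# Only the partitions that were actually read contribute to the numerator.
read_partitions = [p for p in (partition_0, partition_1) if p is not None]
numerator = sum(p.sum() for p in read_partitions)

# Denominator taken from known divisions (0, 20, 40): it claims 40 events.
count_from_divisions = 40
# Denominator counted during computation: only the events actually read.
count_computed = sum(len(p) for p in read_partitions)

wrong_average = numerator / count_from_divisions  # mixes 20 read events with a count of 40
right_average = numerator / count_computed        # numerator and count agree
```

With unknown divisions, the count has to come from the computation itself, which corresponds to the second denominator in this sketch.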

Before:

In [7]: events, report = uproot.dask({"../coffea/tests/samples/nano_dy.root": {"object_path": "Events", "steps": [[0, 20], [20, 40]]}}, allow_read_errors_with_report=True)

In [8]: dak.num(events, axis=0)
Out[8]: dask.awkward<num, type=Scalar, dtype=int64, known_value=40>

In [9]: dak.num(events, axis=0).compute()
Out[9]: 40

After:

In [5]: events, report = uproot.dask({"../coffea/tests/samples/nano_dy.root": {"object_path": "Events", "steps": [[0, 20], [20, 40]]}}, allow_read_errors_with_report=True)

In [6]: dak.num(events, axis=0)
Out[6]: dask.awkward<numaxis0, type=Scalar, dtype=int64>

In [7]: dak.num(events, axis=0).compute()
Out[7]: np.int64(40)

If we don't want the report, nothing changes with this PR:

In [9]: events = uproot.dask({"../coffea/tests/samples/nano_dy.root": {"object_path": "Events", "steps": [[0, 20], [20, 40]]}}, allow_read_errors_with_report=False)

In [10]: dak.num(events, axis=0)
Out[10]: dask.awkward<num, type=Scalar, dtype=int64, known_value=40>

In [11]: dak.num(events, axis=0).compute()
Out[11]: 40

Let me know if this is not the right way to implement this in uproot.dask and there is a better way.

ikrommyd · 2025-12-23T00:03:23Z

cc @martindurant @ariostas

ariostas

Thank you, @ikrommyd! That makes sense and looks good to me. Let's see if @martindurant has any comments.

martindurant · 2025-12-23T18:09:10Z

Yes, we discussed this at some point. It smells a bit wrong to throw away the information, but I suppose it's all you can do when the number of rows you actually end up processing isn't what you would have expected.

martindurant · 2025-12-23T18:10:16Z

Note: there are some methods to discover and set the divisions attribute again, so user beware. I don't think they would be called in normal uproot operations.

ikrommyd · 2025-12-23T18:11:37Z

Rediscover in this case means doing computation, right? Like eager_compute_divisions (or however it's called). As far as I know, uproot just serves this to the user either directly or through coffea and doesn't do anything else.

ikrommyd · 2025-12-23T18:18:35Z

If you mean this, I think this is fine. It's up to the user if they want to run this and I actually don't think any user thinks or wants to think about divisions in general. @martindurant do you know if this involves opening the files? If it does, it's perfect because a bad file would error in this case.

In [3]: events
Out[3]: dask.awkward<from-uproot, type='## * NanoEvents', npartitions=2>

In [4]: events.divisions
Out[4]: (None, None, None)

In [5]: events.eager_compute_divisions
Out[5]: <bound method Array.eager_compute_divisions of dask.awkward<from-uproot, type='## * NanoEvents', npartitions=2>>

In [6]: events.eager_compute_divisions()

In [7]: events.divisions
Out[7]: (0, np.int64(20), np.int64(40))

In [8]: import dask_awkward as dak

In [9]: dak.num(events, axis=0)
Out[9]: dask.awkward<num, type=Scalar, dtype=int64, known_value=40>

ikrommyd · 2025-12-23T18:24:59Z

Oh, this is good: if we want a report, it skips the divisions it could not open. So if I delete the file, it gives back the proper length:

In [4]: events, report = NanoEventsFactory.from_root({"~/nano1.root": {"object_path": "Events", "steps": [[0, 20], [20, 40]]}, "~/nano2.root": {"object_path": "Events", "steps": [[0, 20], [20, 40]]}}, mode="dask", uproot_options={"allow_read_errors_with_report": True}).events()

In [5]: events
Out[5]: dask.awkward<from-uproot, type='## * NanoEvents', npartitions=4>

In [6]: events._divisions
Out[6]: (None, None, None, None, None)

In [7]: !rm /Users/iason/nano1.root

In [8]: events.eager_compute_divisions()

In [9]: events
Out[9]: dask.awkward<from-uproot, type='40 * NanoEvents', npartitions=4>

In [10]: events._divisions
Out[10]: (0, np.int64(0), np.int64(0), np.int64(20), np.int64(40))

If I delete the file and do not have the report, it errors with FileNotFoundError: [Errno 2] No such file or directory: '/Users/iason/nano1.root'
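As a small illustration of how to read the divisions tuple printed above, the sketch below derives per-partition lengths as differences of consecutive boundaries. It uses plain integers rather than the np.int64 values shown in the session, and the variable names are only illustrative.

```python
import numpy as np

# Divisions as printed above after deleting nano1.root (plain ints for simplicity).
divisions = (0, 0, 0, 20, 40)

# Each partition's length is the difference between consecutive boundaries.
partition_lengths = np.diff(divisions).tolist()

# The total length is the span between the first and last boundary.
total_length = divisions[-1] - divisions[0]
```

The two partitions belonging to the deleted file come out with length zero, and the remaining two contribute 20 events each, which matches the '40 * NanoEvents' type shown in the session.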

martindurant · 2025-12-23T18:57:36Z

Yes, agree with all of that. .num uses the divisions only if they are known; and essentially it's the same call to update the divisions if the user requests. I agree that probably none of that matters to coffea users expecting unreadable files. The only wrinkle might be if files are intermittently unavailable, in which case the result of num and real analysis will only be guaranteed to agree if they are computed in the same graph.
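The "only if they are known" behaviour can be sketched as follows. This is a hypothetical helper written for illustration, not dask-awkward's actual implementation; the function name `length_if_known` is an assumption.

```python
from typing import Optional, Sequence


def length_if_known(divisions: Sequence[Optional[int]]) -> Optional[int]:
    """Return the total length implied by the divisions, or None if any boundary is unknown.

    Hypothetical helper: a None result stands for "the length must be computed".
    """
    if any(d is None for d in divisions):
        return None
    return divisions[-1] - divisions[0]


known = length_if_known((0, 20, 40))          # shortcut is possible
unknown = length_if_known((None, None, None))  # no shortcut; count during compute
```

In this picture, setting the divisions to unknown when a report is requested simply removes the shortcut, so the length is always counted from the partitions that were actually read.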

ikrommyd · 2025-12-23T19:02:43Z

> Yes, agree with all of that. .num uses the divisions only if they are known; and essentially it's the same call to update the divisions if the user requests. I agree that probably none of that matters to coffea users expecting unreadable files. The only wrinkle might be if files are intermittently unavailable, in which case the result of num and real analysis will only be guaranteed to agree if they are computed in the same graph.

There are intermittently unavailable files (a site that is not responding now may get fixed a few seconds or minutes later). That is why I also support merging dask-contrib/dask-awkward#592 on top of this, so that the new_known_scalar optimization can never be used, even if eager_compute_divisions is called manually: the divisions computed from it may no longer be accurate a bit later, when the actual dask.compute is called. People typically do a single dask.compute call, so everything is typically part of the same graph, since graph optimization fuses things that have the same starting node (same files).
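The point about intermittent availability can be illustrated with a plain-Python analogy. This is not dask code: `FlakyFile` is a hypothetical stand-in for a file that is unavailable on its first access only, and a failed read is modelled as an empty result, loosely mimicking a partition that is skipped and reported.

```python
class FlakyFile:
    """Hypothetical file that is unavailable on the first access only."""

    def __init__(self, values):
        self.values = values
        self.accesses = 0

    def read(self):
        self.accesses += 1
        if self.accesses == 1:
            return []  # simulated read failure: the partition is skipped
        return self.values


# Two separate passes: the count and the sum see different states of the file.
f = FlakyFile([1.0] * 20)
count_early = len(f.read())  # file unavailable at this moment
total_late = sum(f.read())   # file available again

# One pass: the count and the sum are derived from the same read.
g = FlakyFile([1.0] * 20)
data = g.read()
count_same, total_same = len(data), sum(data)
```

In the two-pass case the count and the sum disagree about how many events were seen, while in the single-pass case they are consistent with each other even though the read failed. That is the sense in which results are only guaranteed to agree when they come from the same graph.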

ikrommyd · 2026-01-06T14:22:45Z

I assume this should be good to go?

ariostas · 2026-01-06T14:35:50Z

Yeah, you can go ahead and merge it if you're done!

set divisions to None if we expect a report

7d9cce0

ariostas approved these changes Dec 23, 2025


ariostas added this to the 5.7.0 milestone Jan 5, 2026

Merge branch 'main' into unknown-divisions-if-report

de52c65

ikrommyd merged commit ffff317 into scikit-hep:main Jan 6, 2026
29 checks passed

ikrommyd deleted the unknown-divisions-if-report branch January 6, 2026 14:37
