Skip to content

Comments

Zonelink redesign#144

Open
hbrincon wants to merge 25 commits intomainfrom
zonelink_redesign
Open

Zonelink redesign#144
hbrincon wants to merge 25 commits intomainfrom
zonelink_redesign

Conversation

@hbrincon
Copy link
Contributor

@hbrincon hbrincon commented Jan 16, 2026

A redesign of the zone link calculation to use dictionaries instead of nested lists. Also includes a bug fix for the V2 ellipticity calculation in the catalog class and the addition of optional galaxy weighting for V2.

@hbrincon hbrincon requested a review from kadglass January 16, 2026 19:24
@QuiteAFoxtrot
Copy link
Collaborator

I think I have an even better solution than parallelizing - eliminating the section entirely and just keeping track of the zone linkage volumes during the zone creation instead of going through everything a second time

@QuiteAFoxtrot
Copy link
Collaborator

QuiteAFoxtrot commented Jan 19, 2026

So I have an update, its a bit messy, so I'm not intending on having it do a pull request to merge in here, but rather to provide an alternative implementation for you to look at, but here is a possible update on its own branch: https://github.com/DESI-UR/VAST/tree/zonelink_redesign_2

The main trick is thus - as we build the zones, we have to go through every galaxy/cell anyway, so why not just look at their linkage volumes as well and keep track of the zone linkage volumes then? This relies on a key trick - sometimes we will be adding the current galaxy to a zone but its neighbors will not have been added to a zone yet, but later when those neighbors come up as the focus of adding to a zone, we can add the linkage then, and so we can do the 'redundant' thing like:

zone_linkage[curr_zone][neigh_zone] = volume
zone_linkage[neigh_zone][curr_zone] = volume

This eliminates the need for the entire zlinks loop/section from what I can tell, so no parallelization needed (unless we later find that now the Zone Creation loop is now in need of parallelization - but we can address that after this bit has been tidied up).

Something that seemed natural to me is that the maximum linkage volume for a pair of zones could easily be stored in a hashmap/dictionary using the key as a 2-tuple (int(lower_zone_ID), int(upper_zone_ID)) - the int() is because numpy vals cant be hashed, and the lower-and-upper vals to make the pairs uniquely ordered as technically (0,1) and (1,0) would be different keys into the dict.

For clarity, I rewrote the zone creation section using a dictionary where the key is a Zone_ID, and then it has natural-language fields describing the properties of the zone. This makes things clearer in my opinion, but I had a difficult time figuring out what to do with the information in the next section prevoids. I don't understand the output data structures from the next Voids class (which get called prevoids in the later sortVoids method), and so I can't tell if they're doing something important or if all the information is captured by the way I rewrote things in the Zones method.

If you have a better understanding than me about what needs to happen in Voids/prevoids creation, I think I can take another swing at optimizing.

Before I moved to eliminating the zlinks loop entirely, I had started a two-pass section where first we get all the zone linkages, and then we could have had a parallel section calculating the zone linkage volumes, and that section I left in there commented-out with a comment heading like "Teriary implementation", if you're wondering what that was for.

Lastly, I haven't addressed the visualization bits, I saw what you added in terms of normal vectors and face stuff and I think it can be added here but I was just trying to solve the core speed issue first. So take a look and let me know what you think.

@hbrincon
Copy link
Contributor Author

@QuiteAFoxtrot Thanks for the work on this! I've taken a look at the new code to familiarize myself with it.

The Voids class docstring links to the original ZOBOV paper, and my understanding is that the class is adapting section 2.3 of the paper, reproduced here for convenience:


2.3 From zones to voids

Zones are joined as follows. Imagine a 2D density field (represented as height) in a water tank. For each zone z, the water level is set to z’s minimum density, and then raised gradually. Water may flow, along lines joining Voronoi neighbours, into adjacent zones, adding them to the void defined around the zone z. The process stops when water flows into a deeper zone (with a lower minimum than z’s), or if z is the deepest void, when water floods the whole field. The final void corresponding to z is defined as the set of zones containing water just before this happens. The minimum-density (core) particle of the original zone is also the minimum-density particle of the zone’s void. Many low-significance zones fail to annex surrounding zones as they attempt to grow; a zone in this situation has a void equal to itself. The density (water level) at which water flows into a deeper zone is recorded as ρl(z) (l stands for ‘link’ to a deeper zone).


With the above analogy, the Zones class goes as far as calculating the heights of the saddle points separating the zones through which water will flow. The Voids class then takes over for the rest of the steps. The voids (aka prevoids.voids) list has one entry for every zone, and that entry is the collection of zones that are joined by pouring water into the zone before water leaks into a deeper zone. The ovols list should correspond to the multiplicative inverse of the ρl(z) values, aka the saddle point cell volumes at which point water flows into a deeper zone. Admittedly, my understanding of the logic in the Voids class is limited, so I might be getting something wrong.

VIDE and ZOBOV pruning make use of prevoids.voids, whereas REVOLVER pruning does not and simply treats the individual zones as voids. Maybe we could implement a time save for users who are only running REVOLVER by making the Voids logic optional.

I also wonder if the Voids logic could be absorbed into the zone-building loop, similar to what you did for the zone-linking. That would eliminate the need for a separate voids class. Let me know your thoughts

@hbrincon
Copy link
Contributor Author

An update: I've ran some tests and prevoids.voids does contain the information described above, but this information is stored in a more complex manner to allow for VIDE pruning. To obtain the "ZOBOV" void definition given above, the V2 code runs

voids = [[c for q in v for c in q] for v in self.prevoids.voids]

The ovols list similarly contains ρl(z) values but stored in a more complex manner to allow for VIDE pruning. I still don't understand what the logic behind the way prevoids.voids and prevoids.ovols are organized, but I'll see if I can update the code to use the new zone_link_volumes definition

@hbrincon
Copy link
Contributor Author

hbrincon commented Jan 22, 2026

I've created a new branch zonelink_redesign_3 that converts the new data structures from zonelink_redesign_2 into a usable algorithm output. I'm noticing that when I run the vsquared example script on both branches, zonelink_redesign_3 produces 619 zones, wheras zonelink_redesign produces 620 zones (same as the main branch). I've verified that the difference originates directly after the zones are created in the zonebuilding stage and not later in the code. There seems to then be some sort of differerence that results in one less zone in the new code

@hbrincon hbrincon removed the request for review from kadglass January 22, 2026 17:55
@QuiteAFoxtrot
Copy link
Collaborator

QuiteAFoxtrot commented Jan 23, 2026

Maybe something to do with the fact that the old version uses zcell = [[]] list-of-list initialization, and only the degenerate/0-volume cells ever get added to that initial entry via:

if gal_cell_vols[gal_idx] == 0.:
        gal_zone_IDs[gal_idx] = -1
        zcell[-1].append(gal_idx)
        continue

Wheras in zonelink_redesign2/3 I separated out a new list degenerate_gal_cells = [] and append inside that separate object instead?

@hbrincon
Copy link
Contributor Author

That seems to be it. Further comparing the catalogs, this change propagates into which galaxies are considered edge galaxies. The 0-volume cells are no longer being flagged as edge galaxies, since the elist array is initialized to all zeros (non-edge galaxies) and is no longer being updated for these galaxies (in fact, these were the only galaxies that were categorized as "edge galaxies" to begin with, so without them, the edge column output is all set to 0). For consistency with past versions, we may want to flag them as edge galaxies.

I'm finding that between zonelink_redesign2/3, the order of a single pair of the voids in the voids output table has been swapped. I'm not sure of the reason for the swap, but that is something to keep in mind for the unit tests

@hbrincon
Copy link
Contributor Author

hbrincon commented Jan 26, 2026

I've updated the edge galaxy calculation in zonelink_redesign_3 to match the main branch. While working, I discovered an odd behavior, where one Voronoi cell might be adjacent to a neighboring cell, but the neighboring cell is not adjacent back to the cell. E.g. for the example script, this occurred with the below cells/zones.

galaxy 5098 (zone 152) is adjacent to galaxy 11721 (zone 35)
galaxy and neighbors: 5098 [ 5096  4673  4591 11721  4671  5094  5374  4752  4675  5738 10830 11725]
converted to zones: 152 [ 33  25  25  35  -1  35 138  -1  25 152  35  25]

galaxy 11721 (zone 35) is not adjacent to 5098 (zone 152) 
galaxy and neighbors: 11721 [11725 11723 11654  4671  5094  5103 11708 11587 10830  4591 11769 11582 5095  5098  4675]
converted to zones: 35 [25 -1 25 -1 35 25 35 35 35 25 25 35 -1 -1 25]

This is coming directly from the multivoro output, so it might be an issue with multivoro (happens regardless of whether I set num_cpus to 1 or more than 1). But the multivoro voids have previously been shown to resemble the scipy tessellation voids in slice plots, so this mismatch doesn't appear to be signification to the output, at least for the example script.

I also discovered a potential bug where two saddle-point Voronoi cells could get treated as the same cell if they have the exact same volume, but the change of this happening seems negligible to me, so I've left my (slower) version of the code that would fix the bug commented out (to clarify, this potential bug is also present on the main branch).

@hbrincon
Copy link
Contributor Author

@QuiteAFoxtrot Let me know if you were planning to take another go at optimizing the branch. Apart from the Voids class (the logic of which still isn't fully clear to me), the branch might be improvable in the places where I've connected your new data structures for the zone links to subsequent steps in zobov.py. E.g. lines such as

vcuts = [list(flatten(self.zones.zcell[v])) for v in voids]

in zonelink_redesign_1/main were replaced with

 zcell = np.array([self.zones.zone_info[zone_ID]["galaxy_indices"] for zone_ID in self.zones.zone_info.keys()], dtype=object)
 vcuts = [list(flatten(zcell[v])) for v in voids]

in zonelink_redesign_3, and I'm not sure if there's a less awkward way to use the data structures when constructing the void output tables. Let me know what you think

@hbrincon
Copy link
Contributor Author

hbrincon commented Jan 27, 2026

I've figured out what the prevoids.voids list contains. prevoids.voids is a list of lists of lists, e.g.

[[[1]], [[0]], [[2], [1], [0]], [[3], [0, 1, 2]], [[4], [0, 1, 2, 3]]]

Each entry in prevoids.voids corresponds to a saddle point in the matter-density field (aka a zone link), and they are ordered by increasing zone-linking density. For each entry, the algorithm asks "if we flood the watershed to reach this saddle point, which zones contain water?" There will be two sets of zones that get flooded, one on each side of the saddle point (or potentially 3 or more if a larger number of zones meet at the saddle point, which does occur at times in the VAST example script). Each set has its own "lowest water depth" that the water reaches as it runs downhill through the zones. The algorithm ignores the set with the largest "lowest water depth" and considers the remaining set (or sets, if 3+ zones meet at the saddle point). If the remaining set contains a single zone, then the list entry contains that zone, such as with [[1]] and [[0]] in the above example (1 and 0 are the zone IDs). If the remaining set contains multiple zones, then they are included together in a "compound void", such as with [[2], [1], [0]] in the above example. However, if any of the zones in the set appeared together in a previous compound void in the list, they are grouped separately, such as with [[3], [0, 1, 2]] and [[4], [0, 1, 2, 3]] in the above example where the [3] and the [4] are the newly added zones. If both sides of the saddle point flow to the same "lowest water depth" zone, then the saddle point is excluded from the list. If 3+ zones met at the saddle point, then additional items would be added to the voids list for the remaining sets. The final entry in the list [[4], [0, 1, 2, 3]] doesn't correspond to a saddle point, but is rather the largest possible void obtained by combining all zones together (this void is created separately after the main program loop is done)

The above voids list corresponds to the below arrangement of zones. The zone ID is the number in each box. The size of the box is proportional to the largest Voronoi cell within it (used to calculate the "lowest water depth" for a given saddle point). The saddle points correspond to the letters A through E, with the order of the letters corresponding to the order that the saddle points flood as the water rises (if 3+ zones were to meet at a saddle point, then that letter would appear twice in the below list, but that isn't the case for this example).

A B C E Final void
[[[1]], [[0]], [[2], [1], [0]], [[3], [0, 1, 2]], [[4], [0, 1, 2, 3]]]

D is excluded because both sides of the saddle point flood down to the same zone (zone 3)

Here is a script that generates the above voids list while printing out each step in the program:
voids_list_example.py

20260127_163558

And here is a version where I add an extra zone so that three zones (0, 1, 5) meet at a single saddle point.
voids_list_example_2.py

[[[1]], [[0]], [[2], [1], [0]], [[5]], [[3], [0, 1, 2], [5]], [[4], [0, 1, 2, 3, 5]]]

Now as for why the voids need to be stored this way for VIDE to work, I'm still not sure about that

@QuiteAFoxtrot
Copy link
Collaborator

QuiteAFoxtrot commented Jan 31, 2026

Taking a look at your updates - just some notes:

~line 1496 in classes.py:

link_volumes_dict = dict.fromkeys(link_volumes, [])

for pair, watershed_break in zone_link_volumes.items():
            
    if pair[0] not in link_volumes_dict[watershed_break]:
                
        link_volumes_dict[watershed_break] = link_volumes_dict[watershed_break] + [pair[0]]
        
    if pair[1] not in link_volumes_dict[watershed_break]:
                
        link_volumes_dict[watershed_break] = link_volumes_dict[watershed_break] + [pair[1]]

I like your breakdown & usage of the new data structure here, but I do have a suspicion and a question - if you look at the docs for dict.fromkeys() here: https://docs.python.org/3.10/library/stdtypes.html - it says "fromkeys() is a class method that returns a new dictionary. value defaults to None. All of the values refer to just a single instance, so it generally doesn’t make sense for value to be a mutable object such as an empty list. To get distinct values, use a dict comprehension instead." Basically, you create a dictionary something like link_volumes_dict = {vol1 : list1, vol2 : list1, vol3 : list1, . . .} where all the references are to the same list, which I suspect was unintentional? Now luckily when you do link_volumes_dict[watershed_break] = link_volumes_dict[watershed_break] + [pair[0]] adding two lists together creates a new list object which then replaces the reference to the single list object which was present in all the keys, so I think your code here does work, but in a way which relies on the quirks of list addition. I think we should replace this section with something like:

link_volumes_dict = {}

for pair, watershed_break in zone_link_volumes.items():

    if watershed_break not in link_volumes_dict:
        link_volumes_dict[watershed_break] = [] #explicitly creates a new list object for each watershed break

    if pair[0] not in link_volumes_dict[watershed_break]:
        link_volumes_dict[watershed_break].append(pair[0])

    if pair[1] not in link_volumes_dict[watershed_break]:
        link_volumes_dict[watershed_break].append(pair[1])

This way there'd be no mucking about with multiple references to the same underlying list object and we can just use .append() instead of list addition (and then also avoid creating and discarding a bunch of new list objects).

@QuiteAFoxtrot
Copy link
Collaborator

QuiteAFoxtrot commented Jan 31, 2026

I also think this would be a very appropriate time to address the saddle point uniqueness bug, and possibly this zone adjacency bug that you found, though I'm not sure I have a full grasp on them just yet.

On the zone-adjacency side, it looks like Galaxy 5098 may be a 0-volume cell? In the second part of the output you posted under "converted to zones" it has a zone label of -1 instead of zone 152. If it is a zero-volume cell, maybe we should add a zero-volume filter pass before the main cell linking to ensure that these galaxies don't end up in a zone in either direction? I know we do currently check for zero-volume cells to add the -1 label during the zone creation loop but maybe we missed a detail. Might be worth asking Mark about - my gut tells me that a zero-volume cell is degenerate and should be excluded but maybe its still important to know something about which zone(s) it is neighbor to later in the code (or for visualization?) Edit: - see next comment in this thread

On the saddle point bug, is the idea that there might be a breakpoint with a value of say, 150, and that there are multiple zones which are completely spatially separate? With a goofy 1-D diagram:
|---zone1---|150|--zone2---|200|-----zone3----|250|----zone4---|150|---zone5---|
So at breakpoint 150, zones 1-2 should merge, and 4-5 should merge, but realistically zones1-2 could have merged long ago due to some other neighbor, while maybe 4-5 dont merge until the 150 breakpoint, but because of this bug they get merged at the wrong time or something?

@QuiteAFoxtrot
Copy link
Collaborator

QuiteAFoxtrot commented Jan 31, 2026

I was chatting with Kelly, and I think we might need to do a more robust job in the Tesselation of tracking which galaxies are edge cells (aka possibly feeding into this adjacency issue). Right now, we create an output_volumes array initialized to 0 for every cell/galaxy, and then in parallel we calculate the cell volumes and feed this into the output_volumes and so anything that ends up with a 0 is by omission of actually calculating the volume - aka it failed the r_min r_max or the verticies_in_mask checks. I think we should probably add an array alongside output_volumes to keep track of these guys separately, so we can be more clear about what is an edge cell versus what could be degeneracy coming out of the calculate_region_volume function, which could possibly yield an actual 0-volume cell if there were two galaxies perfectly on top of each other? Maybe an explicit array named edge_cells or edge_cell_flags or something like that.

@hbrincon
Copy link
Contributor Author

Taking a look at your updates - just some notes:

~line 1496 in classes.py:

...

I've implemented the suggested changes here in zonelink_redesign_3 (I also caught a small bug in the galzone_worker function that I'd introduced while developing zonelink_redesign_3, now I've fixed it and brought the functionality into agreement with the single-threaded version)

@hbrincon
Copy link
Contributor Author

hbrincon commented Jan 31, 2026

Good catch on galaxy 5098 appearing with a different zone ID label!

I checked, and galaxy 5098 has a cell volume of 62.31. Galaxy 11721 has a cell volume of 79.82. So it isn't either of them having 0-volume cells that's causing this bug. Something else is going on.

For the saddle point bug, yes that 1D diagram illustrates what the problem is. To my understanding the algorithm would treat zones 1, 2, 3, and 4 as if they all meet at a single saddle point, and it will combine them all into one "contiguous" void during that iteration, despite that fact that the two pairs of zones could be, for example, at completely opposite corners of a survey (edit: this isn't a problem, since the zones are only connected into a single void in the program loop if they flow into one another). I'll set a up a demo script to test if the bug works the way I believe it does, or if it's not actually an issue.

I'll look into adding an edge_cells array into the zoneline_redesign_3 branch

@hbrincon
Copy link
Contributor Author

hbrincon commented Feb 1, 2026

So here is a slightly modified version of the 1-D diagram zones. The equal height "saddle point" peaks are the two on the far left and right with circles drawn over them. The zone IDs are labeled below each zone. Here zone 0 is deeper than any other zone.
20260131_230008

The voids_list_example script's voids output for this configuration is

[[[2]], [[1]], [[3], [2]], [[4]], [[0], [1, 2, 3, 4]]]

I'm not sure if this is what we want the output to be, or if we rather want it to be

[[[2]], [[1]], [[3], [2]], [[0], [1, 2, 3], [4]]]

In the later case, the two saddle points would be treated as separate iterations in the program loop, and the deepest zones accessible from each of them (0 and 4) would be ignored until future iterations in the loop. But with the way the loop works right now, only zone 0 is ignored. I'm not actually sure what is the desired behavior here. Perhaps the code is ok as is, it depends on how ZOBOV and VIDE handle the output.

In the case of ZOBOV, the final void output from the first version (how the code currently works) is

[[2], [1], [3, 2], [4], [0, 1, 2, 3, 4]]

and the output from the modified version is

[[2], [1], [3, 2], [0, 1, 2, 3, 4]]

Recalling the definition for the ZOBOV voids

the output has one entry for every zone, and that entry is the collection of zones that are joined by pouring water into the zone before water leaks into a deeper zone

So the first (current) version appears to be correct. Perhaps then there is no bug after all!

I'm not sure about the VIDE version yet, since I have yet to decipher the logic in the VIDE pruning code (shouldn't be too bad, I just haven't gotten to it yet). But it looks like the code might be ok as is

@hbrincon
Copy link
Contributor Author

hbrincon commented Feb 6, 2026

I've updated zonelink_redesign_3 to use the edge_cells array we discussed. I'm not sure if it really makes a difference to use it instead of checking for zero volume cells (if two galaxies were on top of each other as you suggested, I'd think that multivoro would throw an error, but I haven't checked).

I also discovered why galaxy 5098 appeared with two differnet zone labels. It's because every galaxy's zone label was being set to -1 before they were processed, so if a galaxy's neighbor hadn't been processed yet, the galaxy's zone was being fictitiously linked to the -1 zone. I've changed the behavior so that this no longer happens by initializing the galaxy zones IDs to -2 (and handling neighbors with zone_ID = -2 appropriately) before updating their zones to -1 or a number >=0 when they are processed. Coincidentally, this also removes the redundant "double calculation" of the cell triangle areas that was happening before, so this change will save time.

I think we're about ready to merge zonelink_redesign_3 into main, but let me know if you have other things you'd like to do with the branch first. It would still be nice if we could parallelize* the zone-building if possible, but that would be more involved and might be best to start on another branch. Let me know what you think @QuiteAFoxtrot

*It may not be possible to parallelize the main loop, since the order in which the galaxies are iterated is intentionally tied to their cell volumes, but it may still be possible to parallelize the time-intensive triangle calculations by splitting them off into another process. Nothing else in the loop is dependent on the information calculated in the visualization code, so it may be possible to isolate it.

@QuiteAFoxtrot
Copy link
Collaborator

QuiteAFoxtrot commented Feb 8, 2026 via email

Hernan Benjamin Rincon and others added 2 commits February 8, 2026 11:26
Merge changes from `zonelink_redesign_3` into `zonelink_redesign`
@hbrincon
Copy link
Contributor Author

hbrincon commented Feb 9, 2026

I've merged zonelink_redesign_3 into this branch. The remaining changes for this branch are to update the unit tests with the new data types and to update the version info and change log. I'll request reviewers for the PR once I've implemented theses updates

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants