
Conversation

@alncat
Contributor

@alncat alncat commented Apr 10, 2025

Dear PyTom developers,
I have made a small modification that enables glocalsampling to utilize multiple GPUs across multiple processes. This is achieved by creating a copy of gpuIDs for each split on each process. My own tests found that this modification greatly enhances the parallelism of the GLocal job.

Best Regards,
Zhenwei

@sroet
Collaborator

sroet commented Apr 10, 2025

Hey @alncat, thanks for the contribution!
I am a bit surprised this used to work at all, as the zip function seems to assume a single GPU.
One thing that needs to be fixed is that we don't want two MPI processes claiming the same GPU (which is a possibility in the current adaptation), so we should probably use the rank of the MPI processes to select which GPU from the gpuID list they should claim. Please let me know if you need more pointers for that implementation.

@alncat
Contributor Author

alncat commented Apr 10, 2025

@sroet, you may also check another commit, about the shift determined during alignment: I multiplied it by the binning factor.

@alncat
Contributor Author

alncat commented Apr 10, 2025

Hey @alncat, thanks for the contribution! I am a bit surprised this used to work at all, as the zip function seems to assume a single GPU. One thing that needs to be fixed is that we don't want two MPI processes claiming the same GPU (which is a possibility in the current adaptation), so we should probably use the rank of the MPI processes to select which GPU from the gpuID list they should claim. Please let me know if you need more pointers for that implementation.

I will check it tomorrow.

@alncat
Contributor Author

alncat commented Apr 11, 2025

@sroet ,

One thing that needs to be fixed is that we don't want two MPI procs claiming the same GPU (which is a possibility in the current adaptation)

I think it might be beneficial to let multiple processes run on the same GPU. The alignment or averaging process takes less than 1 GB of GPU memory, and a modern NVIDIA GPU often has at least 4 GB. So we can run at least 4 processes on a single GPU, thus maximizing performance.
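To make the arithmetic concrete, here is a minimal sketch (not part of the PR) that estimates how many ~1 GB alignment processes would fit on one GPU; it assumes cupy is available and simply queries the device, and the 1 GB figure is the empirical estimate quoted above:

```python
import cupy as cp

PER_PROC_BYTES = 1 * 1024**3  # ~1 GB per alignment process (empirical estimate from above)

# mem_info returns (free, total) in bytes for the selected device
free_bytes, total_bytes = cp.cuda.Device(0).mem_info
procs_per_gpu = max(1, free_bytes // PER_PROC_BYTES)
print(f"GPU 0: {free_bytes / 1024**3:.1f} GB free -> up to {procs_per_gpu} processes")
```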

My logic is:
splitLists is a list of lists, where each inner list contains the particles for one process; len(splitLists) is the number of processes invoked for refinement, and gpuIDs is the list of available GPUs.
We could create a process to gpu mapping by

gpu_proc_mapping = [gpuIDs[p % len(gpuIDs)] for p in range(len(splitLists))]

This should be a better way than gpuIDs*(len(splitLists)//len(gpuIDs)), which is a simple hack! I am running some alignments to see if it is all right.
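A minimal sketch of what this round-robin mapping does, with hypothetical stand-in values (in the real code splitLists comes from particleList.splitNSublists and gpuIDs from the --gpuID argument):

```python
# Every worker process p gets gpuIDs[p % len(gpuIDs)], so GPUs are assigned
# evenly even when len(splitLists) is not a multiple of len(gpuIDs).
gpuIDs = [0, 1, 2, 3]
splitLists = [f"sublist_{i}" for i in range(20)]  # stand-in for the per-process particle sublists

gpu_proc_mapping = [gpuIDs[p % len(gpuIDs)] for p in range(len(splitLists))]
print(gpu_proc_mapping)  # [0, 1, 2, 3, 0, 1, 2, 3, ...] -> 5 processes per GPU

# The earlier "simple hack" gives the same result only when the division is exact,
# and silently drops the remainder otherwise:
hack = gpuIDs * (len(splitLists) // len(gpuIDs))
assert len(hack) <= len(splitLists)
```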

@sroet
Collaborator

sroet commented Apr 11, 2025

The alignment or average process takes less than 1GB gpu memory.

That is very dependent on the size of your subtomogram, but I do agree that at least having the option to spawn more processes per GPU seems like a good idea.

So we can run at least 4 processes on a single GPU, thus maximizing performance.

That is fine; the problem is that it is now effectively random which GPU each process uses (and even how many processes try to use a single GPU), especially if one GPU is (significantly) slower than the others (because it is also processing graphics output for a workstation or something like that).

For example, if you have 4 processes (p0, p1, p2, and p3) and 2 GPUs (g0, g1), where g0 is way slower than g1:
Your list would alternate g0, g1 as GPU_IDs, which would result in the following order:

# first 4 subtomograms
p0: g0, p1:g1, p2:g0, p3:g1 => g0 has 2 procs, g1 has 2 procs

# g1 finishes its jobs; p1 and p3 grab new jobs (with GPU indices g0 and g1)
p0: g0, p1:g0, p2:g0, p3:g1 => g0 has 3 procs, g1 has 1 proc

# g1 now finishes even quicker as it has to do less context switching; p3 grabs a new job (g0) (say p0 and p2 are still not done)
p0: g0, p1:g0, p2:g0, p3:g0 => g0 has 4 procs, g1 is idle

Which is why I proposed to link the GPU index to the MPI process rank (something like gpu_IDs[MPI.COMM_WORLD.Get_rank() % len(gpu_IDs)]), which in the above example would lock p1 and p3 to g1 and p0 and p2 to g0.
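A minimal mpi4py sketch of this rank-based binding (PyTom's own MPI wrapper may expose the rank differently; this only illustrates the idea):

```python
from mpi4py import MPI

gpu_IDs = [0, 1]  # e.g. from --gpuID 0,1

rank = MPI.COMM_WORLD.Get_rank()
my_gpu = gpu_IDs[rank % len(gpu_IDs)]
# Each MPI process is now pinned to one GPU for the whole run, so a slow GPU
# cannot accumulate extra processes as jobs complete at different speeds.
print(f"rank {rank} -> GPU {my_gpu}")
```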

@alncat
Contributor Author

alncat commented Apr 14, 2025

@sroet
For 21 processes (one is reserved for the master) and 4 GPUs:
In mpi.parfor, the argument generated by the list(zip(...)) call will be converted into the following format:

[(<pytom.basic.structures.ParticleList object at 0x7fcd0df74820>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 0), (<pytom.basic.structures.ParticleList object at 0x7fcd0df670d0>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 1), (<pytom.basic.structures.ParticleList object at 0x7fcd0df67040>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 2), (<pytom.basic.structures.ParticleList object at 0x7fcd0df63a30>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 3), (<pytom.basic.structures.ParticleList object at 0x7fcd0df63760>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 0), ..., None]
The total length of this list will be 21, since we spawned 21 processes, but the last element will be None. This results from the following sequence of operations. First, the master is excluded when generating the even split lists:

splitLists = particleList.splitNSublists(splitFactor-1) # somehow better to not include master...

So the even split list has 20 entries. When it is passed into mpi.parfor, the argument generated by list(zip(...)) is padded to the true size of the MPI world, which is 21:
ddata = self._split_seq(data, self.size)

But the None element is also properly handled in mpi.parfor, which simply skips it:

if dd is None:

Otherwise, a process will get an argument like:

<pytom.basic.structures.ParticleList object at 0x7fcd0df74820>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 0

The first entry in the above tuple is the subset of the particle list that this process will work on, while the last entry is the gpuID that the process will bind to.
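A simplified stand-in for the padding behaviour described above (not the actual pytom mpi code, just its observable effect): the 20 per-process argument tuples are padded with None up to the MPI world size of 21, and a worker that receives None simply skips it.

```python
def split_seq(data, size):
    # Hypothetical illustration: pad the argument list with None to `size` entries.
    return list(data) + [None] * (size - len(data))

# (particle sublist, reference, gpuID) stand-ins for the real argument tuples
args = [(f"particle_sublist_{i}", "reference", i % 4) for i in range(20)]
ddata = split_seq(args, 21)  # length 21, last element is None

for dd in ddata:
    if dd is None:          # the padded slot is skipped, as in mpi.parfor
        continue
    plist, ref, gpu = dd    # a worker unpacks its sublist, reference and gpuID
```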

So the modification [gpuIDs[p % len(gpuIDs)] for p in range(len(splitLists))] actually works: it binds one working process to one GPU, and each process aligns its own particle list during the whole loop, but only 20 out of 21 processes will actually do the job (will be working processes). The alignment speed is limited by the slowest GPU, as you said.

My alignment command typically looks like
mpiexec -n 21 GLocalJob.py --particleList matching26s.xml --mask Refine3D/26s/mask_26sfull.mrc --numberIterations 10 --pixelSize 3.37 --particleDiameter 440 --binning 2 --destination Refine3D/26s/ --SphericalMask --angleShells 3 --angleIncrement 3.00 --jobName Refine3D/26s/glocal_input_params_align.xml --reference Refine3D/26s/ref_flipped.mrc --gpuID 0,1,2,3
where --gpuID specifies the list of GPUs.

@sroet
Collaborator

sroet commented Apr 14, 2025

Thank you for the check and the extensive write-up!

Do you mind adding/pushing the modification? That is the only thing that is left for me to approve this!

@alncat
Contributor Author

alncat commented Apr 14, 2025

@sroet, thank you for accepting this!
I would also like to elaborate on the modification regarding the shift obtained from the cross-correlation map:

# get the shift without binning
peak_shift = [ip*self.binning for ip in peak_shift]

In GLocalAlignmentPlan, the volumes are binned

self.shape = tuple(cp.asnumpy((cp.around(cp.array(shape) * (1 / self.binning), 0)).astype(cp.int))) if \

If the input volume has shape $D^3$ and the binning factor is $b$, the ccc_map will have shape $(D/b)^3$:

peak_shift = [ip - s // 2 for ip, s in zip(interpolated_peak, self.ccc_map.shape)]

The peak_shift determined from the ccc_map will then be scaled by $1/b$ as well. Hence, to get the correct shift, we should multiply the peak_shift by a factor of $b$:

peak_shift = [ip*self.binning for ip in peak_shift]
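A small worked example with illustrative numbers, mirroring the two snippets above:

```python
# With binning = 2 the ccc_map lives on the binned grid, so the peak offset
# from the map centre is in binned voxels and must be rescaled to unbinned voxels.
binning = 2
ccc_map_shape = (64, 64, 64)       # (D / b)^3 for D = 128, b = 2
interpolated_peak = (35, 30, 34)   # peak position found in the binned ccc_map

peak_shift = [ip - s // 2 for ip, s in zip(interpolated_peak, ccc_map_shape)]
print(peak_shift)                  # [3, -2, 2]  (binned voxels)

peak_shift = [ip * binning for ip in peak_shift]
print(peak_shift)                  # [6, -4, 4]  (unbinned voxels, the correct shift)
```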

correct shift when alignment is performed with binning
@alncat
Contributor Author

alncat commented Apr 15, 2025

@sroet I revised the commit and resubmitted :-). Thank you for reviewing!

Collaborator

@sroet sroet left a comment

LGTM (looks good to me!), will merge after the unit tests complete and pass.
Thanks again for the contribution

@sroet sroet merged commit 4089d5d into SBC-Utrecht:master Apr 16, 2025
1 check passed
@alncat alncat deleted the alncat branch April 17, 2025 02:04