
Conversation

@alncat
Contributor

@alncat alncat commented Apr 10, 2025

Dear PyTom developers,
I have made a small modification that enables glocalsampling to utilize multiple GPUs across multiple processes. This is achieved by creating a copy of gpuIDs for each split on each process. My own tests found that this modification greatly enhances the parallelism of the GLocal job.

Best Regards,
Zhenwei

@sroet
Collaborator

sroet commented Apr 10, 2025

Hey @alncat, thanks for the contribution!
I am a bit surprised this used to work at all, as the zip function seems to assume a single GPU.
One thing that needs to be fixed is that we don't want two MPI processes claiming the same GPU (which is a possibility in the current adaptation), so we should probably use the rank of the MPI processes to select which GPU from the gpuID list they should claim. Please let me know if you need more pointers for that implementation.

@alncat
Contributor Author

alncat commented Apr 10, 2025

@sroet, you may also check another commit, about the shift determined during alignment: I multiplied it by the binning factor.

@alncat
Contributor Author

alncat commented Apr 10, 2025

Hey @alncat, thanks for the contribution! I am a bit surprised this used to work at all, as the zip function seems to assume a single GPU. One thing that needs to be fixed is that we don't want two MPI processes claiming the same GPU (which is a possibility in the current adaptation), so we should probably use the rank of the MPI processes to select which GPU from the gpuID list they should claim. Please let me know if you need more pointers for that implementation.

I will check it tomorrow.

@alncat
Contributor Author

alncat commented Apr 11, 2025

@sroet ,

One thing that needs to be fixed is that we don't want two MPI procs claiming the same GPU (which is a possibility in the current adaptation)

I think it might be beneficial to let multiple processes run on the same GPU. The alignment or averaging process takes less than 1 GB of GPU memory, and a modern NVIDIA GPU often has at least 4 GB. So we can run at least 4 processes on a single GPU, thus maximizing performance.
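To make the arithmetic concrete, here is a minimal sketch (not part of the PR) that estimates how many ~1 GB alignment processes would fit on one GPU; it assumes cupy is available and simply queries the device, and the 1 GB figure is the empirical estimate quoted above:

```python
import cupy as cp

PER_PROC_BYTES = 1 * 1024**3  # ~1 GB per alignment process (empirical estimate from above)

# mem_info returns (free, total) in bytes for the selected device
free_bytes, total_bytes = cp.cuda.Device(0).mem_info
procs_per_gpu = max(1, free_bytes // PER_PROC_BYTES)
print(f"GPU 0: {free_bytes / 1024**3:.1f} GB free -> up to {procs_per_gpu} processes")
```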

My logic is:
splitLists is a list of lists, where each inner list contains the particles for one process; len(splitLists) is the number of processes invoked for refinement, and gpuIDs is the list of available GPUs.
We could create a process to gpu mapping by

gpu_proc_mapping = [gpuIDs[p % len(gpuIDs)] for p in range(len(splitLists))]

This should be a better way than gpuIDs*(len(splitLists)//len(gpuIDs)), which is a simple hack! I am running some alignments to see if it is all right.
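A minimal sketch of what this round-robin mapping does, with hypothetical stand-in values (in the real code splitLists comes from particleList.splitNSublists and gpuIDs from the --gpuID argument):

```python
# Every worker process p gets gpuIDs[p % len(gpuIDs)], so GPUs are assigned
# evenly even when len(splitLists) is not a multiple of len(gpuIDs).
gpuIDs = [0, 1, 2, 3]
splitLists = [f"sublist_{i}" for i in range(20)]  # stand-in for the per-process particle sublists

gpu_proc_mapping = [gpuIDs[p % len(gpuIDs)] for p in range(len(splitLists))]
print(gpu_proc_mapping)  # [0, 1, 2, 3, 0, 1, 2, 3, ...] -> 5 processes per GPU

# The earlier "simple hack" gives the same result only when the division is exact,
# and silently drops the remainder otherwise:
hack = gpuIDs * (len(splitLists) // len(gpuIDs))
assert len(hack) <= len(splitLists)
```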

@sroet
Collaborator

sroet commented Apr 11, 2025

The alignment or average process takes less than 1GB gpu memory.

That is very dependent on the size of your subtomogram, but I do agree that at least having the option to spawn more processes per GPU seems like a good idea.

So we can run at least 4 processes on a single GPU, thus maximizing performance.

That is fine; the problem is that it is now effectively random which GPU each process uses (and even how many processes try to use a single GPU), especially if one GPU is (significantly) slower than the others (because it is also processing graphics output for a workstation or something like that).

For example, if you have 4 processes (p0, p1, p2, and p3) and 2 GPUs (g0, g1), where g0 is way slower than g1:
Your list would alternate g0, g1 as GPU_IDs, which would result in the following order:

# first 4 subtomograms
p0: g0, p1:g1, p2:g0, p3:g1 => g0 has 2 procs, g1 has 2 procs

# g1 finishes its jobs; p1 and p3 grab new jobs (with GPU indices g0 and g1)
p0: g0, p1:g0, p2:g0, p3:g1 => g0 has 3 procs, g1 has 1 proc

# g1 now finishes even quicker as it has to do less context switching; p3 grabs a new job (g0) (say p0 and p2 are still not done)
p0: g0, p1:g0, p2:g0, p3:g0 => g0 has 4 procs, g1 is idle

Which is why I proposed to link the GPU index to the MPI process rank (something like gpu_IDs[MPI.COMM_WORLD.Get_rank() % len(gpu_IDs)]), which in the above example would lock p1 and p3 to g1 and p0 and p2 to g0.
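A minimal mpi4py sketch of this rank-based binding (PyTom's own MPI wrapper may expose the rank differently; this only illustrates the idea):

```python
from mpi4py import MPI

gpu_IDs = [0, 1]  # e.g. from --gpuID 0,1

rank = MPI.COMM_WORLD.Get_rank()
my_gpu = gpu_IDs[rank % len(gpu_IDs)]
# Each MPI process is now pinned to one GPU for the whole run, so a slow GPU
# cannot accumulate extra processes as jobs complete at different speeds.
print(f"rank {rank} -> GPU {my_gpu}")
```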

@alncat
Contributor Author

alncat commented Apr 14, 2025

@sroet
For 21 processes (one is reserved for the master) and 4 GPUs:
In mpi.parfor, the argument generated by the list(zip(...)) call will be converted into the following format:

[(<pytom.basic.structures.ParticleList object at 0x7fcd0df74820>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 0), (<pytom.basic.structures.ParticleList object at 0x7fcd0df670d0>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 1), (<pytom.basic.structures.ParticleList object at 0x7fcd0df67040>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 2), (<pytom.basic.structures.ParticleList object at 0x7fcd0df63a30>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 3), (<pytom.basic.structures.ParticleList object at 0x7fcd0df63760>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 0), ..., None]
The total length of this list will be 21, since we spawned 21 processes, but the last element will be None. This results from the following sequence of operations. First, the master is excluded when generating the even split lists:

splitLists = particleList.splitNSublists(splitFactor-1) # somehow better to not include master...

So the even split list has 20 entries. When it is passed into mpi.parfor, the argument generated by list(zip(...)) is padded to the true size of the MPI world, which is 21:
ddata = self._split_seq(data, self.size)

But the None element is also properly handled in mpi.parfor, which simply skips it:

if dd is None:

Otherwise, a process will get an argument like:

<pytom.basic.structures.ParticleList object at 0x7fcd0df74820>, <pytom.agnostic.structures.Reference object at 0x7fcd0df67c10>, None, 'Refine3D/26s//CurrentRotations.xml', 'Refine3D/26s//CurrentScore.xml', 'Refine3D/26s//CurrentMask.xml', <pytom.alignment.preprocessing.Preprocessing object at 0x7fcd0df774f0>, True, 2, False, 0

The first entry in the above tuple is the subset of the particle list that this process will work on, while the last entry is the gpuID that the process will bind to.
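A simplified stand-in for the padding behaviour described above (not the actual pytom mpi code, just its observable effect): the 20 per-process argument tuples are padded with None up to the MPI world size of 21, and a worker that receives None simply skips it.

```python
def split_seq(data, size):
    # Hypothetical illustration: pad the argument list with None to `size` entries.
    return list(data) + [None] * (size - len(data))

# (particle sublist, reference, gpuID) stand-ins for the real argument tuples
args = [(f"particle_sublist_{i}", "reference", i % 4) for i in range(20)]
ddata = split_seq(args, 21)  # length 21, last element is None

for dd in ddata:
    if dd is None:          # the padded slot is skipped, as in mpi.parfor
        continue
    plist, ref, gpu = dd    # a worker unpacks its sublist, reference and gpuID
```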

So the modification [gpuIDs[p % len(gpuIDs)] for p in range(len(splitLists))] actually works: it binds one working process to one GPU, and each process aligns its own particle list during the whole loop, but only 20 out of 21 processes will actually do the job (will be working processes). The alignment speed is limited by the slowest GPU, as you said.

My alignment command typically looks like
mpiexec -n 21 GLocalJob.py --particleList matching26s.xml --mask Refine3D/26s/mask_26sfull.mrc --numberIterations 10 --pixelSize 3.37 --particleDiameter 440 --binning 2 --destination Refine3D/26s/ --SphericalMask --angleShells 3 --angleIncrement 3.00 --jobName Refine3D/26s/glocal_input_params_align.xml --reference Refine3D/26s/ref_flipped.mrc --gpuID 0,1,2,3
where --gpuID specifies the list of GPUs.

@sroet
Collaborator

sroet commented Apr 14, 2025

Thank you for the check and the extensive write-up!

Do you mind adding/pushing the modification? That is the only thing that is left for me to approve this!

@alncat
Contributor Author

alncat commented Apr 14, 2025

@sroet, thank you for accepting this!
I would also like to elaborate on the modification regarding the shift obtained from the cross-correlation map:

# get the shift without binning
peak_shift = [ip*self.binning for ip in peak_shift]

In GLocalAlignmentPlan, the volumes are binned

self.shape = tuple(cp.asnumpy((cp.around(cp.array(shape) * (1 / self.binning), 0)).astype(cp.int))) if \

If the input volume has shape $D^3$ and the binning factor is $b$, the ccc_map will have shape $(D/b)^3$:

peak_shift = [ip - s // 2 for ip, s in zip(interpolated_peak, self.ccc_map.shape)]

The peak_shift determined from the ccc_map will then be scaled by $1/b$ as well. Hence, to get the correct shift, we should multiply the peak_shift by a factor of $b$:

peak_shift = [ip*self.binning for ip in peak_shift]
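A small worked example with illustrative numbers, mirroring the two snippets above:

```python
# With binning = 2 the ccc_map lives on the binned grid, so the peak offset
# from the map centre is in binned voxels and must be rescaled to unbinned voxels.
binning = 2
ccc_map_shape = (64, 64, 64)       # (D / b)^3 for D = 128, b = 2
interpolated_peak = (35, 30, 34)   # peak position found in the binned ccc_map

peak_shift = [ip - s // 2 for ip, s in zip(interpolated_peak, ccc_map_shape)]
print(peak_shift)                  # [3, -2, 2]  (binned voxels)

peak_shift = [ip * binning for ip in peak_shift]
print(peak_shift)                  # [6, -4, 4]  (unbinned voxels, the correct shift)
```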

correct shift when alignment is performed with binning
@alncat
Contributor Author

alncat commented Apr 15, 2025

@sroet I revised the commit and resubmitted :-). Thank you for reviewing!

Collaborator

@sroet sroet left a comment

LGTM (looks good to me!), will merge after the unit tests complete and pass.
Thanks again for the contribution

@sroet sroet merged commit 4089d5d into SBC-Utrecht:master Apr 16, 2025
1 check passed
@alncat alncat deleted the alncat branch April 17, 2025 02:04