
Conversation

@AxelHenningsson AxelHenningsson commented Jan 24, 2026

This PR is a fully backwards-compatible upgrade aimed at heavily speeding up ImageD11.sinograms.point_by_point.compute_gve, which is a core call in various s3DXRD point-by-point mapping algorithms. The new implementation offers a 2x-8x speedup across dataset sizes.

What is added:

  • A class member function on ImageD11.transform.Ctransform named sf2gvlocal, a local variation of the existing member function sf2gv. sf2gvlocal accepts xpos, the per-peak diffraction origins along a pencil beam, which makes it fully equivalent to ImageD11.sinograms.point_by_point.compute_gve (see the usage sketch after this list).
  • A C-backend implementation supporting sf2gvlocal at compiled speed. This backend lives in cImageD11 as a function compute_xlylzl_xpos_variable, exactly like compute_xlylzl but extended to take per-peak diffraction origins.
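
A minimal usage sketch (assuming pars is a geometry parameters dictionary and sc, fc, omega, xpos are equal-length NumPy arrays; the variable names are illustrative, not from the PR):

from ImageD11 import transform

# build the transform object once from the geometry parameters
computer = transform.Ctransform(pars)

# g-vectors computed with a per-peak diffraction origin xpos along the pencil beam
gve = computer.sf2gvlocal(sc, fc, omega, xpos)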

Benefits:

Some simple tests. We start by defining some wrappers for the functions we want to compare:

# assumed setup (not shown in the PR): pks is an ImageD11 columnfile loaded earlier;
# sc, fc, omega come from its columns and xpos holds per-peak diffraction origins along the beam.
import os
import time

import numpy as np
import numba

from ImageD11 import transform, cImageD11
from ImageD11.sinograms import point_by_point as pbp

args_keys = "z_center z_size y_center y_size tilt_x tilt_y tilt_z o11 o21 o22 o12 t_x t_y t_z wedge chi wavelength distance"
kwargs = {k: float(pks.parameters.parameters[k]) for k in args_keys.split()}


# this is currently used in the pbp indexer of ImageD11.
def pbp_compute_gve(sc, fc, omega, xpos):
    return pbp.compute_gve(sc, fc, omega, xpos, **kwargs)

# this can be constant, no need to reinitialize it per-call
computer = transform.Ctransform(pks.parameters.parameters)


# This would be a lot faster imo.
def local_gve(sc, fc, omega, xpos):
    return computer.sf2gvlocal(sc, fc, omega, xpos).T


# we collect the alternatives for benchmarking purposes.
gfuncs = {
    "ImageD11.sinograms.point_by_point.compute_gve": pbp_compute_gve,
    "ImageD11.transform.Ctransform.sf2gvlocal": local_gve,
}

Some single-thread benchmarks on a large (real) dataset:

# making sure we have a single thread to make a fair comparison.
nthreads = 1
cImageD11.cimaged11_omp_set_num_threads(nthreads)
os.environ["OMP_NUM_THREADS"] = str(nthreads)
os.environ["MKL_NUM_THREADS"] = str(nthreads)
os.environ["OPENBLAS_NUM_THREADS"] = str(nthreads)
os.environ["NUMEXPR_NUM_THREADS"] = str(nthreads)
numba.set_num_threads(nthreads)
assert cImageD11.cimaged11_omp_get_max_threads() == nthreads

# and let's benchmark.
print(f"Benchmarking with {nthreads} thread(s) and {len(sc) / 1e6} MILLION peaks")
out = {}
for k in gfuncs:
    t1 = time.perf_counter()
    out[k] = gfuncs[k](sc, fc, omega, xpos)
    t2 = time.perf_counter()
    print(f"{k} : Time taken: {t2 - t1} seconds")
residuals = (
    out["ImageD11.transform.Ctransform.sf2gvlocal"]
    - out["ImageD11.sinograms.point_by_point.compute_gve"]
)
assert np.allclose(
    out["ImageD11.transform.Ctransform.sf2gvlocal"],
    out["ImageD11.sinograms.point_by_point.compute_gve"],
)  # clearly these should be exactly identical!
print(
    f"Residuals: max = {residuals.max()}, min = {residuals.min()}, mean = {residuals.mean()}, std = {residuals.std()}"
)

Output:

Benchmarking with 1 thread(s) and 10.0 MILLION peaks
ImageD11.sinograms.point_by_point.compute_gve : Time taken: 2.178051526003401 seconds
ImageD11.transform.Ctransform.sf2gvlocal : Time taken: 0.27134850100264885 seconds
Residuals: max = 1.1102230246251565e-15, min = -1.1102230246251565e-15, mean = 4.466677284231931e-18, std = 1.7402367609431068e-16

The speedup is clear. The functions produce the same output 👍

A deeper benchmark across variable dataset sizes:

# In depth benchmarking...
timings = {k: [] for k in gfuncs}
statruns_high = 20
statruns_low = 100
dtymin, dtymax = pks.dty.min(), pks.dty.max()
npks = np.logspace(1, 7, 30).astype(int)
for n in npks:
    xpos = np.random.uniform(dtymin, dtymax, n)
    sinomega, cosomega, dty, omega, sc, fc, xpos = (
        pks.sinomega[0:n].astype(float),
        pks.cosomega[0:n].astype(float),
        pks.dty[0:n].astype(float),
        pks.omega[0:n].astype(float),
        pks.sc[0:n].astype(float),
        pks.fc[0:n].astype(float),
        xpos,
    )
    gargs = [sc, fc, omega, xpos]
    statruns = statruns_high if n > 5000 else statruns_low
    for k in gfuncs:
        gfuncs[k](*gargs)  # warmup
        t1 = time.perf_counter()
        for i in range(statruns):
            gfuncs[k](*gargs)
        t2 = time.perf_counter()
        timings[k].append((t2 - t1) / statruns)

This gives us the following profile:
[Figure: measured speedup (%) of sf2gvlocal over compute_gve as a function of the number of peaks, x-axis log-scaled from 10 to 10 million peaks]

The plotting code that was used:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(
    npks,
    100
    * np.array(timings["ImageD11.sinograms.point_by_point.compute_gve"])
    / np.array(timings["ImageD11.transform.Ctransform.sf2gvlocal"]),
    label="time(compute_gve) / time(sf2gvlocal)",
    marker="o",
)
ax.set_title("Speedup of sf2gvlocal over compute_gve")
ax.legend(fontsize=18)
ax.set_xscale("log")
ax.set_ylim([0, 1000])
ax.set_yticks(range(0, 901, 100))
ax.set_xticks([10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000])
for spine in ax.spines.values():
    spine.set_visible(False)
ax.grid(True, alpha=0.3)
ax.set_xlabel("Number of peaks")
ax.set_ylabel("Speedup (%)")
plt.show()

Outlook

Now one may replace ImageD11.sinograms.point_by_point.compute_gve with a call to ImageD11.transform.Ctransform.sf2gvlocal 🙂

Apologies

I apologise if any code formatting changed unnecessarily due to on-save auto-formatting (Ruff/Black).

@jonwright
Member

Looks good. It would be good to put your benchmarking code into the test folder somewhere to keep it around. A few things to clarify:

  • Is it slower on small arrays up to ~100 peaks? I didn't see why
  • We aren't using openmp in here. I don't remember why not?
  • In both cases we are computing xlylzl, then gv. Perhaps we can merge the two functions.
  • Shouldn't we be calling this in point-by-point if it is quicker?

I am happy to merge as it is and add a 'todo'. Let me know what you think?

@AxelHenningsson
Contributor Author

Hi!

  • Yes, my bad, I accidentally put the Ctransform constructor inside the wrapper; this gives a microsecond of overhead that is visible for small arrays. Now fixed as:
# this can be constant, no need to reinitialize it per-call
computer = transform.Ctransform(pks.parameters.parameters)

# This would be a lot faster imo.
def local_gve(sc, fc, omega, xpos):
    return computer.sf2gvlocal(sc, fc, omega, xpos).T

With this fixed, the speedup makes more sense for the small arrays (see the edited post and plot), i.e. about a factor of 2x-3x for small arrays.

  • openmp? Computing the local gve is a single-threaded task imo; the parallelism should then be added over voxels. Sub-threading has been a source of pain for a long time. Do you mean that you would like columnfile.updateGeometry to run many threads? Anyway, I think that is a different use case.

  • Probably, yes! This was a minimal change to get some speed for free. This is on TODO.

  • "Shouldn't we be calling this in point-by-point if it is quicker?" - YES 🙂 I am also working on the get_voxel_idx() method, it is currently extreemely slow since it is touching all data:

@numba.njit(cache=True)
def get_voxel_idx(y0, xi0, yi0, sinomega, cosomega, dty, ystep):
    """
    get peaks at xi0, yi0
    basically just geometry.dty_values_grain_in_beam_sincos
    """

    ydist = np.abs(y0 - xi0 * sinomega - yi0 * cosomega - dty)
    idx = np.where(ydist <= ystep)[0]

    return idx, ydist

I believe this is the main bottleneck for a lot of things; once I have a similar speedup for get_voxel_idx I may (want to) look into adding these things into PBP. That gives: fast masking + fast gve computation. I have a hierarchical partition algorithm for get_voxel_idx that will probably do well (sorting the peaks to make get_voxel_idx blazing). James mentioned you had something as well for get_voxel_idx, but not yet in main?

Regardless, I am happy to merge this progress with TODOs.

@jonwright
Member

There is a bunch of code around for the get_voxel_idx issue - I will try to upload it in a separate pull request later today - just need to find it. The idea is to make a frame-sorted cf_2d, then a sinogram image of frame number, first peak and last peak. To harvest peaks, you work in sinogram space and then harvest in blocks by frame (e.g. ydist is equal for all peaks on the same frame).
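
For illustration, a minimal NumPy sketch of that frame-blocked harvest (the names frame_first, frame_last, frame_sinomega, etc. are assumptions about a possible layout, not the actual ImageD11 code):

import numpy as np

def harvest_frames(y0, xi0, yi0, frame_sinomega, frame_cosomega, frame_dty,
                   frame_first, frame_last, ystep):
    # peaks are assumed pre-sorted by frame, so frame_first[f]:frame_last[f]
    # is the block of peak indices belonging to frame f.
    # ydist is the same for every peak on a frame, so compute it once per frame
    ydist = np.abs(y0 - xi0 * frame_sinomega - yi0 * frame_cosomega - frame_dty)
    hits = np.nonzero(ydist <= ystep)[0]
    # harvest the peak index blocks of the selected frames only
    if len(hits):
        idx = np.concatenate([np.arange(frame_first[f], frame_last[f]) for f in hits])
    else:
        idx = np.empty(0, dtype=int)
    return idx, ydist[hits]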

For the threading, it depends; it can be an issue for 10^9 peaks when setting up to filter by ring number, etc. There should be some OpenMP option to use one thread for small problems and the default thread number otherwise.
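
A minimal sketch of that kind of switch, reusing the thread-count call from the benchmark above (the 10_000-peak threshold is an arbitrary placeholder):

import multiprocessing
from ImageD11 import cImageD11

def set_omp_threads_for(npeaks, small=10_000):
    # sketch only: one OpenMP thread for small problems, all cores otherwise
    if npeaks < small:
        cImageD11.cimaged11_omp_set_num_threads(1)
    else:
        cImageD11.cimaged11_omp_set_num_threads(multiprocessing.cpu_count())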

I will merge now and create an issue...

@jonwright jonwright merged commit cf89d6c into FABLE-3DXRD:master Jan 26, 2026
8 checks passed