
Enhance GUI and benchmarking tools, add wpr for post and scheduling curve has been improved #106

Merged
chaoming0625 merged 17 commits into main from dev-fcn-optimizing-3.21
Mar 30, 2026

Conversation

Collaborator

@Hepbmstl Hepbmstl commented Mar 25, 2026

This pull request introduces a major performance and scalability improvement to the binary fully-connected matrix-vector (fcnmv) CUDA kernels, specifically optimizing the scatter mode for high fan-out (large n_conn) scenarios. The update adds new warp-per-row (WPR) CUDA kernels, implements an auto-dispatch mechanism to select between thread-per-row (TPR) and WPR based on problem size, and refines the Python interface accordingly. Additionally, there are minor improvements in code organization and initialization.

CUDA kernel improvements:

  • Added new WPR (warp-per-row) CUDA kernels (_bs_wpr_homo_kern, _bs_wpr_hetero_kern) for scatter mode, which improve performance for large n_conn by parallelizing atomicAdd operations across a warp.
  • Modified the scatter FFI entry points (binary_fcnmv_scatter_homo, binary_fcnmv_scatter_hetero) to auto-dispatch between TPR and WPR kernels based on a cubic threshold function of n_pre and n_conn, ensuring optimal kernel selection for different matrix sizes.
  • Registered WPR kernel variants for all supported data types (float32, float64, float16, bfloat16) in the kernel macro instantiations.
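The dispatch logic described above can be sketched in plain Python. This is a hypothetical mirror of the FFI branch, not the actual CUDA source; the comparison constants (2084000 and 1539) and the block size of 256 are taken from the launch logic quoted in the review below, and may differ across versions:

```python
# Hypothetical Python mirror of the TPR/WPR scatter dispatch in
# binary_fcnmv.cu. Constants come from the comparison quoted in the
# review (n_conn * 2084000 > n_pre * 1539); treat them as illustrative.

def pick_scatter_kernel(n_pre: int, n_conn: int, block_size: int = 256) -> str:
    """Choose warp-per-row (WPR) vs thread-per-row (TPR) scatter launch."""
    if n_pre == 0:
        return "skip"  # FFI wrappers return early on empty input
    if n_conn * 2084000 > n_pre * 1539:
        # High fan-out: one warp cooperates on each pre-synaptic row,
        # spreading atomicAdd traffic across 32 lanes.
        warps_per_block = block_size // 32
        n_blocks = -(-n_pre // warps_per_block)  # ceiling division
        return f"wpr<<<{n_blocks},{block_size}>>>"
    # Low fan-out: one thread handles a whole row.
    n_blocks = -(-n_pre // block_size)
    return f"tpr<<<{n_blocks},{block_size}>>>"
```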

Python interface and logic:

  • Updated the Python kernel selection logic in _binary_fcnmv_cuda_kernel to use new kernel names and ensure the spikes array is always boolean for scatter/gather kernels.
  • Ensured spike input is consistently cast to boolean in the kernel call path, improving correctness and compatibility.
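The boolean-cast semantics can be illustrated without the framework. This is a simplified stand-in for the `u.math.asarray(spikes, dtype=bool)` call in the real path, shown here only to make the nonzero-to-True rule concrete:

```python
# Simplified, framework-free stand-in for the boolean cast that the
# kernel call path performs before invoking the _bool CUDA entry points:
# any nonzero spike value becomes True, zero becomes False.
def to_spike_mask(spikes):
    return [bool(s) for s in spikes]

mask = to_spike_mask([0.0, 0.5, 1.0, 2.0])
# mask == [False, True, True, True]
```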

Documentation and code organization:

  • Updated kernel documentation in binary_fcnmv.cu to explain the new TPR/WPR auto-dispatch and performance crossover, replacing outdated comments.
  • Minor code cleanups and initialization improvements in benchmarking tools, including initializing a _tags dictionary in BenchmarkTools.py (renamed from CsvOutput.py).

These changes significantly improve the scalability and efficiency of the binary fcnmv scatter operation, especially for large, high-fan-out neural network layers.

Summary by Sourcery

Introduce warp-per-row CUDA scatter kernels with auto-dispatch, update Python bindings to use boolean spike inputs, and extend benchmarking/GUI tooling for backend-focused analysis and performance boundary exploration.

New Features:

  • Add warp-per-row scatter CUDA kernels for homogeneous and heterogeneous binary FCNMV operations across all supported data types.
  • Add tooling to generate VRAM-constrained (scale, connectivity) parameter grids and run boundary-focused COBA binary FCNMV benchmarks.
  • Introduce an interactive GUI app to visualize performance boundaries and speedups from benchmark CSVs.

Enhancements:

  • Auto-select between thread-per-row and warp-per-row scatter kernels in the CUDA FFI based on problem size, including early exit for empty inputs.
  • Ensure spikes are consistently cast to boolean in CUDA and pure-Python FCNMV call paths for scatter/gather kernels.
  • Improve benchmark result recording with persistent CSV tags and schema-preserving appends, and retarget GUI controls from data type to backend comparisons.

Documentation:

  • Refresh CUDA kernel documentation to describe the new scatter kernel strategies and performance crossover behavior.

Hepbmstl added 16 commits March 12, 2026 21:51
The float-to-bool conversion of the mv operator will proceed after thorough testing
- Implemented `COBA_2005_binary_fcnmv_boundary_CsvOuput.py` for benchmarking post and pre-synaptic connection updates using JAX.
- Created `boundary_dis.py` for a GUI application to visualize performance boundaries and speedup analysis with interactive features.
- Developed `dev_COBA_binary_fcnmv.py` for benchmarking with various connection probabilities and numbers, enhancing simulation capabilities.
- Added error handling and CSV recording functionalities to capture benchmarking results effectively.
Contributor

sourcery-ai bot commented Mar 25, 2026

Reviewer's Guide

Adds warp-per-row (WPR) CUDA kernels and an auto-dispatch path for binary fcnmv scatter, updates Python FFI/kernel selection to use boolean spikes consistently, and enhances benchmarking/GUI tooling for exploring performance boundaries and backend comparisons.

Sequence diagram for binary fcnmv scatter path with WPR/TPR auto-dispatch

sequenceDiagram
    participant Caller as python_op
    participant KernelSel as _binary_fcnmv_cuda_kernel
    participant JAX as jax_ffi
    participant FFIHomo as binary_fcnmv_scatter_homo
    participant FFIHetero as binary_fcnmv_scatter_hetero
    participant WPRKern as _bs_wpr_*_kern
    participant TPRKern as _bs_tpr_*_kern

    Caller->>KernelSel: call(transpose=True, mode, spikes, weights, indices)
    KernelSel->>KernelSel: build kernel_name using suffix _bool
    KernelSel->>KernelSel: spikes = u.math.asarray(spikes, dtype=bool)
    KernelSel->>JAX: jax.ffi.ffi_call(kernel_name, out_info)
    JAX-->>Caller: returns callable kernel(weights, indices, spikes_bool)

    Caller->>JAX: kernel(weights, indices, spikes_bool)
    JAX->>FFIHomo: binary_fcnmv_scatter_homo(weights, indices, spikes_bool, output, stream)
    JAX->>FFIHetero: binary_fcnmv_scatter_hetero(weights, indices, spikes_bool, output, stream)

    activate FFIHomo
    FFIHomo->>FFIHomo: read n_pre, n_conn, n_post from tensors
    FFIHomo->>FFIHomo: cudaMemsetAsync(output)
    FFIHomo->>FFIHomo: if n_pre == 0 return
    FFIHomo->>FFIHomo: compute bsz = 256
    alt n_conn * 2084000 > n_pre * 1539
        FFIHomo->>FFIHomo: warps_per_block = bsz / 32
        FFIHomo->>FFIHomo: n_blocks_wpr = ceil(n_pre / warps_per_block)
        FFIHomo->>WPRKern: launch _bs_wpr_homo_kern<<<n_blocks_wpr, 256>>>
    else
        FFIHomo->>FFIHomo: n_blocks_tpr = ceil(n_pre / bsz)
        FFIHomo->>TPRKern: launch _bs_tpr_homo_kern<<<n_blocks_tpr, 256>>>
    end
    deactivate FFIHomo

    activate FFIHetero
    FFIHetero->>FFIHetero: same dispatch logic for hetero weights
    alt n_conn * 2084000 > n_pre * 1539
        FFIHetero->>WPRKern: launch _bs_wpr_hetero_kern
    else
        FFIHetero->>TPRKern: launch _bs_tpr_hetero_kern
    end
    deactivate FFIHetero

    WPRKern-->>JAX: complete scatter accumulation
    TPRKern-->>JAX: complete scatter accumulation
    JAX-->>Caller: updated output tensor

Sequence diagram for CSV_record tagging and row enrichment

sequenceDiagram
    actor Dev as benchmark_script
    participant BT as BenchmarkToolsModule
    participant CSV as CSV_record

    Dev->>BT: csv_recorder = CSV_record(...)
    Dev->>CSV: add_tag("warp_or_thread", "tpr")
    CSV->>CSV: _tags["warp_or_thread"] = "tpr"
    CSV->>CSV: ensure fieldnames contains tag key

    loop for each measurement
        Dev->>CSV: add_row(raw_row)
        CSV->>CSV: merged_row = dict(_tags)
        CSV->>CSV: merged_row.update(raw_row)
        CSV->>CSV: extend fieldnames with keys from merged_row
        CSV->>CSV: rows.append(merged_row)
    end

    Dev->>CSV: record_finish(run_tag)
    CSV->>CSV: _write_csv(file_name, rows, fieldnames, mode)
    CSV-->>Dev: CSV file with tag columns merged into all rows
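The tag-merging behavior in the diagram above can be sketched as follows. This is a hypothetical simplification of `CSV_record` (only `add_tag`/`add_row` from dev/fcn/BenchmarkTools.py are mirrored; the class name `TaggedRows` and its fields are illustrative):

```python
# Minimal sketch of persistent-tag merging: tags registered once are
# folded into every subsequently added row, and fieldnames grow to
# cover any new keys so the CSV schema stays consistent on append.
class TaggedRows:
    def __init__(self):
        self._tags = {}
        self.fieldnames = []
        self.rows = []

    def add_tag(self, name, value):
        self._tags[name] = value
        if name not in self.fieldnames:
            self.fieldnames.append(name)

    def add_row(self, raw_row):
        merged = dict(self._tags)   # tags first; row values may override
        merged.update(raw_row)
        for key in merged:
            if key not in self.fieldnames:
                self.fieldnames.append(key)
        self.rows.append(merged)

rec = TaggedRows()
rec.add_tag("warp_or_thread", "tpr")
rec.add_row({"scale": 4, "elapsed": 0.12})
# rec.rows[0] now carries the tag column alongside the measurement.
```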

Class diagram for CSV_record benchmarking helper and PerformanceBoundaryApp GUI

classDiagram
    class CSV_record {
        +str name
        +str suffix
        +Path output_dir
        +bool append
        +list fieldnames
        +list rows
        +dict _tags
        +CSV_record(name, operator, suite, duration, conn)
        +_write_csv(file_name, rows, fieldnames, mode)
        +add_tag(tag_name, tag_value)
        +add_row(row)
        +single_COBA_data_add(operator, data_type, backend, mode, conn_num, scale, elapsed, rate, duration, homo)
        +print_header(operator, data_type, backend, mode, conn_num, duration, homo, prob)
        +print_table_header(show_conn)
        +print_row(scale, neuron_count, elapsed, rate, conn_num)
        +record_finish(tag)
    }

    class BenchmarkToolsModule {
        +generate_params(dis_type, _N, limit_gb, target_samples, data_size, scale_max, conn_max) list
        +memory_limit(conn_nums, scale, _N, limit, data_type) bool
    }

    class PerformanceBoundaryApp {
        -Tk root
        -DataFrame df
        -dict comboboxes
        -Notebook notebook
        +PerformanceBoundaryApp(root)
        +load_data()
        +update_plots()
        +export_image()
        +_setup_ui()
        +_on_compare_field_changed(event)
        +_rebuild_filter_row()
        +_on_auto_speedup_changed()
        +_subtitle(include_baseline) str
        +_render_scatter(tab, x, y, z, _N, limit_gb, x_min, x_max, y_max, z_label, cmap_name, subtitle)
        +_render_interpolation(tab, x, y, z, _N, limit_gb, x_min, x_max, y_max, z_label, cmap_name, subtitle)
        +_render_speedup_interp(tab, grid_x, grid_y, grid_z_masked, _N, limit_gb, x_min, x_max, y_max, subtitle)
        +_draw_boundaries(ax, _N, limit_gb, x_min, x_max)
        +_draw_contours_and_labels(ax, grid_x, grid_y, grid_z_masked, z_pts)
        +_draw_custom_contours(ax, grid_x, grid_y, grid_z_masked)
    }

    class COBABenchmarkScripts {
        +benchmark_post_conn(conn_num, conn_prob, data_type, duration, homo, backend, probs_or_conn, _N)
        +benchmark_pre_conn(conn_num, conn_prob, data_type, duration, homo, backend, probs_or_conn, _N)
    }

    BenchmarkToolsModule --> CSV_record : uses
    COBABenchmarkScripts --> CSV_record : instantiates
    COBABenchmarkScripts --> BenchmarkToolsModule : calls
    PerformanceBoundaryApp --> BenchmarkToolsModule : consumes CSV output

File-Level Changes

Introduce WPR scatter CUDA kernels and auto-dispatch between TPR and WPR based on (n_pre, n_conn).
  • Add warp-per-row homo/hetero scatter kernels that distribute atomicAdd work across warp lanes.
  • Instantiate WPR kernels for all supported dtypes alongside existing TPR variants.
  • Change scatter FFI wrappers to zero outputs, guard n_pre==0, and choose WPR vs TPR via a cubic threshold function of n_pre and n_conn.
  • Update kernel documentation to describe TPR vs WPR behavior and crossover as a function of n_pre and n_conn.
brainevent/_fcn/binary_fcnmv.cu
Make the Python binary fcnmv interface always use boolean spikes and the updated kernel names.
  • Switch scatter/gather kernel_name construction to use the _bool CUDA entry points regardless of original spike dtype.
  • Ensure spikes are cast to boolean before JAX FFI calls in both the fast CUDA path and the generic path.
  • Adjust comments to reflect new auto-dispatch and boolean spike semantics.
brainevent/_fcn/binary.py
Refactor and extend benchmarking CSV utilities to support tags, schema evolution, and memory-bounded parameter generation.
  • Rename CsvOutput module to BenchmarkTools and initialize an internal _tags dict on CSV_record instances.
  • Make CSV writing robust to schema evolution by merging existing and new fieldnames when appending and using a default rest value.
  • Add tag support so persistent metadata columns are automatically merged into subsequent rows.
  • Provide helpers for generating valid (scale, conn_num) pairs under VRAM limits and a reusable memory_limit check.
dev/fcn/BenchmarkTools.py
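A VRAM feasibility check in the spirit of `memory_limit` can be sketched like this. The function name `fits_in_vram`, the 8-byte-per-entry estimate (mirroring the `grid_y * grid_x * 8 * _N <= limit_bytes` boundary used in the GUI review comments), and the parameter grid are assumptions for illustration, not the library's exact accounting:

```python
# Hedged sketch of a VRAM-bound parameter filter: estimate the bytes
# needed for a (scale, conn_num) configuration and keep only pairs
# under the budget. bytes_per_entry=8 is an assumed per-synapse cost.
def fits_in_vram(conn_num, scale, n_per_scale, limit_gb, bytes_per_entry=8):
    neurons = scale * n_per_scale
    est_bytes = neurons * conn_num * bytes_per_entry
    return est_bytes <= limit_gb * 1024 ** 3

# Keep only (scale, conn_num) pairs that fit under a 1 GiB budget.
pairs = [(s, c) for s in (1, 2, 4, 8) for c in (100, 1000, 10000)
         if fits_in_vram(c, s, n_per_scale=4000, limit_gb=1)]
```

Large-scale, high-fan-out corners of the grid drop out first, which is exactly the boundary region the benchmarks sweep.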
Update COBA benchmarking scripts to use the new BenchmarkTools utilities and revised experiment configs.
  • Replace imports of the old CsvOutput module with dev.fcn.BenchmarkTools and use its CSV_record and memory_limit utilities.
  • Adjust benchmark parameter grids (scales, conn_nums, probs), default main entrypoints, and output names for new experiments.
  • Add new development/benchmark drivers for binary fcnmv, including VRAM boundary sweeps and Nsight-focused runs that use generate_params and tags.
dev/fcn/COBA_2005_binary_fcnmv_CsvOuput.py
dev/fcn/COBA_2005_binary_fcnmm_CsvOuput.py
dev/fcn/COBA_2005_bitpack_binary_fcnmm_CsvOuput.py
dev/fcn/COBA_2005_compact_binary_fcnmm_CsvOuput.py
dev/fcn/dev_COBA_binary_fcnmv.py
dev/fcn/COBA_2005_binary_fcnmv_boundary_CsvOuput.py
Retarget the benchmarking GUI to compare backends instead of data types and align messages/filters accordingly.
  • Change default color/target selectors in latency/COBA tabs from data_type to backend.
  • Use backend values for baseline/target selection in speedup and COBA heatmaps and update corresponding error messages.
  • Remove data_type from heatmap filter skip set and ensure backend is treated as the primary comparison axis.
dev/fcn/gui.py
Add an interactive boundary and speedup visualization GUI for sparse matrix performance analysis.
  • Introduce PerformanceBoundaryApp, a Tkinter/matplotlib tool that loads CSVs and visualizes elapsed time and speedup over (scale, conn_num).
  • Support raw scatter, interpolated elapsed time, and interpolated baseline/target speedup views with VRAM/feasibility boundaries and contour lines.
  • Implement flexible filtering, comparison-field selection, hover tooltips, and export-to-image functionality.
dev/fcn/boundary_dis.py

Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 4 issues and left some high-level feedback:

  • The WPR/TPR auto-dispatch threshold for scatter ((int64_t)n_conn * 2084000 > (int64_t)n_pre * 1539) is currently inlined as magic numbers inside the FFI macros; consider extracting this into a small helper or named constants with a short comment (e.g. referring to the fit from boundary_dis.py) so it’s easier to adjust and reason about across architectures.
  • In _binary_fcnmv_cuda_kernel the non-TPR path now always targets _bool kernels and casts spikes to boolean, but there is still leftover code computing spk_f in the original dtype that is immediately overwritten; it would be clearer to remove the dead path and centralize the boolean casting in one place so the semantics of spikes for both gather and scatter are unambiguous.

## Individual Comments

### Comment 1
<location path="brainevent/_fcn/binary_fcnmv.cu" line_range="184" />
<code_context>
+    if (row >= n_pre) return;                                                                 \
+    if (!IS_ACTIVE(__ldg(&spikes[row]))) return;                                              \
+    const int32_t* i_row = indices + (size_t)row * n_conn;                                    \
+    ACC_T w0 = READ_W(weights[0]);                                                            \
+    for (int k = lane; k < n_conn; k += 32) {                                                 \
+        int idx = __ldg(&i_row[k]);                                                           \
</code_context>
<issue_to_address>
**issue (bug_risk):** Homogeneous WPR kernel uses weights[0] for all rows instead of weights[row].

`w0` is loaded as `READ_W(weights[0])`, so all rows share the same weight instead of using their own. Since `row = tid / 32`, this should likely be `READ_W(weights[row])` to keep per-row homogeneous weights; otherwise rows that need different weights will all use `weights[0]`, corrupting the results.
</issue_to_address>

### Comment 2
<location path="dev/fcn/dev_COBA_binary_fcnmv.py" line_range="103" />
<code_context>
+                        t1 = time.time()
+                        elapsed = t1 - t0
+                        csv_recorder.print_row(s, n, elapsed, float(rate))
+                        csv_recorder.single_COBA_data_add('fcnmv', data_type, back, 'post', cn, s, elapsed, float(rate), duration, homo=('homo' if homo else 'hetero'))
+
+                    except Exception as e:
</code_context>
<issue_to_address>
**issue (bug_risk):** NameError risk: `cn` is used in the prob-based branch where only `actual_conn_num` is defined.

In the `probs_or_conn != 'conn'` branch of `benchmark_post_conn`, within the `for s in scales:` loop, `single_COBA_data_add` is called with `cn`, which is not defined there. This will raise a `NameError` at runtime; it should likely use `actual_conn_num` instead.
</issue_to_address>

### Comment 3
<location path="brainevent/_fcn/binary.py" line_range="304" />
<code_context>
         else:
             spk_f = (spikes > 0).astype(weights.dtype)

+        spk_f = u.math.asarray(spikes, dtype=bool)

         if transpose:
</code_context>
<issue_to_address>
**question (bug_risk):** Overwriting `spk_f` with a bool array may break expected numeric behavior in the non-CUDA path.

In the CPU path, `spk_f` is first computed with the correct numeric dtype (`weights.dtype`), then immediately replaced by a boolean array via `u.math.asarray(spikes, dtype=bool)`. This negates the prior dtype handling and may change semantics (e.g., when multiplying by `weights` or relying on numeric values) or add implicit casts. If the CPU path should operate on booleans, the earlier dtype-specific logic can be removed; otherwise, this new assignment should be removed or use a separate variable instead of overwriting `spk_f`.
</issue_to_address>

### Comment 4
<location path="dev/fcn/boundary_dis.py" line_range="232" />
<code_context>
+            parts.append(f"Baseline [{cmp_field}]: {self.combo_baseline.get()}")
+        return "  │  ".join(parts)
+
+    def update_plots(self):
+        if self.df is None or self.df.empty:
+            return
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the repeated grid construction, axis styling, hover handling, speedup bounds, and filter-mask logic into small helper methods to shorten the main plotting methods and make the code easier to follow and test.

One targeted way to reduce complexity here without changing behavior is to extract some of the duplicated logic into small helpers. That keeps `PerformanceBoundaryApp` intact but makes the core methods shorter and easier to follow.

### 1. Centralize grid + validity computation

`update_plots` and `_render_interpolation` both build `grid_x/y`, run `griddata`, and apply the `_N` / `limit_gb` constraints. You can encapsulate that in a helper:

```python
def _make_valid_grid(self, x, y, z, _N, limit_gb, x_max, y_max):
    grid_x, grid_y = np.mgrid[0 : x_max * 1.1 : 200j, 0 : y_max * 1.2 : 200j]
    grid_z = griddata((x, y), z, (grid_x, grid_y), method='linear')

    limit_bytes = limit_gb * (1024 ** 3)
    valid = (grid_y <= grid_x * _N) & (grid_y * grid_x * 8 * _N <= limit_bytes)
    grid_z_masked = np.where(valid, grid_z, np.nan)

    return grid_x, grid_y, grid_z_masked
```

Usage:

```python
# in _render_interpolation
grid_x, grid_y, grid_z_masked = self._make_valid_grid(x, y, z, _N, limit_gb, x_max, y_max)
im = ax.pcolormesh(grid_x, grid_y, grid_z_masked, shading='auto', cmap=cmap_name)
...

# in update_plots speedup section (after computing grid_sp)
grid_x, grid_y, grid_sp_masked = self._make_valid_grid(
    xt, yt, grid_sp, _N, limit_gb, x_max, y_max
)
self._render_speedup_interp(self.tabs["Speedup Interp"], grid_x, grid_y, grid_sp_masked,
                            subtitle=sub_sp, **kw_common)
```

This removes the duplicated validity logic and ensures boundary conditions stay consistent.

### 2. Unify common axis styling

`_render_scatter`, `_render_interpolation`, and `_render_speedup_interp` repeat axis labels, limits, grid, and title construction.

Extract a simple `_style_axes` helper:

```python
def _style_axes(self, ax, title, subtitle, _N, x_min, x_max, y_max):
    ax.set_title(f"{title}\n{subtitle}", fontsize=10, loc='left', pad=8)
    ax.set_xlabel(f"Scale  (N = {_N:,} elements per scale unit)")
    ax.set_ylabel("Connection Number  (synapses / neuron)")
    ax.set_ylim(max(y_max * 1.2, 100), 0)
    ax.set_xlim(0, x_max * 1.1)
    ax.grid(True, linestyle='--', alpha=0.3)
```

Then in each render method:

```python
# _render_scatter
self._draw_boundaries(ax, _N, limit_gb, x_min, x_max)
self._style_axes(ax, title, subtitle, _N, x_min, x_max, y_max)

# _render_interpolation
self._draw_boundaries(ax, _N, limit_gb, x_min, x_max)
self._style_axes(ax, title, subtitle, _N, x_min, x_max, y_max)

# _render_speedup_interp
self._draw_boundaries(ax, _N, limit_gb, x_min, x_max)
self._style_axes(ax, title, subtitle, _N, x_min, x_max, y_max)
```

If you want the scatter grid to be a bit darker (`alpha=0.5`) you can keep that as a parameter:

```python
def _style_axes(self, ax, title, subtitle, _N, x_min, x_max, y_max, grid_alpha=0.3):
    ...
    ax.grid(True, linestyle='--', alpha=grid_alpha)
```

### 3. Generalize hover setup

`_setup_scatter_hover` and `_setup_interp_hover` are structurally identical except for how the value is resolved. You can factor them into a single `_setup_hover` and pass in a resolver:

```python
def _setup_hover(self, tab, z_label, resolver):
    canvas, ax = tab['canvas'], tab['ax']
    annot = ax.annotate(
        "", xy=(0, 0), xytext=(15, 15), textcoords="offset points",
        bbox=dict(boxstyle="round,pad=0.4", fc="lightyellow", ec="gray", alpha=0.92),
        fontsize=8, visible=False, zorder=10,
    )

    def on_hover(event):
        if event.inaxes != ax or event.xdata is None or event.ydata is None:
            if annot.get_visible():
                annot.set_visible(False)
                canvas.draw_idle()
            return
        resolved = resolver(event)
        if resolved is None:
            if annot.get_visible():
                annot.set_visible(False)
        else:
            x, y, val, outside = resolved
            annot.xy = (x, y)
            if outside:
                annot.set_text(
                    f"Scale:  {x:.4g}\n"
                    f"Conn:   {y:.0f}\n"
                    "[outside valid domain]"
                )
            else:
                annot.set_text(
                    f"Scale:  {x:.4g}\n"
                    f"Conn:   {y:.0f}\n"
                    f"{z_label}:  {val:.4g}"
                )
            annot.set_visible(True)
        canvas.draw_idle()

    tab['hover_cid'] = canvas.mpl_connect("motion_notify_event", on_hover)
```

Resolvers:

```python
def _setup_scatter_hover(self, tab, x, y, z, z_label):
    sc = tab['scatter']
    xc, yc, zc = x.copy(), y.copy(), z.copy()

    def resolver(event):
        cont, ind = sc.contains(event)
        if not cont:
            return None
        i = ind["ind"][0]
        return xc[i], yc[i], zc[i], False

    self._setup_hover(tab, z_label, resolver)

def _setup_interp_hover(self, tab, grid_x, grid_y, grid_z, z_label):
    gx_col = grid_x[:, 0].copy()
    gy_row = grid_y[0, :].copy()
    gz = grid_z.copy()

    def resolver(event):
        ix = int(np.argmin(np.abs(gx_col - event.xdata)))
        iy = int(np.argmin(np.abs(gy_row - event.ydata)))
        val = gz[ix, iy]
        outside = np.isnan(val)
        return event.xdata, event.ydata, val if not outside else np.nan, outside

    self._setup_hover(tab, z_label, resolver)
```

This keeps behavior but consolidates the shared annotation / event wiring code.

### 4. Extract speedup color bounds logic

`_render_speedup_interp` mixes UI state reading, defaults, and constraints. Extract that into a small helper so `_render_speedup_interp` focuses on rendering:

```python
def _compute_speedup_bounds(self, z_finite):
    if self.var_auto_speedup.get():
        if len(z_finite) > 0:
            yellow_min = min(float(z_finite.min()), 0.5)
            blue_max   = max(float(z_finite.max()), 1.5)
        else:
            yellow_min, blue_max = 0.5, 1.5
    else:
        try:
            yellow_min = float(self.entry_yellow_depth.get())
            blue_max   = float(self.entry_blue_depth.get())
        except ValueError:
            yellow_min, blue_max = 0.5, 2.0

    yellow_min = min(yellow_min, 0.9999)  # must be < 1.0
    blue_max   = max(blue_max,   1.0001)  # must be > 1.0
    if yellow_min >= blue_max:
        yellow_min, blue_max = 0.5, 2.0
    return yellow_min, blue_max
```

Then:

```python
z_finite = grid_z_masked[np.isfinite(grid_z_masked)]
yellow_min, blue_max = self._compute_speedup_bounds(z_finite)
grid_sp_clipped = np.where(
    np.isfinite(grid_z_masked),
    np.clip(grid_z_masked, yellow_min, blue_max),
    np.nan,
)
...
```

### 5. Decouple filter mask building from widgets

The filtering logic inside `update_plots` is simple but could be isolated to make `update_plots` shorter and more testable:

```python
def _build_filter_mask(self, df):
    mask = pd.Series(True, index=df.index)
    for col, cb in self.comboboxes.items():
        val = cb.get()
        if val:
            mask &= (df[col].astype(str) == str(val))
    return mask
```

Usage:

```python
df_f = self.df[self._build_filter_mask(self.df)]
```

These small extractions should significantly reduce the cognitive load of `update_plots` and the render methods, while preserving all existing behavior and keeping your current class structure.
</issue_to_address>


@chaoming0625 chaoming0625 merged commit d70fc10 into main Mar 30, 2026
7 checks passed
@chaoming0625 chaoming0625 deleted the dev-fcn-optimizing-3.21 branch March 30, 2026 01:56
