Enhance GUI and benchmarking tools, add WPR kernels for the post path, and improve the scheduling curve #106
Conversation
The float-to-bool conversion of the mv operator will proceed after thorough testing
- Implemented `COBA_2005_binary_fcnmv_boundary_CsvOuput.py` for benchmarking post- and pre-synaptic connection updates using JAX.
- Created `boundary_dis.py`, a GUI application to visualize performance boundaries and speedup analysis with interactive features.
- Developed `dev_COBA_binary_fcnmv.py` for benchmarking across various connection probabilities and counts, enhancing simulation capabilities.
- Added error handling and CSV recording to capture benchmarking results effectively.
Reviewer's Guide

Adds warp-per-row (WPR) CUDA kernels and an auto-dispatch path for binary fcnmv scatter, updates the Python FFI/kernel selection to use boolean spikes consistently, and enhances the benchmarking/GUI tooling for exploring performance boundaries and backend comparisons.

Sequence diagram for the binary fcnmv scatter path with WPR/TPR auto-dispatch

sequenceDiagram
participant Caller as python_op
participant KernelSel as _binary_fcnmv_cuda_kernel
participant JAX as jax_ffi
participant FFIHomo as binary_fcnmv_scatter_homo
participant FFIHetero as binary_fcnmv_scatter_hetero
participant WPRKern as _bs_wpr_*_kern
participant TPRKern as _bs_tpr_*_kern
Caller->>KernelSel: call(transpose=True, mode, spikes, weights, indices)
KernelSel->>KernelSel: build kernel_name using suffix _bool
KernelSel->>KernelSel: spikes = u.math.asarray(spikes, dtype=bool)
KernelSel->>JAX: jax.ffi.ffi_call(kernel_name, out_info)
JAX-->>Caller: returns callable kernel(weights, indices, spikes_bool)
Caller->>JAX: kernel(weights, indices, spikes_bool)
JAX->>FFIHomo: binary_fcnmv_scatter_homo(weights, indices, spikes_bool, output, stream)
JAX->>FFIHetero: binary_fcnmv_scatter_hetero(weights, indices, spikes_bool, output, stream)
activate FFIHomo
FFIHomo->>FFIHomo: read n_pre, n_conn, n_post from tensors
FFIHomo->>FFIHomo: cudaMemsetAsync(output)
FFIHomo->>FFIHomo: if n_pre == 0 return
FFIHomo->>FFIHomo: compute bsz = 256
alt n_conn * 2084000 > n_pre * 1539
FFIHomo->>FFIHomo: warps_per_block = bsz / 32
FFIHomo->>FFIHomo: n_blocks_wpr = ceil(n_pre / warps_per_block)
FFIHomo->>WPRKern: launch _bs_wpr_homo_kern<<<n_blocks_wpr, 256>>>
else
FFIHomo->>FFIHomo: n_blocks_tpr = ceil(n_pre / bsz)
FFIHomo->>TPRKern: launch _bs_tpr_homo_kern<<<n_blocks_tpr, 256>>>
end
deactivate FFIHomo
activate FFIHetero
FFIHetero->>FFIHetero: same dispatch logic for hetero weights
alt n_conn * 2084000 > n_pre * 1539
FFIHetero->>WPRKern: launch _bs_wpr_hetero_kern
else
FFIHetero->>TPRKern: launch _bs_tpr_hetero_kern
end
deactivate FFIHetero
WPRKern-->>JAX: complete scatter accumulation
TPRKern-->>JAX: complete scatter accumulation
JAX-->>Caller: updated output tensor
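The dispatch decision in the diagram can be mirrored in plain Python to make the crossover easy to probe. This is a sketch, not the actual FFI code: the constants 2084000 and 1539 are taken from the PR's inlined threshold, and the function name is illustrative.

```python
def launch_plan(n_pre: int, n_conn: int, bsz: int = 256):
    """Return (kernel_kind, n_blocks) the way the FFI entry points choose them."""
    # Mirror of the inlined CUDA predicate:
    #   (int64_t)n_conn * 2084000 > (int64_t)n_pre * 1539
    # Python ints are arbitrary precision, so no overflow-widening cast is needed.
    if n_conn * 2084000 > n_pre * 1539:          # WPR wins at high fan-out per row
        warps_per_block = bsz // 32              # 8 warps in a 256-thread block
        n_blocks = -(-n_pre // warps_per_block)  # ceil(n_pre / warps_per_block)
        return "wpr", n_blocks
    n_blocks = -(-n_pre // bsz)                  # ceil(n_pre / bsz)
    return "tpr", n_blocks
```

For example, 1000 rows with 1000 connections each lands on the WPR path, while a million rows with 10 connections each stays on TPR.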
Sequence diagram for CSV_record tagging and row enrichment

sequenceDiagram
actor Dev as benchmark_script
participant BT as BenchmarkToolsModule
participant CSV as CSV_record
Dev->>BT: csv_recorder = CSV_record(...)
Dev->>CSV: add_tag("warp_or_thread", "tpr")
CSV->>CSV: _tags["warp_or_thread"] = "tpr"
CSV->>CSV: ensure fieldnames contains tag key
loop for each measurement
Dev->>CSV: add_row(raw_row)
CSV->>CSV: merged_row = dict(_tags)
CSV->>CSV: merged_row.update(raw_row)
CSV->>CSV: extend fieldnames with keys from merged_row
CSV->>CSV: rows.append(merged_row)
end
Dev->>CSV: record_finish(run_tag)
CSV->>CSV: _write_csv(file_name, rows, fieldnames, mode)
CSV-->>Dev: CSV file with tag columns merged into all rows
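The tag/row merging above can be sketched in a few lines. The class name here is illustrative; the real implementation is `CSV_record` in `BenchmarkTools.py`.

```python
class TagRecorder:
    """Minimal model of CSV_record's tagging: tags become columns that are
    merged into every subsequently added row."""

    def __init__(self):
        self._tags, self.fieldnames, self.rows = {}, [], []

    def add_tag(self, name, value):
        self._tags[name] = value
        if name not in self.fieldnames:      # ensure fieldnames contains the tag key
            self.fieldnames.append(name)

    def add_row(self, raw_row):
        merged = dict(self._tags)            # tags first ...
        merged.update(raw_row)               # ... so measurement values win on a clash
        for key in merged:                   # extend fieldnames with any new keys
            if key not in self.fieldnames:
                self.fieldnames.append(key)
        self.rows.append(merged)
```

Because the tag dict is copied first and the raw row is applied on top, a tag such as `warp_or_thread` appears in every row without the benchmark loop having to repeat it.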
Class diagram for the CSV_record benchmarking helper and the PerformanceBoundaryApp GUI

classDiagram
class CSV_record {
+str name
+str suffix
+Path output_dir
+bool append
+list fieldnames
+list rows
+dict _tags
+CSV_record(name, operator, suite, duration, conn)
+_write_csv(file_name, rows, fieldnames, mode)
+add_tag(tag_name, tag_value)
+add_row(row)
+single_COBA_data_add(operator, data_type, backend, mode, conn_num, scale, elapsed, rate, duration, homo)
+print_header(operator, data_type, backend, mode, conn_num, duration, homo, prob)
+print_table_header(show_conn)
+print_row(scale, neuron_count, elapsed, rate, conn_num)
+record_finish(tag)
}
class BenchmarkToolsModule {
+generate_params(dis_type, _N, limit_gb, target_samples, data_size, scale_max, conn_max) list
+memory_limit(conn_nums, scale, _N, limit, data_type) bool
}
class PerformanceBoundaryApp {
-Tk root
-DataFrame df
-dict comboboxes
-Notebook notebook
+PerformanceBoundaryApp(root)
+load_data()
+update_plots()
+export_image()
+_setup_ui()
+_on_compare_field_changed(event)
+_rebuild_filter_row()
+_on_auto_speedup_changed()
+_subtitle(include_baseline) str
+_render_scatter(tab, x, y, z, _N, limit_gb, x_min, x_max, y_max, z_label, cmap_name, subtitle)
+_render_interpolation(tab, x, y, z, _N, limit_gb, x_min, x_max, y_max, z_label, cmap_name, subtitle)
+_render_speedup_interp(tab, grid_x, grid_y, grid_z_masked, _N, limit_gb, x_min, x_max, y_max, subtitle)
+_draw_boundaries(ax, _N, limit_gb, x_min, x_max)
+_draw_contours_and_labels(ax, grid_x, grid_y, grid_z_masked, z_pts)
+_draw_custom_contours(ax, grid_x, grid_y, grid_z_masked)
}
class COBABenchmarkScripts {
+benchmark_post_conn(conn_num, conn_prob, data_type, duration, homo, backend, probs_or_conn, _N)
+benchmark_pre_conn(conn_num, conn_prob, data_type, duration, homo, backend, probs_or_conn, _N)
}
BenchmarkToolsModule --> CSV_record : uses
COBABenchmarkScripts --> CSV_record : instantiates
COBABenchmarkScripts --> BenchmarkToolsModule : calls
PerformanceBoundaryApp --> BenchmarkToolsModule : consumes CSV output
Hey - I've found 4 issues, and left some high level feedback:
- The WPR/TPR auto-dispatch threshold for scatter (`(int64_t)n_conn * 2084000 > (int64_t)n_pre * 1539`) is currently inlined as magic numbers inside the FFI macros; consider extracting this into a small helper or named constants with a short comment (e.g. referring to the fit from `boundary_dis.py`) so it’s easier to adjust and reason about across architectures.
- In `_binary_fcnmv_cuda_kernel` the non-TPR path now always targets `_bool` kernels and casts `spikes` to boolean, but there is still leftover code computing `spk_f` in the original dtype that is immediately overwritten; it would be clearer to remove the dead path and centralize the boolean casting in one place so the semantics of `spikes` for both gather and scatter are unambiguous.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The WPR/TPR auto-dispatch threshold for scatter (`(int64_t)n_conn * 2084000 > (int64_t)n_pre * 1539`) is currently inlined as magic numbers inside the FFI macros; consider extracting this into a small helper or named constants with a short comment (e.g. referring to the fit from `boundary_dis.py`) so it’s easier to adjust and reason about across architectures.
- In `_binary_fcnmv_cuda_kernel` the non-TPR path now always targets `_bool` kernels and casts `spikes` to boolean, but there is still leftover code computing `spk_f` in the original dtype that is immediately overwritten; it would be clearer to remove the dead path and centralize the boolean casting in one place so the semantics of `spikes` for both gather and scatter are unambiguous.
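One way to realize the first suggestion is shown below. The names are hypothetical and the real code lives in C++ FFI macros; Python is used here only for brevity.

```python
# Crossover constants from this PR's inlined check. The fit itself comes from
# the boundary_dis.py analysis and may shift on other GPU architectures.
WPR_CONN_COEFF = 2_084_000
WPR_PRE_COEFF = 1_539


def prefer_wpr(n_pre: int, n_conn: int) -> bool:
    """Single, commented home for the WPR/TPR dispatch predicate,
    instead of repeating the magic numbers in each FFI macro."""
    return n_conn * WPR_CONN_COEFF > n_pre * WPR_PRE_COEFF
```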
## Individual Comments
### Comment 1
<location path="brainevent/_fcn/binary_fcnmv.cu" line_range="184" />
<code_context>
+ if (row >= n_pre) return; \
+ if (!IS_ACTIVE(__ldg(&spikes[row]))) return; \
+ const int32_t* i_row = indices + (size_t)row * n_conn; \
+ ACC_T w0 = READ_W(weights[0]); \
+ for (int k = lane; k < n_conn; k += 32) { \
+ int idx = __ldg(&i_row[k]); \
</code_context>
<issue_to_address>
**issue (bug_risk):** Homogeneous WPR kernel uses weights[0] for all rows instead of weights[row].
`w0` is loaded as `READ_W(weights[0])`, so all rows share the same weight instead of using their own. Since `row = tid / 32`, this should likely be `READ_W(weights[row])` to keep per-row homogeneous weights; otherwise rows that need different weights will all use `weights[0]`, corrupting the results.
</issue_to_address>
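A toy NumPy model makes the effect of the suspected bug concrete. The helper is hypothetical; `per_row=False` reproduces the `weights[0]` behaviour the reviewer flags, while `per_row=True` models the proposed `weights[row]` fix.

```python
import numpy as np


def scatter_homo(weights, indices, spikes, n_post, per_row=True):
    out = np.zeros(n_post)
    for row, active in enumerate(spikes):
        if not active:                                 # inactive rows contribute nothing
            continue
        w = weights[row] if per_row else weights[0]    # the flagged line
        for idx in indices[row]:
            out[idx] += w                              # models the warp's atomicAdd
    return out
```

With weights `[2.0, 3.0]` the two variants diverge as soon as row 1 fires, which is the corruption described above.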
### Comment 2
<location path="dev/fcn/dev_COBA_binary_fcnmv.py" line_range="103" />
<code_context>
+ t1 = time.time()
+ elapsed = t1 - t0
+ csv_recorder.print_row(s, n, elapsed, float(rate))
+ csv_recorder.single_COBA_data_add('fcnmv', data_type, back, 'post', cn, s, elapsed, float(rate), duration, homo=('homo' if homo else 'hetero'))
+
+ except Exception as e:
</code_context>
<issue_to_address>
**issue (bug_risk):** NameError risk: `cn` is used in the prob-based branch where only `actual_conn_num` is defined.
In the `probs_or_conn != 'conn'` branch of `benchmark_post_conn`, within the `for s in scales:` loop, `single_COBA_data_add` is called with `cn`, which is not defined there. This will raise a `NameError` at runtime; it should likely use `actual_conn_num` instead.
</issue_to_address>
### Comment 3
<location path="brainevent/_fcn/binary.py" line_range="304" />
<code_context>
else:
spk_f = (spikes > 0).astype(weights.dtype)
+ spk_f = u.math.asarray(spikes, dtype=bool)
if transpose:
</code_context>
<issue_to_address>
**question (bug_risk):** Overwriting `spk_f` with a bool array may break expected numeric behavior in the non-CUDA path.
In the CPU path, `spk_f` is first computed with the correct numeric dtype (`weights.dtype`), then immediately replaced by a boolean array via `u.math.asarray(spikes, dtype=bool)`. This negates the prior dtype handling and may change semantics (e.g., when multiplying by `weights` or relying on numeric values) or add implicit casts. If the CPU path should operate on booleans, the earlier dtype-specific logic can be removed; otherwise, this new assignment should be removed or use a separate variable instead of overwriting `spk_f`.
</issue_to_address>
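The semantic difference the reviewer asks about is easy to pin down: `asarray(spikes, dtype=bool)` maps any nonzero value, including negatives, to `True`, whereas the old `(spikes > 0)` path does not. A sketch of one centralized conversion that preserves the original thresholding, using NumPy in place of `u.math` and a hypothetical helper name:

```python
import numpy as np


def to_bool_spikes(spikes):
    """Single conversion point shared by the gather and scatter paths.
    Keeps the old float path's `> 0` semantics rather than raw truthiness."""
    return np.asarray(spikes) > 0
```

On `[-1.0, 0.0, 2.0]`, `np.asarray(..., dtype=bool)` yields `[True, False, True]` but `to_bool_spikes` yields `[False, False, True]`, so whichever semantics are intended should be chosen deliberately in one place.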
### Comment 4
<location path="dev/fcn/boundary_dis.py" line_range="232" />
<code_context>
+ parts.append(f"Baseline [{cmp_field}]: {self.combo_baseline.get()}")
+ return " │ ".join(parts)
+
+ def update_plots(self):
+ if self.df is None or self.df.empty:
+ return
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the repeated grid construction, axis styling, hover handling, speedup bounds, and filter-mask logic into small helper methods to shorten the main plotting methods and make the code easier to follow and test.
One targeted way to reduce complexity here without changing behavior is to extract some of the duplicated logic into small helpers. That keeps `PerformanceBoundaryApp` intact but makes the core methods shorter and easier to follow.
### 1. Centralize grid + validity computation
`update_plots` and `_render_interpolation` both build `grid_x/y`, run `griddata`, and apply the `_N` / `limit_gb` constraints. You can encapsulate that in a helper:
```python
def _make_valid_grid(self, x, y, z, _N, limit_gb, x_max, y_max):
grid_x, grid_y = np.mgrid[0 : x_max * 1.1 : 200j, 0 : y_max * 1.2 : 200j]
grid_z = griddata((x, y), z, (grid_x, grid_y), method='linear')
limit_bytes = limit_gb * (1024 ** 3)
valid = (grid_y <= grid_x * _N) & (grid_y * grid_x * 8 * _N <= limit_bytes)
grid_z_masked = np.where(valid, grid_z, np.nan)
return grid_x, grid_y, grid_z_masked
```
Usage:
```python
# in _render_interpolation
grid_x, grid_y, grid_z_masked = self._make_valid_grid(x, y, z, _N, limit_gb, x_max, y_max)
im = ax.pcolormesh(grid_x, grid_y, grid_z_masked, shading='auto', cmap=cmap_name)
...
# in update_plots speedup section (after computing grid_sp)
grid_x, grid_y, grid_sp_masked = self._make_valid_grid(
xt, yt, grid_sp, _N, limit_gb, x_max, y_max
)
self._render_speedup_interp(self.tabs["Speedup Interp"], grid_x, grid_y, grid_sp_masked,
subtitle=sub_sp, **kw_common)
```
This removes the duplicated validity logic and ensures boundary conditions stay consistent.
### 2. Unify common axis styling
`_render_scatter`, `_render_interpolation`, and `_render_speedup_interp` repeat axis labels, limits, grid, and title construction.
Extract a simple `_style_axes` helper:
```python
def _style_axes(self, ax, title, subtitle, _N, x_min, x_max, y_max):
ax.set_title(f"{title}\n{subtitle}", fontsize=10, loc='left', pad=8)
ax.set_xlabel(f"Scale (N = {_N:,} elements per scale unit)")
ax.set_ylabel("Connection Number (synapses / neuron)")
ax.set_ylim(max(y_max * 1.2, 100), 0)
ax.set_xlim(0, x_max * 1.1)
ax.grid(True, linestyle='--', alpha=0.3)
```
Then in each render method:
```python
# _render_scatter
self._draw_boundaries(ax, _N, limit_gb, x_min, x_max)
self._style_axes(ax, title, subtitle, _N, x_min, x_max, y_max)
# _render_interpolation
self._draw_boundaries(ax, _N, limit_gb, x_min, x_max)
self._style_axes(ax, title, subtitle, _N, x_min, x_max, y_max)
# _render_speedup_interp
self._draw_boundaries(ax, _N, limit_gb, x_min, x_max)
self._style_axes(ax, title, subtitle, _N, x_min, x_max, y_max)
```
If you want the scatter grid to be a bit darker (`alpha=0.5`) you can keep that as a parameter:
```python
def _style_axes(self, ax, title, subtitle, _N, x_min, x_max, y_max, grid_alpha=0.3):
...
ax.grid(True, linestyle='--', alpha=grid_alpha)
```
### 3. Generalize hover setup
`_setup_scatter_hover` and `_setup_interp_hover` are structurally identical except for how the value is resolved. You can factor them into a single `_setup_hover` and pass in a resolver:
```python
def _setup_hover(self, tab, z_label, resolver):
canvas, ax = tab['canvas'], tab['ax']
annot = ax.annotate(
"", xy=(0, 0), xytext=(15, 15), textcoords="offset points",
bbox=dict(boxstyle="round,pad=0.4", fc="lightyellow", ec="gray", alpha=0.92),
fontsize=8, visible=False, zorder=10,
)
def on_hover(event):
if event.inaxes != ax or event.xdata is None or event.ydata is None:
if annot.get_visible():
annot.set_visible(False)
canvas.draw_idle()
return
resolved = resolver(event)
if resolved is None:
if annot.get_visible():
annot.set_visible(False)
else:
x, y, val, outside = resolved
annot.xy = (x, y)
if outside:
annot.set_text(
f"Scale: {x:.4g}\n"
f"Conn: {y:.0f}\n"
"[outside valid domain]"
)
else:
annot.set_text(
f"Scale: {x:.4g}\n"
f"Conn: {y:.0f}\n"
f"{z_label}: {val:.4g}"
)
annot.set_visible(True)
canvas.draw_idle()
tab['hover_cid'] = canvas.mpl_connect("motion_notify_event", on_hover)
```
Resolvers:
```python
def _setup_scatter_hover(self, tab, x, y, z, z_label):
sc = tab['scatter']
xc, yc, zc = x.copy(), y.copy(), z.copy()
def resolver(event):
cont, ind = sc.contains(event)
if not cont:
return None
i = ind["ind"][0]
return xc[i], yc[i], zc[i], False
self._setup_hover(tab, z_label, resolver)
def _setup_interp_hover(self, tab, grid_x, grid_y, grid_z, z_label):
gx_col = grid_x[:, 0].copy()
gy_row = grid_y[0, :].copy()
gz = grid_z.copy()
def resolver(event):
ix = int(np.argmin(np.abs(gx_col - event.xdata)))
iy = int(np.argmin(np.abs(gy_row - event.ydata)))
val = gz[ix, iy]
outside = np.isnan(val)
return event.xdata, event.ydata, val if not outside else np.nan, outside
self._setup_hover(tab, z_label, resolver)
```
This keeps behavior but consolidates the shared annotation / event wiring code.
### 4. Extract speedup color bounds logic
`_render_speedup_interp` mixes UI state reading, defaults, and constraints. Extract that into a small helper so `_render_speedup_interp` focuses on rendering:
```python
def _compute_speedup_bounds(self, z_finite):
if self.var_auto_speedup.get():
if len(z_finite) > 0:
yellow_min = min(float(z_finite.min()), 0.5)
blue_max = max(float(z_finite.max()), 1.5)
else:
yellow_min, blue_max = 0.5, 1.5
else:
try:
yellow_min = float(self.entry_yellow_depth.get())
blue_max = float(self.entry_blue_depth.get())
except ValueError:
yellow_min, blue_max = 0.5, 2.0
yellow_min = min(yellow_min, 0.9999) # must be < 1.0
blue_max = max(blue_max, 1.0001) # must be > 1.0
if yellow_min >= blue_max:
yellow_min, blue_max = 0.5, 2.0
return yellow_min, blue_max
```
Then:
```python
z_finite = grid_z_masked[np.isfinite(grid_z_masked)]
yellow_min, blue_max = self._compute_speedup_bounds(z_finite)
grid_sp_clipped = np.where(
np.isfinite(grid_z_masked),
np.clip(grid_z_masked, yellow_min, blue_max),
np.nan,
)
...
```
### 5. Decouple filter mask building from widgets
The filtering logic inside `update_plots` is simple but could be isolated to make `update_plots` shorter and more testable:
```python
def _build_filter_mask(self, df):
mask = pd.Series(True, index=df.index)
for col, cb in self.comboboxes.items():
val = cb.get()
if val:
mask &= (df[col].astype(str) == str(val))
return mask
```
Usage:
```python
df_f = self.df[self._build_filter_mask(self.df)]
```
These small extractions should significantly reduce the cognitive load of `update_plots` and the render methods, while preserving all existing behavior and keeping your current class structure.
</issue_to_address>
This pull request introduces a major performance and scalability improvement to the binary fully-connected matrix-vector (fcnmv) CUDA kernels, specifically optimizing the scatter mode for high fan-out (large n_conn) scenarios. The update adds new warp-per-row (WPR) CUDA kernels, implements an auto-dispatch mechanism to select between thread-per-row (TPR) and WPR based on problem size, and refines the Python interface accordingly. Additionally, there are minor improvements in code organization and initialization.
CUDA kernel improvements:

- Added new warp-per-row kernels (`_bs_wpr_homo_kern`, `_bs_wpr_hetero_kern`) for scatter mode, which improve performance for large n_conn by parallelizing atomicAdd operations across a warp.
- Updated the scatter entry points (`binary_fcnmv_scatter_homo`, `binary_fcnmv_scatter_hetero`) to auto-dispatch between TPR and WPR kernels based on a cubic threshold function of n_pre and n_conn, ensuring optimal kernel selection for different matrix sizes. [1] [2]

Python interface and logic:

- Refactored `_binary_fcnmv_cuda_kernel` to use the new kernel names and ensure the spikes array is always boolean for the scatter/gather kernels.

Documentation and code organization:

- Updated comments in `binary_fcnmv.cu` to explain the new TPR/WPR auto-dispatch and performance crossover, replacing outdated comments. [1] [2]
- Moved initialization of the `_tags` dictionary in `BenchmarkTools.py` (renamed from `CsvOutput.py`).

These changes significantly improve the scalability and efficiency of the binary fcnmv scatter operation, especially for large, high-fan-out neural network layers.
Summary by Sourcery
Introduce warp-per-row CUDA scatter kernels with auto-dispatch, update Python bindings to use boolean spike inputs, and extend benchmarking/GUI tooling for backend-focused analysis and performance boundary exploration.