Inquiring about the package's capabilities in specific application in Proteomics #9

eneskemalergin · 2026-03-23T05:24:52Z

Hi @marcoloco23, very cool project.

I'm a researcher working on uncertainty propagation in proteomics, where we need to track measurement variance through multiple aggregation levels (fragments → PSMs → peptides → proteins). I'm evaluating dimtensor for this use case. I have a few questions about how it handles correlation between derived tensors, independence assumptions, and performance for large-scale tensor operations, so that I can understand the scope and capabilities of this great-looking idea. I was ready to build a full-fledged solution myself, but if a tool already exists for this, why reinvent the wheel?
Would you be open to discussing whether dimtensor could support this workflow, or if I'd be better off building a purpose-built solution?

marcoloco23 · 2026-03-23T06:33:42Z

Maintainer

Hi @eneskemalergin, thanks for the kind words and for taking the time to evaluate dimtensor! Your proteomics workflow sounds like a really interesting use case.

Happy to give you an honest picture of where dimtensor stands for this kind of work:

What dimtensor offers today

Two-tier uncertainty propagation (v4.5.0+):

Analytical propagation — fast, deterministic, O(N). Uncertainty propagates automatically through arithmetic (+, -, *, /, **), reductions (sum, mean, min, max), unit conversions, indexing/slicing, and array operations (concatenate, stack, split). This tier uses first-order Taylor expansion and assumes independence between inputs (uncertainties combine in quadrature).
Monte Carlo propagation — three sampling strategies (random, Latin Hypercube, Sobol). This tier does support correlations via an explicit correlation matrix and Cholesky decomposition. It also provides convergence diagnostics and confidence intervals.
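
As a rough illustration of the difference between the two tiers, here is a plain-NumPy sketch. It does not use dimtensor's API; the values, the correlation of 0.8, and all variable names are made up for the example. For a sum of two inputs with uncertainties 1.0 and 2.0, quadrature gives sqrt(5) ≈ 2.24, while the correlated case has the closed form sqrt(8.2) ≈ 2.86, which the Monte Carlo estimate should approach.

```python
import numpy as np

# Two hypothetical measurements with 1-sigma uncertainties.
values = np.array([10.0, 12.0])
sigmas = np.array([1.0, 2.0])

# Analytical tier (independence assumed): for y = x1 + x2, first-order
# propagation combines the uncertainties in quadrature.
sigma_analytical = np.sqrt(np.sum(sigmas**2))

# Monte Carlo tier (correlated inputs): build the covariance from a
# correlation matrix, then colour standard-normal draws with its Cholesky factor.
corr = np.array([[1.0, 0.8],
                 [0.8, 1.0]])
cov = corr * np.outer(sigmas, sigmas)
chol = np.linalg.cholesky(cov)           # lower-triangular L with L @ L.T == cov

rng = np.random.default_rng(0)
z = rng.standard_normal((200_000, 2))    # independent N(0, 1) draws
samples = values + z @ chol.T            # each row is one correlated draw
sigma_mc = samples.sum(axis=1).std(ddof=1)

# Closed form for the uncertainty of a sum, used only as a reference here.
sigma_exact = np.sqrt(sigmas @ corr @ sigmas)
```

With positively correlated inputs, the quadrature result underestimates the uncertainty of the sum, which is the gap described under the limitations.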

Error budget analysis — GUM-compliant sensitivity decomposition that identifies which inputs dominate total uncertainty. Useful for pinpointing where measurement improvements would have the most impact.
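
A minimal sketch of what such a sensitivity decomposition computes, assuming independent inputs and the first-order GUM formula u²(y) = Σ (∂f/∂x_i)² u²(x_i). The function `error_budget`, the finite-difference gradient, and the example model are all hypothetical and are not dimtensor's implementation.

```python
import numpy as np

def error_budget(f, x, sigma, rel_step=1e-6):
    """Return each input's fraction of the output variance, and sigma of f(x)."""
    x = np.asarray(x, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    grad = np.empty_like(x)
    for i in range(x.size):
        h = rel_step * max(abs(x[i]), 1.0)
        up, down = x.copy(), x.copy()
        up[i] += h
        down[i] -= h
        grad[i] = (f(up) - f(down)) / (2.0 * h)   # central difference
    contributions = (grad * sigma) ** 2           # (sensitivity * uncertainty)^2
    total = contributions.sum()
    return contributions / total, np.sqrt(total)

# Hypothetical model y = a * b / c with three independent inputs.
fractions, sigma_y = error_budget(
    lambda v: v[0] * v[1] / v[2],
    x=[2.0, 3.0, 4.0],
    sigma=[0.1, 0.3, 0.1],
)
```

In this example the second input has the largest relative uncertainty (10%), so it accounts for most of the output variance.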

Limitations relevant to your workflow

A few things to flag honestly:

No covariance tracking in the analytical tier — correlations between derived quantities aren't automatically maintained. If your fragment-level measurements are correlated (e.g., shared calibration), the analytical propagation won't capture that. You'd need the Monte Carlo tier with an explicit correlation matrix.
Monte Carlo currently takes scalar DimArray inputs — so propagating through large tensor operations would require some wrapper work.
Gaussian distributions only — if your measurement errors are non-Gaussian (e.g., heavy-tailed or skewed), that's not currently modeled.
No built-in hierarchical aggregation — dimtensor doesn't have native support for multi-level rollup (fragments → PSMs → peptides → proteins). You could build this on top of the existing primitives, but it wouldn't be automatic.
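
To make the last point concrete, a multi-level rollup built from primitives might look like the plain-NumPy sketch below. It assumes summation as the aggregation rule and independence at every level (variances simply add); `rollup_sum`, the index arrays, and the numbers are hypothetical, and nothing here is dimtensor's API.

```python
import numpy as np

def rollup_sum(values, variances, group_ids, n_groups):
    """Sum values per group; variances add under the independence assumption."""
    summed = np.bincount(group_ids, weights=values, minlength=n_groups)
    summed_var = np.bincount(group_ids, weights=variances, minlength=n_groups)
    return summed, summed_var

# Five hypothetical fragments, each carrying its own variance.
frag_val = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
frag_var = np.array([0.1, 0.2, 0.1, 0.4, 0.2])

# Membership maps: fragment -> PSM index, PSM -> peptide index.
frag_to_psm = np.array([0, 0, 1, 1, 1])
psm_to_peptide = np.array([0, 0])

psm_val, psm_var = rollup_sum(frag_val, frag_var, frag_to_psm, n_groups=2)
pep_val, pep_var = rollup_sum(psm_val, psm_var, psm_to_peptide, n_groups=1)
```

Because only marginal variances are carried between levels, any correlation between fragments is lost after the first rollup.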

My honest take

dimtensor would give you correct uncertainty propagation for individual arithmetic and aggregation steps, and the Monte Carlo tier could handle correlated inputs. But for a full hierarchical proteomics pipeline with correlation tracking across aggregation levels, you'd likely need to build meaningful scaffolding on top of dimtensor — or it might make more sense as a purpose-built solution that borrows ideas from dimtensor's approach.

That said, I'd love to understand your workflow in more detail. If you could share your specific questions about correlation handling, independence assumptions, and the scale of tensors you're working with, I can give you a more concrete assessment of what would work out of the box vs. what would need extension.

Feel free to post your questions here or open separate discussions for each topic — happy to dig into the details!

eneskemalergin · 2026-03-23T06:57:24Z

Author

@marcoloco23, thanks for the breakdown.

Current proteomics pipelines aggregate measured intensities into peptide-spectrum matches (PSMs). These PSMs group into peptides, which roll up into proteins.

Search tools often report confidence scores to filter reliable signals, but they discard measurement variance during downstream aggregation or assume statistical independence between correlated inputs.

I plan to build a pipeline where variance propagates natively through every stage. Every data point will carry its own variance. The system will execute operations like slicing, broadcasting, reshaping, and matrix multiplication across hundreds of thousands of PSMs. We must prevent value and variance arrays from desynchronizing during these steps.

How does dimtensor store uncertainty internally? Do values and variances sit in separate contiguous buffers, interleave, or exist as metadata attached to the array?

I need to evaluate/understand three other areas to determine if dimtensor fits this pipeline:

Correlation tracking: Preserving shared variance across multi-stage aggregations.
Performance: Scaling matrix multiplication for large datasets.
Domain operations: Inverse-variance weighted aggregations and handling zero-variance inputs.

marcoloco23 · 2026-03-23T10:49:12Z

Maintainer

Great questions @eneskemalergin — these get right to the heart of whether dimtensor fits your pipeline. Let me answer each one.

How uncertainty is stored internally

Values and uncertainties live in separate contiguous NumPy arrays — the DimArray object has _data (the values) and _uncertainty (absolute uncertainty, same shape and dtype). They are not interleaved. This means:

Slicing, indexing, and reshaping apply the same operation to both arrays in lockstep — they cannot desynchronize.
Broadcasting follows NumPy rules on both arrays.
Memory layout is cache-friendly for element-wise operations since each buffer is contiguous.

The uncertainty array is optional (defaults to None), so arrays without uncertainty have zero overhead.
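
The lockstep idea can be sketched with a toy container. `ValueWithSigma` below is a hypothetical stand-in written for this illustration, not dimtensor's DimArray: it holds two separate NumPy buffers and routes every structural operation through one helper so that both buffers always receive the same operation.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class ValueWithSigma:
    """Toy container: values and absolute uncertainties in separate buffers."""
    data: np.ndarray
    sigma: Optional[np.ndarray] = None   # None means no uncertainty is tracked

    def _apply(self, op):
        # Apply the same structural operation to both buffers.
        sigma = None if self.sigma is None else op(self.sigma)
        return ValueWithSigma(op(self.data), sigma)

    def __getitem__(self, key):
        return self._apply(lambda a: a[key])

    def reshape(self, *shape):
        return self._apply(lambda a: a.reshape(*shape))

x = ValueWithSigma(np.arange(6.0), 0.1 * np.arange(6.0))
sub = x.reshape(2, 3)[:, 1:]             # both buffers end up with shape (2, 2)
plain = ValueWithSigma(np.arange(4.0))[1:]   # sigma stays None
```

Since there is no code path that touches one buffer without the other, the shapes cannot drift apart.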

Correlation tracking

This is the biggest gap for your use case. The analytical propagation tier does not track covariances — it stores only marginal uncertainties per element. When you add or multiply two DimArrays, it assumes independence (quadrature rule). There is no covariance matrix maintained across operations.

The Monte Carlo tier supports an explicit correlation matrix as input, but it doesn't propagate a covariance structure through a chain of operations — you provide the correlations upfront for a single function evaluation.

For a multi-stage aggregation pipeline (fragments → PSMs → peptides → proteins), you would need to either:

Carry a covariance matrix externally and feed it into each stage, or
Run the full pipeline as a single Monte Carlo evaluation (which may not be practical at your scale).

Neither is automatic today.
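
For the first option, if each stage is a linear aggregation y = A x, the covariance can be carried externally with Σ_y = A Σ_x Aᵀ. The plain-NumPy sketch below assumes sum aggregation and a shared calibration term that correlates all fragments; the matrices and numbers are made up, and none of this is dimtensor's API.

```python
import numpy as np

# Three hypothetical fragments: independent noise plus a shared calibration
# variance that appears in every entry of the covariance matrix.
noise_var = np.array([0.04, 0.09, 0.04])
calib_var = 0.05
cov_frag = np.diag(noise_var) + calib_var * np.ones((3, 3))

# Stage 1: fragments -> 2 PSMs (each row selects the fragments that are summed).
A1 = np.array([[1.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
cov_psm = A1 @ cov_frag @ A1.T

# Stage 2: PSMs -> 1 peptide.
A2 = np.array([[1.0, 1.0]])
cov_pep = A2 @ cov_psm @ A2.T

# Marginal-only propagation keeps just the diagonal at every stage.
var_pep_independent = (A2**2) @ ((A1**2) @ np.diag(cov_frag))
```

Here the full propagation gives a peptide variance of 0.62, while the marginal-only route gives 0.32: the shared calibration term is counted once per fragment instead of once per pair. The cost is that the covariance matrix grows quadratically with the number of elements, so at the scale of hundreds of thousands of PSMs a dense matrix is unlikely to be practical without exploiting block or low-rank structure.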

Performance at scale

For element-wise operations (add, multiply, divide), uncertainty propagation is O(N) with small constant overhead — it's just NumPy arithmetic on the second array. Hundreds of thousands of PSMs should be fine.

However, matmul and dot currently drop uncertainty — the propagation formula for matrix multiplication with correlated inputs requires the full covariance matrix, which we don't track. So if your pipeline relies on matrix multiplication with uncertainty, you'd hit a wall.

Domain operations

Inverse-variance weighting: Not built in. You could construct it from primitives (1/variance as weights, then weighted sum), but there's no weighted_mean(values, uncertainties) function.
Zero-variance inputs: Division by zero in relative uncertainty returns inf (handled explicitly). But inverse-variance weighting with zero variance would need special handling (infinite weight → just use that value), which isn't built in.
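
Built from primitives, an inverse-variance weighted mean with a zero-variance guard could be sketched as follows. This is plain NumPy with a hypothetical function name; averaging several exact inputs when more than one is present is an assumption made for the example.

```python
import numpy as np

def inverse_variance_mean(values, sigmas):
    """Combine independent estimates of one quantity; returns (mean, sigma)."""
    values = np.asarray(values, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)

    exact = sigmas == 0.0
    if exact.any():
        # Zero variance means infinite weight, so the exact inputs decide
        # the result and the combined uncertainty is zero.
        return float(values[exact].mean()), 0.0

    weights = 1.0 / sigmas**2
    mean = float(np.sum(weights * values) / np.sum(weights))
    sigma = float(1.0 / np.sqrt(np.sum(weights)))
    return mean, sigma

mean, sigma = inverse_variance_mean([10.0, 12.0], [1.0, 2.0])
exact_mean, exact_sigma = inverse_variance_mean([10.0, 12.0], [0.0, 2.0])
```

The guard avoids computing 1/0 and the resulting inf/inf = nan in the weighted sum.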

Bottom line

dimtensor would handle value+variance co-storage, slicing, broadcasting, and element-wise propagation reliably at scale. But the three things your pipeline specifically needs — covariance tracking through aggregation stages, uncertainty-aware matmul, and inverse-variance weighted aggregation — are not implemented today.

I think you'd be better served by a purpose-built solution for this. That said, if you're interested in contributing or co-designing a covariance-tracking extension, I'd be very open to that conversation. The internal architecture (separate contiguous buffers, _from_data_and_unit fast path) could accommodate a _covariance attribute without major refactoring.

What do you think — would a collaboration on covariance tracking be interesting, or does your timeline require a standalone solution?

marcoloco23 · 2026-03-23T10:56:30Z

Maintainer

Quick update @eneskemalergin — based on your questions, I've just added three features that address the practical gaps:

New in latest main:

Uncertainty-aware dot() and matmul() — these no longer drop uncertainty. For independent inputs:
```
σ²(C_ij) = Σ_k (B_kj² σ²(A_ik) + A_ik² σ²(B_kj))
```
This assumes independence between inputs but correctly propagates element-wise variance through the contraction.
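
In array form, the formula above is two extra matrix products on the squared operands. The sketch below is plain NumPy with a hypothetical function name, not the code that was added to dimtensor, and it takes variances (σ²) rather than uncertainties as inputs.

```python
import numpy as np

def matmul_with_variance(A, var_A, B, var_B):
    """C = A @ B with element-wise variance, assuming independent entries."""
    C = A @ B
    # var(C_ij) = sum_k ( B_kj^2 * var(A_ik) + A_ik^2 * var(B_kj) )
    var_C = var_A @ B**2 + A**2 @ var_B
    return C, var_C

A = np.array([[1.0, 2.0]])
var_A = np.array([[0.01, 0.04]])
B = np.array([[3.0], [4.0]])
var_B = np.array([[0.09], [0.16]])

C, var_C = matmul_with_variance(A, var_A, B, var_B)
```

For this 1×2 by 2×1 example the terms are 0.09 + 0.64 from the first product and 0.09 + 0.64 from the second, so the variance of the single output element is 1.46.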

weighted_mean() — inverse-variance weighted combination:

```python
from dimtensor import DimArray, units, weighted_mean

a = DimArray([10.0], units.m, uncertainty=[1.0])
b = DimArray([12.0], units.m, uncertainty=[2.0])
result = weighted_mean([a, b])  # 10.4 ± 0.894 m
```

Handles zero-variance (exact) inputs by returning them directly.

These won't solve the full covariance tracking problem for your multi-stage pipeline, but they should make the individual aggregation steps work correctly with uncertainty. Available on main now, will be in the next PyPI release.

eneskemalergin · 2026-03-23T11:05:26Z

Author

Wow, thanks for the reply and for the quick updates to dot and matmul so that they keep uncertainty.

It seems that, apart from carrying correlations over between stages, much of what I need is already in your project. I will build a quick demo application to check whether that carryover can be handled outside dimtensor. If that hurts performance or tracking too much, I will consider extending this project before deciding to create my own package.

I will let you know about the progress (will likely work on Friday/next weekend).

I appreciate the very helpful responses and your willingness to help :)
