
@thorstenhater commented Feb 6, 2025

Intro

This PR explores a few directions for simplifying the sampling interface from the user perspective.
It's not the only way to approach the problems and possibly not the best.

Grievances

There are several stumbling blocks from that perspective that make the use of samplers unergonomic and hard to communicate.

Metadata

Metadata is delivered without type information as std::any and must be cast; missing a const in the cast has consequences. The correct type can only be determined by knowing the metadata type associated with the probe set in the recipe. Samplers and probes aren't connected directly to each other, but via a handle that doesn't carry (or allow retrieval of) more information. There are currently four metadata types: mlocation, [mcable], point_info, and [point_info].
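
For illustration, the dispatch a sampler has to perform today looks roughly like this (a sketch assuming Arbor's any_ptr-based probe_metadata; exact names, e.g. mcable_list for [mcable], may differ from the headers):

#include <cstddef>
#include <arbor/sampling.hpp>

// Sketch of present-day metadata handling; exactly one of four casts should succeed.
void on_samples(arb::probe_metadata pm, std::size_t n, const arb::sample_record* recs) {
    if (auto* loc = arb::util::any_cast<const arb::mlocation*>(pm.meta)) {
        // scalar probe at a single location
    }
    else if (auto* cables = arb::util::any_cast<const arb::mcable_list*>(pm.meta)) {
        // whole-cell probe over a list of cables
    }
    // ... likewise for point_info and [point_info]; dropping the const
    // in any of these casts silently yields a nullptr.
}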

Scalar and Vector probes

Full cell probes generate samples with a vector payload [double] and vector metadata, while point probes generate what is essentially a vector of scalar (payload/metadata) probes.

More any

Each entry in the payload needs to be cast from any to a scalar or a vector (of doubles). Again, a missing const is fatal.
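
In code, per-record payload extraction currently looks something like the following sketch (assuming the conventional const double* and cable_sample_range payload representations; an illustration, not verbatim API):

// Inside the sampler callback: one any_cast per record.
for (std::size_t i = 0; i < n; ++i) {
    if (auto* v = arb::util::any_cast<const double*>(recs[i].data)) {
        // scalar payload: recs[i].time and *v
    }
    else if (auto* r = arb::util::any_cast<const arb::cable_sample_range*>(recs[i].data)) {
        // vector payload: a contiguous run of doubles
        for (const double* p = r->first; p != r->second; ++p) { /* ... */ }
    }
}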

Summary Effect

The user is required to distinguish four metadata types and apply the correct casts, while also applying matching casts to the payload. The information is disaggregated over simulation and recipe code, so using existing recipes comes at an extra cost. Moreover, simulations are somewhat brittle with respect to changes in the probe definitions, which require a full review of the attached samplers. Despite the obvious implication of the metadata type on the payload type, the user is required to track this information.

In Python this might seem less problematic due to dynamic typing, yet differentiating between lists and scalars leads to different code paths, which makes simulation scripts reliant on dynamic type information (isinstance and ilk).

Possible Remedies and Status

A list of ideas to reduce churn and verbosity.

  • Link metadata and payload type at the type level, reifying the 'vector metadata implies vector payload' relation. This leads to fewer unconstrained any_casts. Seemed to have too little effect to be worth implementing.
  • Always make metadata and payload a vector. That's what I decided on here; it's the most radical approach. It removes one any_cast (per entry!) completely. I also removed the differentiation between locations and cables, which I am less fond of. There's also the question whether [any=mlocation|mcable|point_info] or any=[mlocation]|... is preferable; neither really expresses what's going on, namely a homogeneous vector of unknown type.
  • Make multi-location probes produce vector samples, too. Again, included in the PR, since it seems more intuitive.
  • Allow retrieval of samples by (gid, 'tag') (aka cell_address_type). Decided against it, as it doesn't help with the (in my view) core problems; users still need to track probe <> sampler relations.
  • Change the metadata type from any to variant<...>. Didn't add this -- for now -- since it requires a single header to know all the metadata types. Although the set of possible metadata types isn't really open, with any it's possible to change one type without rebuilding all of Arbor. On the positive side, variant is more typesafe and more transparent to the user. It doesn't help with Python, though.

As implemented, C++ and Python code for sampling is now significantly less busy, see probe-demo.cpp. The backend code is also somewhat more straightforward. Some conceptual feedback is needed before continuing.

EDIT: A sensible compromise might be this (not literally, but conceptually, as these are function arguments in Arbor):

struct samples {
  std::size_t width;     // count of metadata entries and value columns
  std::size_t n_samples; // count of sample times and value rows
  std::variant<mcable*, mlocation*, point_info*> meta; // could be std::any
  double* values;        // payload data, size = n_samples * width
  double* time;          // sample times, size = n_samples
};
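
A consumer could then dispatch once per batch on the metadata alternative, e.g. via std::visit; a minimal sketch using the usual C++17 overloaded helper:

#include <variant>

template <typename... Ts> struct overloaded: Ts... { using Ts::operator()...; };
template <typename... Ts> overloaded(Ts...) -> overloaded<Ts...>;

void consume(const samples& s) {
    std::visit(overloaded{
            [&](mcable* cables)    { /* s.width cables; s.values is n_samples x width */ },
            [&](mlocation* locs)   { /* ... */ },
            [&](point_info* infos) { /* ... */ }},
        s.meta);
}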

@halfflat commented Feb 7, 2025

Hi Thorsten,

I strongly believe throwing away the genericity of the sample_record is a bad idea; not everything we have now is best represented by a vector of doubles, and certainly not everything we might conceivably represent in the future. It also prevents us from using the type system to ensure the concordance of probe types and sample consumers. What is the motivation for discarding this?

The simple sampler etc. and cable cell-specific sampling data representations can be changed if there's a need, but reducing a relatively small amount of complexity in representation through the removal of a large amount of expressive power is a bad trade-off.

@thorstenhater

Hi @halfflat,

I have been doing a lot of model building and collaboration lately, and in terms of practical end-user experience, sample extraction is powerful but annoying to actually use. This concerns cable cells almost exclusively.

There are four types of metadata (mlocation, [mcable], point_info, [point_info]) and two types of payload (double, [double]). In C++ we therefore need to either know the type of the probe or rely on RTTI in some capacity. This is made worse by the fact that the information is split over recipe and simulation. Also, locsets with multiple entries produce multiple streams, but the *_cell probes generate one stream with a vector of entries. That seems inconsistent. Performance is secondary, but paying an any_cast plus a dereference per sampled value seems expensive.

The Python interface is disjoint from the C++ API. Here we deal with tables of samples at the end of simulation::run. The difference between 'vector' and 'scalar' probes persists, requiring users to deal with both list- and scalar-valued metadata.

The way interpolation works has bitten me and others before.

Overall, my experience is that sampling is quite complex, more powerful than needed in the cases I have come across, and thus hard to teach. On the implementation side there's a lot of code dedicated to maintaining this complexity. This has led me to believe that some simplification is required, and this is my concrete plan for it:

  1. Turn all values into vectors (in C++ these become ranges; Python will always produce tables)
  2. Collapse scalar and vector probes (optionally also use the same metadata)
  3. Decide on double as the one true payload

The net effect would be:

  1. two types of meta data: [mcable], [point_info]
  2. one type of payload: [double]

thus significantly less casting in (C++) user code and only ever one table per probe to deal with in Python. At the same time, the implementation code becomes simpler (currently the diff is slightly net negative, but I didn't start removing scalar probes yet). One might argue that performance might improve, too, since we have fewer casts and less data duplication (the time column will occur only once per locset probe). All at the cost of a second pointer, however, which would be paid by all probes, not only cable probes. We could work around that by having a double* instead of a range and extracting the size from the metadata list instead. Another choice would be a vector of variants for the metadata, but that also feels off, since the type of the elements is homogeneous.
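
To make the endpoint concrete, under points 1-3 a sampler might reduce to the following sketch (hypothetical shapes illustrating the plan, not the PR's literal API):

// One metadata cast per batch; the payload is always rows of doubles.
void on_samples(arb::probe_metadata pm, std::size_t n, const arb::sample_record* recs) {
    auto* cables = arb::util::any_cast<const std::vector<arb::mcable>*>(pm.meta);
    std::size_t width = cables->size();
    for (std::size_t i = 0; i < n; ++i) {
        auto* row = arb::util::any_cast<const double*>(recs[i].data);
        // recs[i].time; row[0] .. row[width - 1]
    }
}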

Next, I want to allow switching off interpolation and to add a few canned samplers in addition to simple_sampler, like streaming out to (HDF5) files. (Also, I was thinking about replacing any with variant, as the system isn't really open, but I think that's just churn without payoff.)

So far, I am just playing with these ideas to see what the effect on production code looks like. I am very interested in your (and other people's) input here, since this would impact a lot of user code. I am still convinced, though, that a simpler interface with fewer divergent paths is a good thing here. My core question would be where and when the power offered by the current setup would actually be used.

@thorstenhater

Also, since I am now done with the usage code in C++, you could take a look at probe-demo.cpp and see what you think from the user's point of view, @halfflat.

@thorstenhater changed the title from "Py/stream lined probing" to "More uniform sampling" on Feb 12, 2025
@halfflat

Just on a few of the points raised in the description and in your notes above, before I look more thoroughly at the probe-demo code.

  • The necessity of alignment between the expected metadata type or sample data type and those provided by the simulation should, in my opinion, be regarded as a feature rather than a curse: it's there, instead of a more lax alternative, to ensure that there is semantic alignment between the probe specifications and the sampler handles.

    For example, the metadata for describing a mechanism state variable is much more complex than that for a membrane voltage probe. If they were both represented by the same type admitting a more-or-less arbitrary representation (e.g. a JSON representation), the type system wouldn't be there to stop a sampler which expects state data being misapplied to a probe which provides voltage data. Similarly, differences in sample data representation present a semantic promise enforced through the type system, and would moreover be necessary in any case for hypothetical probes which return more structured data.

  • There is type data in the metadata and sample data returns, but it's only useful for verifying that the type you are getting is the type you expect. What type you should expect is a property of the API and is expressed in the documentation; I acknowledge that errors are caught at run time rather than at compile time, but as far as I can tell there is no other way to express this through a generic C++ interface. The run-time checks, though, do prevent the class of 'dynamic typing' errors that a user would otherwise potentially face, as in the previous point.

  • any_cast on an any_ptr is lightweight: it simply compares the type-id and returns nullptr or the actual pointer accordingly. The type comparison is cheap in the expected true case, typically coming down to a pointer equality test (this can give a false negative in some cases with dynamically loaded objects and then take longer, iiuc).

    If the overhead of any_cast on each sample value is a concern, we could change the sampling callback function type to std::function<void (probe_metadata, std::size_t, const time_type* time, any_ptr data)>, though it could make the casting a little less intuitive: n values of type double would either be represented by e.g. a const pointer to n const pointer values (const double* const*), or we shuffle things in the cell group to remove the second indirection and just wrap a const double* (see the sketch after this list).

  • The historical reason for returning point samples with a locset address separately is that we were sort of forced into it by our morphological algebra — it's easier to allow a general locset than to assert that the locset resolves to a single point. But I still feel it's a useful convention: sampling over a probe whose basis locset resolves to two actual locations x and y is much more like sampling over two probes whose bases resolve to x and y respectively. In each case there is an invocation of the sampling callback per concrete location.

  • The sample value types and metadata types shouldn't be regarded as a closed set if we're still in favour of Bring Your Own Cell Group. Which I still am, to be honest.

  • Decoupling sampling plumbing from probe descriptions allows for the same recipe to be used with a variety or a subset of sampling schemes. It allows certain sorts of recipes (i.e. not in their full generality) to be represented in a 'flat' way, e.g. by text files. The same model of cells and connectivity can be used then in different contexts. It also allows for sampling to be changed at run time, e.g. by sampling more stuff more often if the user code sees something interesting.
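
For reference, the batch-style callback type from the any_cast point above could look like this (a sketch with the second indirection already removed, so data wraps a const double*):

#include <functional>

using batch_sampler = std::function<void (arb::probe_metadata, std::size_t,
                                          const arb::time_type*, arb::util::any_ptr)>;

batch_sampler s = [](arb::probe_metadata pm, std::size_t n,
                     const arb::time_type* time, arb::util::any_ptr data) {
    // one cast per batch instead of one per sample value
    if (auto* v = arb::util::any_cast<const double*>(data)) {
        for (std::size_t i = 0; i < n; ++i) { /* time[i], v[i] */ }
    }
};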

I think there are two issues at the root of the irritations:

  1. Having an open set of sample types is awkward to accommodate in Python, especially because for performance reasons we don't support sampler callbacks into Python code. Instead we have to somehow generically accumulate samples for presentation at the end.

  2. Generic handling of probe results, as in probe-demo and in the Python sample handling, is a poor fit for a variety of sample types.

On the second point, I think the generic use case, where user code really doesn't care about the semantic content of the probed data but just wants to present it in a generic way, is the less important case compared to the situation where the user code expects particular sorts of probed data for the purposes of further analysis, where a generic representation is less safe than a type-checked representation.

But that doesn't mean the generic case shouldn't be better supported. A possible approach is to include in the probe metadata a (possibly empty) function that maps the type-abstracted sample records to e.g. a JSON or flatbuffer (plus flatbuffer schema) representation that in turn can be parsed and processed in a generic way by the user code.

For the first point, I have been wondering if we could adopt a Python async approach: the Python lib sets up callbacks that push sample data into queues which are drained asynchronously and which block forward simulation progress until the Python user-side code pulls out enough sample data. It would potentially avoid the issue of keeping all the sample data in memory before any processing or I/O can deal with it.

@thorstenhater commented Feb 13, 2025

The performance/memory concern of py::simulation::run handing over all data in one chunk isn't on the list of grievances, but it is very much on my mind. I don't think that injecting async is going to help, as it would require an async runtime and runs into the coloured function issue. It also doesn't make the overhead between Python and C++ disappear. My current recommendation to users is to loop py::simulation::run if it becomes an issue.

I will think through your points more carefully, but I want to stress the central point that I hid among the details: this is about the user interface, because an elegant design doesn't help if it's annoying to use. The Rust folks use 'ergonomic', which fits the idea really well. I am also not the first person to think about sampling from this point of view, see #1929.

@halfflat

I guess I can't really see how the current design (in broad strokes; details can of course be improved, and simple_sampler is admittedly a mess) is less frustrating than the alternative. Squishing everything into one data representation throws away flexibility and type safety — we're swapping one problem for another. If you can verify the type, you know the data structure and can work with it. If you have to infer the structure and meaning of the data from a generic representation, you have to do a lot more runtime work to avoid semantic errors.

But also, and this is something I really don't see, what is so awful about calling any_cast?

@halfflat

Just to expand on the first part of the reply: if we represent a set of mlocations as mcables, something that expects a set of mlocations but receives mcables would have to verify that the mcables describe a point set in order to ensure it hasn't hooked itself up to the wrong probe. Just returning the correct type avoids this problem in the first place.

@thorstenhater

* The necessity of alignment between the expected metadata type or sample data type and those provided by the simulation should, in my opinion, be regarded as a feature rather than a curse: it's there, instead of a more lax alternative, to ensure that there is semantic alignment between the probe specifications and the sampler handles.

Sure, that isn't up for debate and was not relaxed. The core change is to acknowledge that most often things come in bunches, so instead of one value there's now a row of values. What I wanted to point out is that there's an injective relation metadata -> value_type that's only fixed in the documentation (and checked at runtime).
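
That relation could be reified at the type level roughly like so (remedy 1 from the list above; a hypothetical trait, not Arbor code):

#include <vector>

// Map each metadata type to its payload type at compile time.
template <typename Meta> struct payload_of;
template <> struct payload_of<arb::mlocation>   { using type = double; };
template <> struct payload_of<arb::mcable_list> { using type = std::vector<double>; };
// ... point_info cases analogously.

template <typename Meta> using payload_of_t = typename payload_of<Meta>::type;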

  For example, the metadata for describing a mechanism state variable is much more complex than that for a membrane voltage probe. If they were both represented by the same type admitting a more-or-less arbitrary representation (e.g. a JSON representation), the type system wouldn't be there to stop a sampler which expects state data being misapplied to a probe which provides voltage data. Similarly, differences in sample data representation present a semantic promise enforced through the type system, and would moreover be necessary in any case for hypothetical probes which return more structured data.

Agreed, that wasn't really changed, apart from my overeager flattening of locations and cables, which I am unhappy with, see above. Expect that to be reverted. Metadata is still defined by the probe type.

I am less sure about the value type, though. I am not really fond of building for generality without a need, and currently none of our probes return more than a value or an array thereof.

* The historical reason for returning point samples with a locset address separately is that we were sort of forced into it by our morphological algebra — it's easier to allow a general locset than to assert that the locset resolves to a single point. But I still feel it's a useful convention: sampling over a probe whose basis locset resolves to two actual locations _x_ and _y_ is much more like sampling over two probes whose bases resolve to _x_ and _y_ respectively. In each case there is an invocation of the sampling callback per concrete location.

I fail to see how a user could and should know that, and would rather protect them from such details.

* The sample value types and metadata types shouldn't be regarded as a closed set if we're still in favour of Bring Your Own Cell Group. Which I still am, to be honest.

Me too, but I am lacking the time and a use case.

* Decoupling sampling plumbing from probe descriptions allows for the same recipe to be used with a variety or a subset of sampling schemes. It allows certain sorts of recipes (i.e. not in their full generality) to be represented in a 'flat' way, e.g. by text files. The same model of cells and connectivity can be used then in different contexts. It also allows for sampling to be changed at run time, e.g. by sampling more stuff more often if the user code sees something interesting.

True, I am still in favour of that idea. However, let's make it easy on our users (easier at least).

@thorstenhater

So, now things are looking a lot closer to what I want.

  1. There's a helper, sample_reader, that takes care of the casting and data mapping (usage is sketched below).
  2. Value types can be inferred from the metadata's type.
  3. Metadata and payload are handed out as non-owning pointers, so things should be slightly more efficient (we are eliding one copy out of three).
  4. Sampling data is now returned (basically!) as a time column and a DataFrame / table of #meta x #time.
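
Usage then looks roughly like this (a sketch only; the actual sample_reader interface in the PR may differ in names and signature):

auto sampler = [](arb::probe_metadata pm, std::size_t n, const arb::sample_record* recs) {
    // the helper performs the casts; the value type is inferred from the metadata type
    auto reader = arb::sample_reader<arb::mcable>{pm, n, recs}; // hypothetical constructor
    for (std::size_t t = 0; t < reader.n_sample; ++t) {
        for (std::size_t c = 0; c < reader.width; ++c) {
            // reader.time(t), reader.value(t, c), reader.metadata(c)
        }
    }
};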

TODO: documentation is outdated now.

@thorstenhater

Fixed all open points and cleaned up documentation.

@halfflat if you're still interested in this topic, please take a look, else I'll open it up for review.

@thorstenhater marked this pull request as ready for review on September 1, 2025
