
Incorporate gempyor.distributions into gempyor.statistics #595

Open

emprzy wants to merge 18 commits into dev from incorporate-distributions-into-statistics

Conversation

@emprzy
Collaborator

@emprzy emprzy commented Aug 1, 2025

Describe your changes.

Incorporate the recently added distributions module into gempyor.statistics, and add a new ._likelihood() method to all Distributions that calculates the log likelihood of observed ground truth data, given the model data. It is worth noting that integrating this new log likelihood calculation method is not entirely seamless, as the previous implementation in gempyor.statistics (.llik()) used a dist_map to match distributions given in configs to their parameters. My changes leave us in a liminal state between these two, and provide if/else logic to allow backwards compatibility with dist_map, but I think it would be cleaner to move away from dist_map entirely.
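A rough sketch of the bridge described above (class and function names here are illustrative, not the exact PR code):

```python
import numpy as np
import scipy.stats

# Legacy-style lookup table: config distribution name -> log-likelihood fn.
dist_map = {
    "pois": lambda gt, model: scipy.stats.poisson.logpmf(k=gt, mu=model),
}


class PoissonDistribution:
    """Minimal stand-in for a gempyor.distributions class."""

    def _likelihood(self, gt_data: np.ndarray, model_data: np.ndarray) -> np.ndarray:
        # Log-likelihood of observed ground truth given model predictions.
        return scipy.stats.poisson.logpmf(k=gt_data, mu=model_data)


def llik(dist, gt_data, model_data):
    """Bridge: prefer the new ._likelihood() method, fall back to dist_map."""
    if hasattr(dist, "_likelihood"):
        # New path: the distribution object computes its own log-likelihood.
        return dist._likelihood(gt_data, model_data)
    # Old path: `dist` is a string key into dist_map (legacy behavior).
    return dist_map[dist](gt_data, model_data)
```

Both paths return the same elementwise log-likelihood array, which is what makes the liminal state workable until dist_map is retired.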

Does this pull request make any user interface changes? If so please describe.

No user interface changes.

What does your pull request address? Tag relevant issues.

This pull request addresses GH #583

@emprzy
Collaborator Author

emprzy commented Aug 1, 2025

3 questions going into review:

  1. How do we want to approach (if at all) migrating away from dist_map?
  2. Do we want to try to use NumPy arrays instead of xarrays in gempyor.statistics, or would that change be too invasive? I saw there was a comment (presumably left by Joseph) that xarrays are too confusing. It's nice that xarrays have name dimensions, but I agree they are kinda confusing (but maybe I'm just a NumPy girl).
  3. How do we want to go about unit testing ._likelihood()? I'm not sure of a good way besides individually testing input/expected values for each dist, since all the implementations are so different.

@emprzy emprzy self-assigned this Aug 1, 2025
@emprzy emprzy added enhancement Request for improvement or addition of new feature(s). gempyor Concerns the Python core. high priority High priority. inference Concerns the parameter inference framework. labels Aug 1, 2025
Contributor

@pearsonca pearsonca left a comment
This is a good start, but there are a few items to tack on.

@TimothyWillard
Contributor

  1. How do we want to approach (if at all) migrating away from dist_map?

Some of these can be easily replaced by distributions already (like poisson & norm/norm_homoskedastic), others can be replaced by a new distribution (like norm_heteroskedastic similar to normal but the variance scales with the mean). However I do not know for sure what to do about rmse or absolute_error. This is a bit out of my expertise (@pearsonca or @MacdonaldJoshuaCaleb should weigh in here), but aren't rmse & absolute_error proportional to normal & laplace distribution log likelihoods, respectively? I also do not think those last two are used in practice.
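One plausible reading of the `norm_heteroskedastic` idea, where the standard deviation scales with the model mean (the scaling rule here is an assumption, not the PR's implementation):

```python
import numpy as np
import scipy.stats


def heteroskedastic_normal_loglik(gt_data, model_data, scale_factor):
    # sd grows with the predicted mean, so larger predictions tolerate
    # proportionally larger absolute errors.
    sd = scale_factor * np.maximum(model_data, 1e-8)  # guard against sd == 0
    return scipy.stats.norm.logpdf(x=gt_data, loc=model_data, scale=sd)
```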

  2. Do we want to try to use NumPy arrays instead of xarrays in gempyor.statistics, or would that change be too invasive? I saw there was a comment (presumably left by Joseph) that xarrays are too confusing. It's nice that xarrays have name dimensions, but I agree they are kinda confusing (but maybe I'm just a NumPy girl).

I would say keep them as xarrays for now... Under the hood xarray uses numpy for its array representation so if you design these functions to work on numpy arrays xarray is smart enough to properly apply that to slices of a dataset/dataarray. I think changing this would be better done when we change the output representation from the simulator which starts with the return of gempyor.seir.steps_SEIR and then results in changes throughout gempyor.outcomes and finally here.
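An illustration of that point: a function written against NumPy arrays applies cleanly to xarray objects, which wrap NumPy under the hood (names here are illustrative):

```python
import numpy as np
import scipy.stats
import xarray as xr


def poisson_loglik(gt, model):
    # Plain NumPy-style function; knows nothing about xarray.
    return scipy.stats.poisson.logpmf(k=gt, mu=model)


gt = xr.DataArray(np.array([[1, 2], [3, 4]]), dims=("date", "subpop"))
model = xr.DataArray(np.array([[1.0, 2.0], [3.0, 4.0]]), dims=("date", "subpop"))

# apply_ufunc feeds the underlying NumPy arrays through and re-attaches
# the named dimensions to the result.
result = xr.apply_ufunc(poisson_loglik, gt, model)
```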

  3. How do we want to go about unit testing ._likelihood()? I'm not sure of a good way besides individually testing input/expected values for each dist, since all the implementations are so different.

Yeah, that's a fair point and as we've discussed before we do not want to be in the business of testing 3rd party packages. I would say just limit unit testing to where we implement some custom behavior.

Comment on lines 152 to 155
def _likelihood(self, gt_data: npt.NDArray, model_data: npt.NDArray) -> npt.NDArray:
"""Log-likelihood for the normal distribution."""
return scipy.stats.norm.logpdf(x=gt_data, loc=model_data, scale=self.sigma)

Contributor

It feels odd that you have to declare a mean for this distribution even though it doesn't get used. I'm pointing this out as an example, but I think pretty much all of these likelihoods suffer from this problem. For example, this is a truncated snippet from the simple_usa_statelevel.yml config:

inference:
  method: emcee
  iterations_per_slot: 100
  do_inference: TRUE
  gt_data_path: model_input/data/generated_gt_data.csv
  statistics:
    incidCase:
      name: incidCase
      sim_var: incidCase
      data_var: incidCase
      resample:
        aggregator: sum
        freq: W-SAT
        skipna: True
      zero_to_one: True
      likelihood: 
        dist: pois

this now would become:

inference:
  method: emcee
  iterations_per_slot: 100
  do_inference: TRUE
  gt_data_path: model_input/data/generated_gt_data.csv
  statistics:
    incidCase:
      name: incidCase
      sim_var: incidCase
      data_var: incidCase
      resample:
        aggregator: sum
        freq: W-SAT
        skipna: True
      zero_to_one: True
      likelihood: 
        distribution: pois
        lam: 1

You now have to add the lam and set it to a value so the pydantic model validation works, but it's totally unused by the likelihood function. I'm also not sure what the correct fix for this is though... thoughts @emprzy, @pearsonca?

Contributor

I've been thinking about some adjacent things.

What if we make the parameters optional?

We could write sample to take ** (at the abstract level, specific args in the implementations), which defaults to None: use the arguments if available, optional field values if not (and raise if created with no values and asked to sample with no args).

This would work nicely with likelihoods that don't want mean to be specified.
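A rough sketch of that idea (hypothetical class, not the PR's code): fields are optional at construction, call-site kwargs win, and sampling with neither raises.

```python
from typing import Optional

import numpy as np
import scipy.stats


class PoissonDistribution:
    def __init__(self, lam: Optional[float] = None) -> None:
        self.lam = lam  # optional at construction time

    def sample(self, size: int = 1, **kwargs) -> np.ndarray:
        # Call-site arguments take precedence over stored field values.
        lam = kwargs.get("lam", self.lam)
        if lam is None:
            raise ValueError("no `lam` provided at construction or call time")
        return np.random.default_rng().poisson(lam=lam, size=size)

    def _likelihood(self, gt_data, model_data):
        # The likelihood path never touches `lam`, so a parameter-free
        # instance is fine here.
        return scipy.stats.poisson.logpmf(k=gt_data, mu=model_data)
```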

Collaborator Author

So we make the parameters part of ** when you create an instance of a distribution? I like this, but let's take it to the logical extreme: if someone's (janky) config defines a Distribution with no parameters and the .sample() method gets called, an error will be raised, right? Are we OK with that possibility?

@pearsonca
Contributor

Some of these can be easily replaced by distributions already (like poisson & norm/norm_homoskedastic), others can be replaced by a new distribution (like norm_heteroskedastic similar to normal but the variance scales with the mean). However I do not know for sure what to do about rmse or absolute_error. This is a bit out of my expertise (@pearsonca or @MacdonaldJoshuaCaleb should weigh in here), but aren't rmse & absolute_error proportional to normal & laplace distribution log likelihoods, respectively? I also do not think those last two are used in practice.

I think RMSE would be ...like a multinormal likelihood? Anywho, I think these are more cases of alternative norms / distance functions, rather than something to try to capture with a likelihood based approach. We want to support other kinds of norms - I think the better answer here is to separate when we're doing a likelihood based norm vs something else, rather than trying to shoehorn these particular ones into a distribution framing.
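A numerical check of the proportionality claim above: with a fixed scale, the summed Laplace log-likelihood is an affine function of total absolute error, so ranking candidates by one is ranking them by the other (the normal log-likelihood relates to squared error, and hence RMSE, the same way up to a monotone transform).

```python
import numpy as np
import scipy.stats

gt = np.array([3.0, 5.0, 7.0])
model = np.array([2.5, 5.5, 6.0])
b = 1.0  # fixed Laplace scale

loglik = scipy.stats.laplace.logpdf(gt, loc=model, scale=b).sum()
abs_error = np.abs(gt - model).sum()

# Laplace logpdf is -log(2b) - |x - loc| / b, so summed over the data:
expected = -len(gt) * np.log(2 * b) - abs_error / b
```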

  3. How do we want to go about unit testing ._likelihood()? I'm not sure of a good way besides individually testing input/expected values for each dist, since all the implementations are so different.

Yeah, that's a fair point and as we've discussed before we do not want to be in the business of testing 3rd party packages. I would say just limit unit testing to where we implement some custom behavior.

agree

emprzy and others added 3 commits August 1, 2025 15:32
@emprzy
Collaborator Author

emprzy commented Aug 7, 2025

@pearsonca @TimothyWillard @MacdonaldJoshuaCaleb See here an example implementation of the approach we discussed in slack, where there are two different BaseModels for the distributions – one for Sampling and the other for calculating Loglikelihood. Let me know what you think re: our two options right now:

  1. Two distinct classes for the two distinct functionalities
  2. Using one DistributionsABC for both functionalities, making parameter input optional

@TimothyWillard
Contributor

@pearsonca @TimothyWillard @MacdonaldJoshuaCaleb See here an example implementation of the approach we discussed in slack, where there are two different BaseModels for the distributions – one for Sampling and the other for calculating Loglikelihood. Let me know what you think re: our two options right now:

  1. Two distinct classes for the two distinct functionalities
  2. Using one DistributionsABC for both functionalities, making parameter input optional

Now seeing approach (1) implemented I like this a bit more. I think it also gives us more flexibility to tweak and refine likelihoods independent of the rest of the configuration (thinking of #86 for example). Some minor comments before implementation & review in this PR:

  1. We don't have to keep the distribution terminology for this, we could call them likelihood terms, and also keep RMSE & absolute error, probably split them out into a different gempyor.likelihood module for maintainability.
  2. I think the "_sample" and "_loglike" suffixes aren't needed since these are distinct concepts and users will know which one is being used depending on the configuration section being worked in.
  3. I like the template method pattern that we're rolling with (log_likelihood being provided by LogLikelihoodTermABC calling _log_likelihood provided by implementations). I think this gives us a lot of flexibility to handle new features in the future.
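The template method pattern described in item 3 might be sketched like this (class names follow the comment; details may differ from the PR):

```python
from abc import ABC, abstractmethod

import numpy as np
import scipy.stats


class LogLikelihoodTermABC(ABC):
    def log_likelihood(self, gt_data: np.ndarray, model_data: np.ndarray) -> np.ndarray:
        # Public entry point owned by the ABC; shared pre/post-processing
        # can live here without touching any concrete implementation.
        return self._log_likelihood(gt_data, model_data)

    @abstractmethod
    def _log_likelihood(self, gt_data: np.ndarray, model_data: np.ndarray) -> np.ndarray: ...


class PoissonLogLikelihood(LogLikelihoodTermABC):
    def _log_likelihood(self, gt_data, model_data):
        return scipy.stats.poisson.logpmf(k=gt_data, mu=model_data)
```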

Contributor

@pearsonca pearsonca left a comment

Seems like progress, but there are some lingering bits from the switch to split distribution vs likelihood. There are also some issues with the likelihood specifications, though I think we can put off some implementations until they are actually needed.

Comment on lines 45 to 54

    """
    Calculates the log-likelihood of observing data given the model's predictions.

    Args:
        gt_data: The observed ground truth data.
        model_data: The data produced by flepiMoP.

    Returns:
        An array of log-likelihood values.
    """
Contributor

Whatever the proper docstring syntax here, let's have a details section noting that:

  • gt_data, model_data must be the same shape (leave aside for the moment any actual enforcement of this, however)
  • the caller is responsible for ensuring that gt_data and model_data align (i.e. this does nothing to deal with observation times versus model time)

Collaborator Author

This is a good point, though do we think there would be failures earlier than this code if gt_data and model_data do not have matching dimensions? Users aren't going to interact with the ._likelihood() method directly.


class TruncatedNormalLoglikelihood(LoglikelihoodABC):
"""
Represents a truncated normal distribution for calculating log-likelihood.
Contributor

We need to remind people that the interpretation of a, b here is different from what's going on w/ trunc normal sampling.

In sampling, a and b are the absolute sample limits. Here they should be the relative-offsets-to-mu sample limits.
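A numerical illustration of that interpretation gap; note that scipy.stats.truncnorm takes standardized bounds, so the conversion depends on whether config values are absolute limits (sampling) or offsets relative to the mean (likelihood):

```python
import scipy.stats

mu, sigma = 10.0, 2.0

# Absolute limits, e.g. "truncate to [5, 15]" (the sampling reading):
lo, hi = 5.0, 15.0
a_abs, b_abs = (lo - mu) / sigma, (hi - mu) / sigma

# Offsets relative to mu, e.g. "truncate to mu +/- 5" (the likelihood reading):
off_lo, off_hi = -5.0, 5.0
a_rel, b_rel = off_lo / sigma, off_hi / sigma

# These two happen to coincide only because the offsets were chosen to
# match the absolute limits around mu; in general the conversions disagree.
pdf = scipy.stats.truncnorm.logpdf(10.0, a=a_abs, b=b_abs, loc=mu, scale=sigma)
```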

Contributor

Hmm, actually I think there's a lot more going on here with how we want to allow a truncated normal likelihood; that's going to take some mathematical care.

For now, we shouldn't support this. We can revisit it if people request it in the future.

Collaborator Author

I agree with this take, and feel the same way about uniform for llik. Let me know if you disagree @TimothyWillard @pearsonca

)


class AbsoluteErrorLoglikelihood(LoglikelihoodABC):
Contributor

Abs Error isn't a likelihood - we need an additional layer of distance or norm or target evaluation here - so like TargetABC => LikelihoodABC => [Distributions]ABC. But also TargetABC => AbsoluteError. Ibid for RMSE
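The proposed layering could be sketched like this (hypothetical names following the comment, with one concrete class per branch to show the split):

```python
from abc import ABC, abstractmethod

import numpy as np
import scipy.stats


class TargetABC(ABC):
    @abstractmethod
    def evaluate(self, gt_data: np.ndarray, model_data: np.ndarray): ...


class LikelihoodABC(TargetABC):
    def evaluate(self, gt_data, model_data):
        return self._log_likelihood(gt_data, model_data)

    @abstractmethod
    def _log_likelihood(self, gt_data, model_data): ...


class NormalLikelihood(LikelihoodABC):
    def __init__(self, sigma: float) -> None:
        self.sigma = sigma

    def _log_likelihood(self, gt_data, model_data):
        return scipy.stats.norm.logpdf(gt_data, loc=model_data, scale=self.sigma)


class AbsoluteError(TargetABC):
    def evaluate(self, gt_data, model_data):
        # A distance, not a likelihood: note the scalar return.
        return float(np.abs(gt_data - model_data).sum())
```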

Collaborator Author

Definitely hear where you are coming from, though I think a better solution to this dilemma would be not to create more parent ABCs, but to find a better name for the LogLikelihoodABC class that is more inclusive of these two "flavors" of error calculation. Like just renaming LogLikelihoodABC to ErrorMetricABC, MetricABC, ObjectiveABC, or something like that? Keeps things simple and clean. @pearsonca

Collaborator Author

ObjectiveFunctionABC ...? Any takers...?

Contributor

Hmm - aren't we repeating the test-them-all error here?

Collaborator Author

This is the same framework we used for testing gempyor.distributions (one test file per distribution). Or is there something specific about this file that you think is repetitive?

Change `gempyor.likelihoods` to be `gempyor.objective_functions`
@emprzy
Copy link
Collaborator Author

emprzy commented Aug 28, 2025

In my most recent commit (32fc04c) I implement some of the minute suggestions from @pearsonca 's last review, and I also establish a new naming convention for what was previously called gempyor.likelihoods to hopefully be more inclusive of RMSE and AbsErr without having to create more parent ABCs. Let me know what y'all think of that implementation.
@pearsonca @TimothyWillard

I'll update things like tests and docstring documentation once I get the green light on this!

Contributor

@pearsonca pearsonca left a comment

I need to take a closer look at some of the likelihood calculations, but perhaps this is something to work with ahead of that.

@emprzy
Collaborator Author

emprzy commented Sep 10, 2025

Only thing left here is for @pearsonca to review whether or not scipy is doing what we want!

@emprzy
Collaborator Author

emprzy commented Sep 16, 2025

Note that we have had to disable RMSE and AbsoluteError as objective functions because gempyor.statistics::llik() is not presently able to handle scalar returns from an error metric calculation. Created #612 to address this.
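A minimal reproduction of the shape mismatch behind that decision: pointwise log-likelihoods return one value per observation, while RMSE collapses to a scalar, and code expecting an elementwise array breaks on the latter.

```python
import numpy as np
import scipy.stats

gt = np.array([1.0, 2.0, 3.0])
model = np.array([1.5, 2.0, 2.5])

# One log-likelihood value per observation:
pointwise = scipy.stats.norm.logpdf(gt, loc=model, scale=1.0)  # shape (3,)

# RMSE reduces the same comparison to a single number:
rmse = float(np.sqrt(np.mean((gt - model) ** 2)))  # scalar
```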

@TimothyWillard TimothyWillard removed their request for review January 28, 2026 16:10

Development

Successfully merging this pull request may close these issues.

[Feature request]: Incorporate gempyor.distributions into gempyor.statistics
