
Incorporate gempyor.distributions into gempyor.statistics #595

Open

emprzy wants to merge 18 commits into dev from incorporate-distributions-into-statistics

Conversation

@emprzy
Collaborator

@emprzy emprzy commented Aug 1, 2025

Describe your changes.

Incorporate the recently added distributions module into gempyor.statistics, and add a new ._likelihood() method to all Distributions that calculates the log likelihood of observed ground truth data, given the model data. It is worth noting that integrating this new log likelihood calculation method is not entirely seamless, as the previous implementation in gempyor.statistics (.llik()) used a dist_map to match distributions given in configs to their parameters. My changes leave us in a liminal state between these two, and provide if/else logic to allow backwards compatibility with dist_map, but I think it would be cleaner to move away from dist_map entirely.
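A rough sketch of the bridge described above (class and function names here are illustrative, not the exact PR code):

```python
import numpy as np
import scipy.stats

# Legacy-style lookup table: config distribution name -> log-likelihood fn.
dist_map = {
    "pois": lambda gt, model: scipy.stats.poisson.logpmf(k=gt, mu=model),
}


class PoissonDistribution:
    """Minimal stand-in for a gempyor.distributions class."""

    def _likelihood(self, gt_data: np.ndarray, model_data: np.ndarray) -> np.ndarray:
        # Log-likelihood of observed ground truth given model predictions.
        return scipy.stats.poisson.logpmf(k=gt_data, mu=model_data)


def llik(dist, gt_data, model_data):
    """Bridge: prefer the new ._likelihood() method, fall back to dist_map."""
    if hasattr(dist, "_likelihood"):
        # New path: the distribution object computes its own log-likelihood.
        return dist._likelihood(gt_data, model_data)
    # Old path: `dist` is a string key into dist_map (legacy behavior).
    return dist_map[dist](gt_data, model_data)
```

Both paths return the same elementwise log-likelihood array, which is what makes the liminal state workable until dist_map is retired.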

Does this pull request make any user interface changes? If so please describe.

No user interface changes.

What does your pull request address? Tag relevant issues.

This pull request addresses GH #583

@emprzy
Collaborator Author

emprzy commented Aug 1, 2025

3 questions going into review:

  1. How do we want to approach (if at all) migrating away from dist_map?
  2. Do we want to try to use NumPy arrays instead of xarrays in gempyor.statistics, or would that change be too invasive? I saw there was a comment (presumably left by Joseph) that xarrays are too confusing. It's nice that xarrays have name dimensions, but I agree they are kinda confusing (but maybe I'm just a NumPy girl).
  3. How do we want to go about unit testing ._likelihood()? I'm not sure of a good way besides individually testing input/expected values for each dist, since all the implementations are so different.

@emprzy emprzy self-assigned this Aug 1, 2025
@emprzy emprzy added enhancement Request for improvement or addition of new feature(s). gempyor Concerns the Python core. high priority High priority. inference Concerns the parameter inference framework. labels Aug 1, 2025
Contributor

@pearsonca pearsonca left a comment
This is a good start, but there are a few items to tack on.

@TimothyWillard
Contributor

  1. How do we want to approach (if at all) migrating away from dist_map?

Some of these can be easily replaced by distributions already (like poisson & norm/norm_homoskedastic), others can be replaced by a new distribution (like norm_heteroskedastic similar to normal but the variance scales with the mean). However I do not know for sure what to do about rmse or absolute_error. This is a bit out of my expertise (@pearsonca or @MacdonaldJoshuaCaleb should weigh in here), but aren't rmse & absolute_error proportional to normal & laplace distribution log likelihoods, respectively? I also do not think those last two are used in practice.
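One plausible reading of the `norm_heteroskedastic` idea, where the standard deviation scales with the model mean (the scaling rule here is an assumption, not the PR's implementation):

```python
import numpy as np
import scipy.stats


def heteroskedastic_normal_loglik(gt_data, model_data, scale_factor):
    # sd grows with the predicted mean, so larger predictions tolerate
    # proportionally larger absolute errors.
    sd = scale_factor * np.maximum(model_data, 1e-8)  # guard against sd == 0
    return scipy.stats.norm.logpdf(x=gt_data, loc=model_data, scale=sd)
```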

  2. Do we want to try to use NumPy arrays instead of xarrays in gempyor.statistics, or would that change be too invasive? I saw there was a comment (presumably left by Joseph) that xarrays are too confusing. It's nice that xarrays have name dimensions, but I agree they are kinda confusing (but maybe I'm just a NumPy girl).

I would say keep them as xarrays for now... Under the hood xarray uses numpy for its array representation so if you design these functions to work on numpy arrays xarray is smart enough to properly apply that to slices of a dataset/dataarray. I think changing this would be better done when we change the output representation from the simulator which starts with the return of gempyor.seir.steps_SEIR and then results in changes throughout gempyor.outcomes and finally here.
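An illustration of that point: a function written against NumPy arrays applies cleanly to xarray objects, which wrap NumPy under the hood (names here are illustrative):

```python
import numpy as np
import scipy.stats
import xarray as xr


def poisson_loglik(gt, model):
    # Plain NumPy-style function; knows nothing about xarray.
    return scipy.stats.poisson.logpmf(k=gt, mu=model)


gt = xr.DataArray(np.array([[1, 2], [3, 4]]), dims=("date", "subpop"))
model = xr.DataArray(np.array([[1.0, 2.0], [3.0, 4.0]]), dims=("date", "subpop"))

# apply_ufunc feeds the underlying NumPy arrays through and re-attaches
# the named dimensions to the result.
result = xr.apply_ufunc(poisson_loglik, gt, model)
```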

  3. How do we want to go about unit testing ._likelihood()? I'm not sure of a good way besides individually testing input/expected values for each dist, since all the implementations are so different.

Yeah, that's a fair point and as we've discussed before we do not want to be in the business of testing 3rd party packages. I would say just limit unit testing to where we implement some custom behavior.

Comment on lines 152 to 155
def _likelihood(self, gt_data: npt.NDArray, model_data: npt.NDArray) -> npt.NDArray:
"""Log-likelihood for the normal distribution."""
return scipy.stats.norm.logpdf(x=gt_data, loc=model_data, scale=self.sigma)

Contributor

It feels odd that you have to declare a mean for this distribution even though it doesn't get used. I'm pointing this out as an example, but I think pretty much all of these likelihoods suffer from this problem. For example, this is a truncated snippet from the simple_usa_statelevel.yml config:

inference:
  method: emcee
  iterations_per_slot: 100
  do_inference: TRUE
  gt_data_path: model_input/data/generated_gt_data.csv
  statistics:
    incidCase:
      name: incidCase
      sim_var: incidCase
      data_var: incidCase
      resample:
        aggregator: sum
        freq: W-SAT
        skipna: True
      zero_to_one: True
      likelihood: 
        dist: pois

this now would become:

inference:
  method: emcee
  iterations_per_slot: 100
  do_inference: TRUE
  gt_data_path: model_input/data/generated_gt_data.csv
  statistics:
    incidCase:
      name: incidCase
      sim_var: incidCase
      data_var: incidCase
      resample:
        aggregator: sum
        freq: W-SAT
        skipna: True
      zero_to_one: True
      likelihood: 
        distribution: pois
        lam: 1

You now have to add the lam and set it to a value so the pydantic model validation works, but it's totally unused by the likelihood function. I'm also not sure what the correct fix for this is though... thoughts @emprzy, @pearsonca?

Contributor

I've been thinking about some adjacent things.

What if we make the parameters optional?

We could write sample to take ** (at the abstract level, specific args in the implementations), which defaults to None: use the arguments if available, optional field values if not (and raise if created with no values and asked to sample with no args).

This would work nicely with likelihoods that don't want mean to be specified.
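A rough sketch of that idea (hypothetical class, not the PR's code): fields are optional at construction, call-site kwargs win, and sampling with neither raises.

```python
from typing import Optional

import numpy as np
import scipy.stats


class PoissonDistribution:
    def __init__(self, lam: Optional[float] = None) -> None:
        self.lam = lam  # optional at construction time

    def sample(self, size: int = 1, **kwargs) -> np.ndarray:
        # Call-site arguments take precedence over stored field values.
        lam = kwargs.get("lam", self.lam)
        if lam is None:
            raise ValueError("no `lam` provided at construction or call time")
        return np.random.default_rng().poisson(lam=lam, size=size)

    def _likelihood(self, gt_data, model_data):
        # The likelihood path never touches `lam`, so a parameter-free
        # instance is fine here.
        return scipy.stats.poisson.logpmf(k=gt_data, mu=model_data)
```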

Collaborator Author

So we make the parameters part of ** when you create an instance of a distribution? I like this, but let's take it to the logical extreme: if someone's (janky) config defines a Distribution with no parameters and the .sample() method gets called, an error will be raised, right? Are we OK with that possibility?

@pearsonca
Contributor

Some of these can be easily replaced by distributions already (like poisson & norm/norm_homoskedastic), others can be replaced by a new distribution (like norm_heteroskedastic similar to normal but the variance scales with the mean). However I do not know for sure what to do about rmse or absolute_error. This is a bit out of my expertise (@pearsonca or @MacdonaldJoshuaCaleb should weigh in here), but aren't rmse & absolute_error proportional to normal & laplace distribution log likelihoods, respectively? I also do not think those last two are used in practice.

I think RMSE would be ...like a multinormal likelihood? Anywho, I think these are more cases of alternative norms / distance functions, rather than something to try to capture with a likelihood based approach. We want to support other kinds of norms - I think the better answer here is to separate when we're doing a likelihood based norm vs something else, rather than trying to shoehorn these particular ones into a distribution framing.
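A numerical check of the proportionality claim above: with a fixed scale, the summed Laplace log-likelihood is an affine function of total absolute error, so ranking candidates by one is ranking them by the other (the normal log-likelihood relates to squared error, and hence RMSE, the same way up to a monotone transform).

```python
import numpy as np
import scipy.stats

gt = np.array([3.0, 5.0, 7.0])
model = np.array([2.5, 5.5, 6.0])
b = 1.0  # fixed Laplace scale

loglik = scipy.stats.laplace.logpdf(gt, loc=model, scale=b).sum()
abs_error = np.abs(gt - model).sum()

# Laplace logpdf is -log(2b) - |x - loc| / b, so summed over the data:
expected = -len(gt) * np.log(2 * b) - abs_error / b
```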

  3. How do we want to go about unit testing ._likelihood()? I'm not sure of a good way besides individually testing input/expected values for each dist, since all the implementations are so different.

Yeah, that's a fair point and as we've discussed before we do not want to be in the business of testing 3rd party packages. I would say just limit unit testing to where we implement some custom behavior.

agree

emprzy and others added 3 commits August 1, 2025 15:32
@emprzy
Collaborator Author

emprzy commented Aug 7, 2025

@pearsonca @TimothyWillard @MacdonaldJoshuaCaleb See here an example implementation of the approach we discussed in slack, where there are two different BaseModels for the distributions – one for Sampling and the other for calculating Loglikelihood. Let me know what you think re: our two options right now:

  1. Two distinct classes for the two distinct functionalities
  2. Using one DistributionsABC for both functionalities, making parameter input optional

@TimothyWillard
Contributor

@pearsonca @TimothyWillard @MacdonaldJoshuaCaleb See here an example implementation of the approach we discussed in slack, where there are two different BaseModels for the distributions – one for Sampling and the other for calculating Loglikelihood. Let me know what you think re: our two options right now:

  1. Two distinct classes for the two distinct functionalities
  2. Using one DistributionsABC for both functionalities, making parameter input optional

Now seeing approach (1) implemented I like this a bit more. I think it also gives us more flexibility to tweak and refine likelihoods independent of the rest of the configuration (thinking of #86 for example). Some minor comments before implementation & review in this PR:

  1. We don't have to keep the distribution terminology for this, we could call them likelihood terms, and also keep RMSE & absolute error, probably split them out into a different gempyor.likelihood module for maintainability.
  2. I think the "_sample" and "_loglike" suffixes aren't needed since these are distinct concepts and users will know which one is being used depending on the configuration section being worked in.
  3. I like the template method pattern that we're rolling with (log_likelihood being provided by LogLikelihoodTermABC calling _log_likelihood provided by implementations). I think this gives us a lot of flexibility to handle new features in the future.
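The template method pattern described in item 3 might be sketched like this (class names follow the comment; details may differ from the PR):

```python
from abc import ABC, abstractmethod

import numpy as np
import scipy.stats


class LogLikelihoodTermABC(ABC):
    def log_likelihood(self, gt_data: np.ndarray, model_data: np.ndarray) -> np.ndarray:
        # Public entry point owned by the ABC; shared pre/post-processing
        # can live here without touching any concrete implementation.
        return self._log_likelihood(gt_data, model_data)

    @abstractmethod
    def _log_likelihood(self, gt_data: np.ndarray, model_data: np.ndarray) -> np.ndarray: ...


class PoissonLogLikelihood(LogLikelihoodTermABC):
    def _log_likelihood(self, gt_data, model_data):
        return scipy.stats.poisson.logpmf(k=gt_data, mu=model_data)
```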

Contributor

@pearsonca pearsonca left a comment

Seems like progress, but there are some lingering bits from the switch to split distribution vs likelihood. There are also some issues with the likelihood specifications, though I think we can put off some implementations until they are actually needed.

Comment on lines 45 to 54

    """
    Calculates the log-likelihood of observing data given the model's predictions.

    Args:
        gt_data: The observed ground truth data.
        model_data: The data produced by flepiMoP.

    Returns:
        An array of log-likelihood values.
    """
Contributor

Whatever the proper docstring syntax here, let's have a details section noting that:

  • gt_data, model_data must be the same shape (leave aside for the moment any actual enforcement of this, however)
  • the caller is responsible for ensuring that gt_data and model_data align (i.e. this does nothing to deal with observation times versus model time)

Collaborator Author

This is a good point, though do we think there would be failures earlier than this code if gt_data and model_data do not have matching dimensions? Users aren't going to interact with the ._likelihood() method directly.


class TruncatedNormalLoglikelihood(LoglikelihoodABC):
"""
Represents a truncated normal distribution for calculating log-likelihood.
Contributor

We need to remind people that the interpretation of a, b here is different from what's going on w/ trunc normal sampling.

In sampling, a and b are the absolute sample limits. Here they should be the relative-offsets-to-mu sample limits.
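A numerical illustration of that interpretation gap; note that scipy.stats.truncnorm takes standardized bounds, so the conversion depends on whether config values are absolute limits (sampling) or offsets relative to the mean (likelihood):

```python
import scipy.stats

mu, sigma = 10.0, 2.0

# Absolute limits, e.g. "truncate to [5, 15]" (the sampling reading):
lo, hi = 5.0, 15.0
a_abs, b_abs = (lo - mu) / sigma, (hi - mu) / sigma

# Offsets relative to mu, e.g. "truncate to mu +/- 5" (the likelihood reading):
off_lo, off_hi = -5.0, 5.0
a_rel, b_rel = off_lo / sigma, off_hi / sigma

# These two happen to coincide only because the offsets were chosen to
# match the absolute limits around mu; in general the conversions disagree.
pdf = scipy.stats.truncnorm.logpdf(10.0, a=a_abs, b=b_abs, loc=mu, scale=sigma)
```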

Contributor

Hmm, actually I think there's a lot more going on here with how we want to allow a truncated normal likelihood; that's going to take some mathematical care.

For now, we shouldn't support this. We can revisit it if people request it in the future.

Collaborator Author

I agree with this take, and feel the same way about uniform for llik. Let me know if you disagree @TimothyWillard @pearsonca

)


class AbsoluteErrorLoglikelihood(LoglikelihoodABC):
Contributor

Abs Error isn't a likelihood - we need an additional layer of distance or norm or target evaluation here - so like TargetABC => LikelihoodABC => [Distributions]ABC. But also TargetABC => AbsoluteError. Ibid for RMSE
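The proposed layering could be sketched like this (hypothetical names following the comment, with one concrete class per branch to show the split):

```python
from abc import ABC, abstractmethod

import numpy as np
import scipy.stats


class TargetABC(ABC):
    @abstractmethod
    def evaluate(self, gt_data: np.ndarray, model_data: np.ndarray): ...


class LikelihoodABC(TargetABC):
    def evaluate(self, gt_data, model_data):
        return self._log_likelihood(gt_data, model_data)

    @abstractmethod
    def _log_likelihood(self, gt_data, model_data): ...


class NormalLikelihood(LikelihoodABC):
    def __init__(self, sigma: float) -> None:
        self.sigma = sigma

    def _log_likelihood(self, gt_data, model_data):
        return scipy.stats.norm.logpdf(gt_data, loc=model_data, scale=self.sigma)


class AbsoluteError(TargetABC):
    def evaluate(self, gt_data, model_data):
        # A distance, not a likelihood: note the scalar return.
        return float(np.abs(gt_data - model_data).sum())
```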

Collaborator Author

Definitely hear where you are coming from, though I think a better solution to this dilemma would be not to create more parent ABCs, but to find a better name for the LogLikelihoodABC class that is more inclusive of these two "flavors" of error calculation. Like just renaming LogLikelihoodABC to ErrorMetricABC, MetricABC, ObjectiveABC, or something like that? Keeps things simple and clean. @pearsonca

Collaborator Author

ObjectiveFunctionABC ...? Any takers...?

Contributor

Hmm - aren't we repeating the test-them-all error here?

Collaborator Author

This is the same framework we used for testing gempyor.distributions (one test file per distribution). Or is there something specific about this file that you think is repetitive?

Change `gempyor.likelihoods` to be `gempyor.objective_functions`
@emprzy
Copy link
Collaborator Author

emprzy commented Aug 28, 2025

In my most recent commit (32fc04c) I implement some of the minute suggestions from @pearsonca 's last review, and I also establish a new naming convention for what was previously called gempyor.likelihoods to hopefully be more inclusive of RMSE and AbsErr without having to create more parent ABCs. Let me know what y'all think of that implementation.
@pearsonca @TimothyWillard

I'll update things like tests and docstring documentation once I get the green light on this!

Contributor

@pearsonca pearsonca left a comment

I need to take a closer look at some of the likelihood calculations, but perhaps this is something to work with ahead of that.

@emprzy
Collaborator Author

emprzy commented Sep 10, 2025

Only thing left here is for @pearsonca to review whether or not scipy is doing what we want!

@emprzy
Collaborator Author

emprzy commented Sep 16, 2025

Note that we have had to disable RMSE and AbsoluteError as objective functions because gempyor.statistics::llik() is not presently able to handle scalar returns from an error metric calculation. Created #612 to address this.
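A minimal reproduction of the shape mismatch behind that decision: pointwise log-likelihoods return one value per observation, while RMSE collapses to a scalar, and code expecting an elementwise array breaks on the latter.

```python
import numpy as np
import scipy.stats

gt = np.array([1.0, 2.0, 3.0])
model = np.array([1.5, 2.0, 2.5])

# One log-likelihood value per observation:
pointwise = scipy.stats.norm.logpdf(gt, loc=model, scale=1.0)  # shape (3,)

# RMSE reduces the same comparison to a single number:
rmse = float(np.sqrt(np.mean((gt - model) ** 2)))  # scalar
```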

@TimothyWillard TimothyWillard removed their request for review January 28, 2026 16:10

Development

Successfully merging this pull request may close these issues.

[Feature request]: Incorporate gempyor.distributions into gempyor.statistics
