
Conversation

@scarlehoff (Member) commented Oct 13, 2021

Turns out it wasn't such low-hanging fruit after all, but this closes #1420.

Some caveats (the reasons why this is still a draft):

  • There are a few very fundamental places where libNNPDF still needs to be used, because there are libNNPDF objects asking for a PDF. I need to check whether I can find a workaround for some of them; at present I'm happy with my solution for those (as they also move from libNNPDF to python, it will be as easy as removing the as_libNNPDF() call).
  • grid_values in python is not as slow as I initially thought it would be. I just tried with the most naive loop-nesting and it was actually reasonable, but I need to actually benchmark it. If it turns out to be really slow for most use cases I have a few solutions in mind that would work for us, but maybe it is not even necessary.
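
For the record, the "most naive loop-nesting" is roughly the following (a minimal sketch assuming the standard LHAPDF python bindings; function and variable names are illustrative, not the actual implementation):

```python
import numpy as np
import lhapdf  # standard LHAPDF python bindings

def naive_grid_values(setname, flavours, xgrid, qgrid):
    """Illustrative only: fill a (member, flavour, x, Q) array with
    plain nested loops over xfxQ calls."""
    members = lhapdf.mkPDFs(setname)  # loads every member of the set
    res = np.empty((len(members), len(flavours), len(xgrid), len(qgrid)))
    for m, pdf in enumerate(members):
        for f, pid in enumerate(flavours):
            for i, x in enumerate(xgrid):
                for j, q in enumerate(qgrid):
                    res[m, f, i, j] = pdf.xfxQ(pid, x, q)
    return res
```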

RE the current failure in the tests: it is due to using float64; when using vp-nextfitruncard there is a difference in the 4th digit. Reasonable. Maybe more failures will occur as I move forward, though.

TODO:

  • Add an interface for alpha_s and xfxQ such that it can be easily hooked into the python interface of pineappl (see the sketch below).
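
A rough idea of the kind of adapter this TODO refers to (a minimal sketch, assuming the LHAPDF python bindings; the pineappl convolution call shown has changed name and signature between versions, so treat it as illustrative):

```python
import lhapdf
import pineappl  # python interface of pineappl

pdf = lhapdf.mkPDF("NNPDF40_nnlo_as_01180", 0)  # example set name

def xfx(pid, x, q2):
    # pineappl expects a callable (pid, x, Q^2) -> x*f(x, Q^2)
    return pdf.xfxQ2(pid, x, q2)

def alphas(q2):
    # and a callable Q^2 -> alpha_s(Q^2)
    return pdf.alphasQ2(q2)

grid = pineappl.grid.Grid.read("predictions.pineappl.lz4")
# illustrative: convolute the grid with a single (proton) PDF
predictions = grid.convolute_with_one(2212, xfx, alphas)
```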

Despite being marked for review, it cannot be merged until the example resources are updated.

@scarlehoff (Member, Author) commented Oct 14, 2021

The difference in speed is not relevant for most cases, which is why I thought "well, 1 second vs 2 seconds, who cares"; only, this difference together with the difference in the convolution adds up enough to be a pain. I'll see what I can do.

For starters, doing "the convolution and then the central value" every time is probably adding a factor of two, so I'll change it such that the convolution is done just once (no more computing the central value separately) and play a bit more with grid_values (I want to avoid cffi but it might be necessary; I think cffi is the easiest solution for this problem).
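
The point of the single convolution, as a minimal numpy sketch with illustrative shapes: the central member simply rides along in the same product instead of requiring a separate pass.

```python
import numpy as np

# illustrative shapes: an FK table of (ndata, nx) and a PDF grid of
# (1 central member + 100 replicas, nx)
fk = np.random.rand(50, 100)
pdf_grid = np.random.rand(101, 100)

predictions = pdf_grid @ fk.T   # (101, ndata) in a single pass
central_value = predictions[0]  # no extra "fktable x 1" computation
replicas = predictions[1:]
```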

I will also heavily modify core::PDF since, now that we have the lhapdf object in python, I think part of the content of that file can be simplified.

@scarlehoff scarlehoff force-pushed the removing_cpp_LHAPDFSet_python branch from 1bdd7d8 to 2c58d49 on October 14, 2021 10:44
@scarlehoff (Member, Author)

After a few benchmarks I'm actually happy with the speed of grid_values here. It can be improved, but the bulk of the problem is always either in the fktable parsing (see #1091 (comment)) or the convolution itself.
I think cffi (here) is not worth it. I'll have a look at the convolution to see whether it can be sped up easily.

@scarlehoff scarlehoff marked this pull request as ready for review October 14, 2021 15:48
@Zaharid (Contributor) commented Oct 20, 2021

I wonder whether we want to have this C++ compatibility layer here, or instead remove the one remaining use case, which is the t0 predictions.

@scarlehoff (Member, Author)

I haven't looked at the t0 code so I don't know how many other things need to be changed as well for that to be "extracted" from C++.

@Zaharid (Contributor) commented Oct 20, 2021

I don't think there is that much to do with regard to the t0: we already have the t0covmat in pure python https://docs.nnpdf.science/vp/pydataobjs.html?highlight=dataset_inputs_covmat_from_systematics#loading-covariance-matrices

so it would be a matter of putting it in the right place.

That said, I am coming to think that doing it in two steps might be good, on the grounds that one part is already done and it would result in easier-to-review PRs.

@scarlehoff scarlehoff changed the title from "Remove libNNPDF::LHAPDFSet" to "Remove libNNPDF::LHAPDFSet for a vp-based class" Oct 28, 2021
@scarlehoff scarlehoff linked an issue Oct 28, 2021 that may be closed by this pull request
@scarlehoff scarlehoff force-pushed the removing_cpp_thpredictions branch from 4e8c31f to 0bad6c1 on November 1, 2021 20:43
@scarlehoff scarlehoff force-pushed the removing_cpp_LHAPDFSet_python branch from fa7913a to b872fda on November 2, 2021 09:28
@scarlehoff (Member, Author)

For the missing test, I think I will need to re-do the "next exponent runcard" with this branch because of the move from float32 to float64.

@Zaharid Zaharid added the enhancement New feature or request label Nov 3, 2021
@Zaharid (Contributor) left a comment

Would like to comment on this.

```python
# At this point only the result of the ThPredictions is a vp object
# which includes a special CV column which needs to be pop'd
try:
    central_value = dataobj.pop("CV")
```
@Zaharid (Contributor)

Maybe mutating the object like this will be dangerous/confusing

@scarlehoff (Member, Author)

So this is probably due to my own limitations with pandas; I'll write here "what I want" and let's see whether there's a way of doing it (/cc @Zaharid).

Right now, when the predictions come from libNNPDF, they are an object that contains a "central value" and "data", and these are two different quantities.

In vp instead the predictions are just a pandas dataframe from which one cannot simply take the cv, because the central value for MC PDFs is not the same as the central value of Hessian PDFs, so my solution has been to add a new CV column with the "true" central value.

The problem is that I don't want that extra column to be part of the "data" of the prediction.

Is it clear / is there a solution to this issue?

@scarlehoff (Member, Author)

An important constraint I need to include here is that the computation of the central value and the envelope needs to happen at the same time (because it is faster to do fktable x 101 rather than fktable x 100 and later fktable x 1), so at some point the CV will have to be included in the dataframe.

The way I'm doing that is more or less clear in the example in the PDF class.
Maybe (probably) the solution is to make a copy of the dataframe at this point, so that it doesn't get mutated but is instead separated into the CV and the envelope of replicas.
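
A minimal sketch of that non-mutating variant, with a toy stand-in for the predictions dataframe:

```python
import numpy as np
import pandas as pd

# toy stand-in: one column per replica plus the special "CV" column
dataobj = pd.DataFrame(
    np.random.rand(5, 4), columns=["CV", "rep1", "rep2", "rep3"]
)

central_value = dataobj["CV"]
replicas = dataobj.drop(columns="CV")  # a new frame; dataobj is not mutated
```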

@Zaharid (Contributor) commented Nov 10, 2021

Sorry I have been slow here.

In general I believe we should be doing something simpler. I would note that, from the POV of vp, all that this code is supposed to replace is the grid_values function (with the corresponding .load() method), so in principle that is all the new PDF class needs to do.

For that, adding an interface with a stateful context manager looks a bit much. For one, an explicit parameter to grid_values would be simpler (and it is really the caller who should decide whether they want that behaviour or not, rather than the "context").
And I am not sure even that is needed: AFAICT, insofar as we want to have both "central value is replica 0" and "central value is the mean" behaviours (of which I am still not too sure), that could be done with Stats objects that return the requisite views of the data. We could always query all replicas, which would probably simplify the understanding of what _rawdata is. In practical terms I believe this would mean removing the NNPDFResult classes in favour of StatsResult, which is the more reasonable API.
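
A minimal sketch of the Stats idea (not the actual validphys classes; the names are illustrative):

```python
import numpy as np

class Stats:
    """Wraps the raw (nmembers, ...) array and exposes views of it."""
    def __init__(self, rawdata):
        self.rawdata = np.asarray(rawdata)

class MCStats(Stats):
    # for Monte Carlo sets the central value is the mean over replicas
    def central_value(self):
        return self.rawdata.mean(axis=0)

    def error_members(self):
        return self.rawdata

class HessianStats(Stats):
    # for Hessian sets member 0 *is* the central value
    def central_value(self):
        return self.rawdata[0]

    def error_members(self):
        return self.rawdata[1:]
```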

Another consideration is that grid_values is there because I wasn't going to write every possible loop in C, but we could just as well have more specialized functions now.

@scarlehoff (Member, Author)

I think these are different problems, however:

> In general I believe we should be doing something simpler. I would note that, from the POV of vp, all that this code is supposed to replace is the grid_values function (with the corresponding .load() method), so in principle that is all the new PDF class needs to do.

Sadly, for now it needs to be done like that, since the libNNPDF and vp versions have to coexist.

> For that, adding an interface with a stateful context manager looks a bit much. For one, an explicit parameter to grid_values would be simpler (and it is really the caller who should decide whether they want that behaviour or not, rather than the "context").
> And I am not sure even that is needed: AFAICT, insofar as we want to have both "central value is replica 0" and "central value is the mean" behaviours (of which I am still not too sure), that could be done with Stats objects that return the requisite views of the data. We could always query all replicas, which would probably simplify the understanding of what _rawdata is. In practical terms I believe this would mean removing the NNPDFResult classes in favour of StatsResult, which is the more reasonable API.

I agree that the whole context manager thing is horrible (not because of the context manager itself, but because of the mutability). But as I said, I'm not entirely sure how to do better without changing many different things, since:

  • The convolutions module creates a dataframe with a prediction per replica.
  • Replica 0 means a different thing for MC and Hessian PDFs and, most importantly, in one of the two it is not to be considered.
  • Doing the calculation twice is not an option because it takes much longer (not just a factor of two, whereas adding the cv to the whole convolution is essentially for free).
  • The only one who knows what to do with the CV is the PDF itself.

And, most importantly, the point at which grid_values is called is very hidden:

```python
gv1 = gv1func(qmat=[Q], vmat=FK_FLAVOURS, xmat=xgrid).squeeze(-1)
```

so either I have to add an include_central argument across many different levels, or I use a context manager that activates the central value for any function calling the PDF inside the context.
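
For concreteness, a minimal sketch of that context-manager route; all names are hypothetical and compute_grid is a stand-in for the actual evaluation:

```python
import contextlib
import contextvars

_include_central = contextvars.ContextVar("include_central", default=False)

@contextlib.contextmanager
def central_value_enabled():
    """Any grid_values call made inside this context also returns member 0."""
    token = _include_central.set(True)
    try:
        yield
    finally:
        _include_central.reset(token)

def grid_values(members, flavours, xgrid, qgrid):
    # the caller never passes a flag: the "context" decides whether
    # member 0 (the central member) is part of the returned grid
    if not _include_central.get():
        members = members[1:]
    return compute_grid(members, flavours, xgrid, qgrid)  # hypothetical helper
```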

To me it was the simplest solution, but I will be happy with any other ideas, as long as they are more or less concrete. I know this is not the best, but the Stats idea doesn't really solve it, because you don't have anything to fill the Stats with until you get out of the prediction.

We can discuss during the code meeting if this was not clear enough.

> Another consideration is that grid_values is there because I wasn't going to write every possible loop in C, but we could just as well have more specialized functions now.

This I didn't understand. I think grid_values is a function that makes sense, since we often do want the grid of flavours-x-q-replicas.

Personally I'm not unhappy with the LHAPDFSet class, since it contains everything that is useful and it is not very complicated. Maybe it can be simplified a bit more without the t0 part (e.g., I can drop the libNNPDF error types). It's the central value problem I'm not happy with.

@scarlehoff (Member, Author)

So the final proposal is:

  1. Port the t0 code to python.
  2. Always include replica 0; when people don't want it, they can use specific properties of the Stats class to remove it.

I will close this because it is probably easier to start a new PR. Also, I would want to do t0 in python first and the lhapdfset afterwards.

@scarlehoff scarlehoff closed this Nov 10, 2021
@siranipour (Contributor)

Wait, isn't the python t0 code here?

```python
def t0_covmat_from_systematics(
```

@scarlehoff (Member, Author)

But libNNPDF is still used, so I'm not sure how many steps are missing from python (as said before, I haven't looked at the cpp code for t0, and that is still true :P).

@siranipour (Contributor)

What part uses libNNPDF? AFAICT this is using the python CommonData objects as well as the covmat generation code in python. The only difference is that it multiplies the MULT uncertainties by the t0 central predictions, which are also computed using the python convolution.
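
Schematically, the t0 prescription being described amounts to something like the following sketch (array names and shapes are illustrative, and the statistical/uncorrelated part of the covmat is omitted):

```python
import numpy as np

def t0_covmat(mult_sys, add_sys, t0_predictions):
    """Schematic t0 covariance: the multiplicative systematics are
    rescaled by the t0 central predictions rather than by the data."""
    # mult_sys: (ndata, nsys) fractional, add_sys: (ndata, nsys) absolute
    mult_abs = mult_sys * t0_predictions[:, np.newaxis]
    sys = np.concatenate([mult_abs, add_sys], axis=1)
    return sys @ sys.T  # (ndata, ndata)
```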

@scarlehoff (Member, Author)

Here, for instance, libNNPDF is used.

@siranipour (Contributor)

But these are the old functions, right? In principle you don't need them.

@scarlehoff (Member, Author)

I don't know whether they are used somewhere, but they do exist. I can start by removing them; if nothing breaks, that is the easiest solution for me, of course.

@scarlehoff (Member, Author)

> But these are the old functions, right? In principle you don't need them.

@siranipour What do you mean by that? They are used and tested for. Maybe they are all "zombies", but I haven't used the internals of the vp covmat module enough to feel comfortable removing big chunks of code... especially if they make all the tests fail.

The "offending" functions are:

def dataset_inputs_covmat(

def internal_multiclosure_dataset_loader(

(the last one given the TODO on top I guess can be removed)

@scarlehoff scarlehoff deleted the removing_cpp_LHAPDFSet_python branch March 5, 2025 15:44

Labels

destroying c++, enhancement (New feature or request)

Linked issues

Change LHAPDFSet to a python interface