
Conversation

@scarlehoff (Member) commented Oct 13, 2021

Turns out it wasn't such low-hanging fruit after all, but this closes #1420.

Some caveats (the reasons why this is still a draft):

  • There are a few very fundamental places where libNNPDF still needs to be used, because there are libNNPDF objects asking for a PDF. I need to check whether I can find a workaround for some of them; at present I'm happy with my solution for those (as they also move from libNNPDF to python, it will be as easy as removing the as_libNNPDF() call).
  • grid_values in python is not as slow as I initially thought it would be. I just tried with the most naive loop-nesting and it was actually reasonable, but I need to actually benchmark it. If it turns out to be really slow for most use cases I have a few solutions in mind that would work for us, but maybe it is not even necessary.
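
For the record, the "most naive loop-nesting" is roughly the following (a minimal sketch assuming the standard LHAPDF python bindings; function and variable names are illustrative, not the actual implementation):

```python
import numpy as np
import lhapdf  # standard LHAPDF python bindings

def naive_grid_values(setname, flavours, xgrid, qgrid):
    """Illustrative only: fill a (member, flavour, x, Q) array with
    plain nested loops over xfxQ calls."""
    members = lhapdf.mkPDFs(setname)  # loads every member of the set
    res = np.empty((len(members), len(flavours), len(xgrid), len(qgrid)))
    for m, pdf in enumerate(members):
        for f, pid in enumerate(flavours):
            for i, x in enumerate(xgrid):
                for j, q in enumerate(qgrid):
                    res[m, f, i, j] = pdf.xfxQ(pid, x, q)
    return res
```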

RE the current failure in the tests: it is due to using float64; when using vp-nextfitruncard there is a difference in the 4th digit. Reasonable. Maybe more failures will occur as I move forward, though.

TODO:

  • Add an interface for alpha_s and xfxQ such that it can be easily hooked into the python interface of pineappl (see the sketch below).
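
A rough idea of the kind of adapter this TODO refers to (a minimal sketch, assuming the LHAPDF python bindings; the pineappl convolution call shown has changed name and signature between versions, so treat it as illustrative):

```python
import lhapdf
import pineappl  # python interface of pineappl

pdf = lhapdf.mkPDF("NNPDF40_nnlo_as_01180", 0)  # example set name

def xfx(pid, x, q2):
    # pineappl expects a callable (pid, x, Q^2) -> x*f(x, Q^2)
    return pdf.xfxQ2(pid, x, q2)

def alphas(q2):
    # and a callable Q^2 -> alpha_s(Q^2)
    return pdf.alphasQ2(q2)

grid = pineappl.grid.Grid.read("predictions.pineappl.lz4")
# illustrative: convolute the grid with a single (proton) PDF
predictions = grid.convolute_with_one(2212, xfx, alphas)
```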

Despite being marked for review, it cannot be merged until the example resources are updated.

@scarlehoff (Member, Author) commented Oct 14, 2021

The difference in speed is not relevant for most cases, which is why I thought "well, 1 second vs 2 seconds, who cares"; only, this difference together with the difference in the convolution adds up enough to be a pain. I'll see what I can do.

For starters, doing "the convolution and then the central value" every time is probably adding a factor of two, so I'll change it such that the convolution is done just once (no more computing the central value separately) and play a bit more with grid_values (I want to avoid cffi but it might be necessary; I think cffi is the easiest solution for this problem).
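
The point of the single convolution, as a minimal numpy sketch with illustrative shapes: the central member simply rides along in the same product instead of requiring a separate pass.

```python
import numpy as np

# illustrative shapes: an FK table of (ndata, nx) and a PDF grid of
# (1 central member + 100 replicas, nx)
fk = np.random.rand(50, 100)
pdf_grid = np.random.rand(101, 100)

predictions = pdf_grid @ fk.T   # (101, ndata) in a single pass
central_value = predictions[0]  # no extra "fktable x 1" computation
replicas = predictions[1:]
```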

I will also heavily modify core::PDF since, now that we have the lhapdf object in python, I think part of the content of that file can be simplified.

@scarlehoff scarlehoff force-pushed the removing_cpp_LHAPDFSet_python branch from 1bdd7d8 to 2c58d49 on October 14, 2021 10:44
@scarlehoff (Member, Author)

After a few benchmarks I'm actually happy with the speed of grid_values here. It can be improved, but the bulk of the problem is always either in the fktable parsing (see #1091 (comment)) or the convolution itself.
I think cffi (here) is not worth it. I'll have a look at the convolution to see whether it can be sped up easily.

@scarlehoff scarlehoff marked this pull request as ready for review October 14, 2021 15:48
@Zaharid (Contributor) commented Oct 20, 2021

I wonder whether we want to have this C++ compatibility layer here, or instead remove the one remaining use case, which is the t0 predictions.

@scarlehoff (Member, Author)

I haven't looked at the t0 code so I don't know how many other things need to be changed as well for that to be "extracted" from C++.

@Zaharid (Contributor) commented Oct 20, 2021

I don't think there is that much to do with regard to the t0: we already have the t0covmat in pure python https://docs.nnpdf.science/vp/pydataobjs.html?highlight=dataset_inputs_covmat_from_systematics#loading-covariance-matrices

so it would be a matter of putting it in the right place.

That said, I am coming to think that doing it in two steps might be good, on the grounds that one part is already done and it would result in easier-to-review PRs.

@scarlehoff scarlehoff changed the title from "Remove libNNPDF::LHAPDFSet" to "Remove libNNPDF::LHAPDFSet for a vp-based class" Oct 28, 2021
@scarlehoff scarlehoff linked an issue Oct 28, 2021 that may be closed by this pull request
@scarlehoff scarlehoff force-pushed the removing_cpp_thpredictions branch from 4e8c31f to 0bad6c1 on November 1, 2021 20:43
@scarlehoff scarlehoff force-pushed the removing_cpp_LHAPDFSet_python branch from fa7913a to b872fda on November 2, 2021 09:28
@scarlehoff (Member, Author)

For the missing test, I think I will need to re-do the "next exponent runcard" with this branch because of the move from float32 to float64.

@Zaharid Zaharid added the enhancement New feature or request label Nov 3, 2021
@Zaharid (Contributor) left a comment

Would like to comment on this.

```python
# At this point only the result of the ThPredictions is a vp object
# which includes a special CV column which needs to be pop'd
try:
    central_value = dataobj.pop("CV")
```
@Zaharid (Contributor)

Maybe mutating the object like this will be dangerous/confusing

@scarlehoff (Member, Author)

So this is probably due to my own limitations with pandas; I'll write here "what I want" and let's see whether there's a way of doing it (/cc @Zaharid).

Right now, when the predictions come from libNNPDF, they are an object that contains a "central value" and "data", and these are two different quantities.

In vp instead the predictions are just a pandas dataframe from which one cannot simply take the cv, because the central value for MC PDFs is not the same as the central value of Hessian PDFs, so my solution has been to add a new CV column with the "true" central value.

The problem is that I don't want that extra column to be part of the "data" of the prediction.

Is it clear / is there a solution to this issue?

@scarlehoff (Member, Author)

An important constraint I need to include here is that the computation of the central value and the envelope needs to happen at the same time (because it is faster to do fktable x 101 rather than fktable x 100 and later fktable x 1), so at some point the CV will have to be included in the dataframe.

The way I'm doing that is more or less clear in the example in the PDF class.
Maybe (probably) the solution is to make a copy of the dataframe at this point, so that it doesn't get mutated but is instead separated into the CV and the envelope of replicas.
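
A minimal sketch of that non-mutating variant, with a toy stand-in for the predictions dataframe:

```python
import numpy as np
import pandas as pd

# toy stand-in: one column per replica plus the special "CV" column
dataobj = pd.DataFrame(
    np.random.rand(5, 4), columns=["CV", "rep1", "rep2", "rep3"]
)

central_value = dataobj["CV"]
replicas = dataobj.drop(columns="CV")  # a new frame; dataobj is not mutated
```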

@Zaharid (Contributor) commented Nov 10, 2021

Sorry I have been slow here.

In general I believe we should be doing something simpler. I would note that, from the POV of vp, all that this code is supposed to replace is the grid_values function (with the corresponding .load() method), so in principle that is all the new PDF class needs to do.

For that, adding an interface with a stateful context manager looks a bit much. For one, an explicit parameter to grid_values would be simpler (and it is really the caller who should decide whether they want that behaviour or not, rather than the "context").
And I am not sure even that is needed: AFAICT, insofar as we want to have both "central value is replica 0" and "central value is the mean" behaviours (of which I am still not too sure), that could be done with Stats objects that return the requisite views of the data. We could always query all replicas, which would probably simplify the understanding of what _rawdata is. In practical terms I believe this would mean removing the NNPDFResult classes in favour of StatsResult, which is the more reasonable API.
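
A minimal sketch of the Stats idea (not the actual validphys classes; the names are illustrative):

```python
import numpy as np

class Stats:
    """Wraps the raw (nmembers, ...) array and exposes views of it."""
    def __init__(self, rawdata):
        self.rawdata = np.asarray(rawdata)

class MCStats(Stats):
    # for Monte Carlo sets the central value is the mean over replicas
    def central_value(self):
        return self.rawdata.mean(axis=0)

    def error_members(self):
        return self.rawdata

class HessianStats(Stats):
    # for Hessian sets member 0 *is* the central value
    def central_value(self):
        return self.rawdata[0]

    def error_members(self):
        return self.rawdata[1:]
```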

Another consideration is that grid_values is there because I wasn't going to write every possible loop in C, but we could just as well have more specialized functions now.

@scarlehoff (Member, Author)

I think these are different problems, however:

> In general I believe we should be doing something simpler. I would note that, from the POV of vp, all that this code is supposed to replace is the grid_values function (with the corresponding .load() method), so in principle that is all the new PDF class needs to do.

Sadly, for now it needs to be done like that, since the libNNPDF and vp versions have to coexist.

> For that, adding an interface with a stateful context manager looks a bit much. For one, an explicit parameter to grid_values would be simpler (and it is really the caller who should decide whether they want that behaviour or not, rather than the "context").
> And I am not sure even that is needed: AFAICT, insofar as we want to have both "central value is replica 0" and "central value is the mean" behaviours (of which I am still not too sure), that could be done with Stats objects that return the requisite views of the data. We could always query all replicas, which would probably simplify the understanding of what _rawdata is. In practical terms I believe this would mean removing the NNPDFResult classes in favour of StatsResult, which is the more reasonable API.

I agree that the whole context manager thing is horrible (not because of the context manager itself, but because of the mutability). But as I said, I'm not entirely sure how to do better without changing many different things, since:

  • The convolutions module creates a dataframe with a prediction per replica.
  • Replica 0 means a different thing for MC and Hessian PDFs and, most importantly, in one of the two it is not to be considered.
  • Doing the calculation twice is not an option because it takes much longer (not just a factor of two, whereas adding the cv to the whole convolution is essentially for free).
  • The only one who knows what to do with the CV is the PDF itself.

And, most importantly, the point at which grid_values is called is very hidden:

```python
gv1 = gv1func(qmat=[Q], vmat=FK_FLAVOURS, xmat=xgrid).squeeze(-1)
```

so either I have to add an include_central argument across many different levels, or I use a context manager that activates the central value for any function calling the PDF inside the context.
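
For concreteness, a minimal sketch of that context-manager route; all names are hypothetical and compute_grid is a stand-in for the actual evaluation:

```python
import contextlib
import contextvars

_include_central = contextvars.ContextVar("include_central", default=False)

@contextlib.contextmanager
def central_value_enabled():
    """Any grid_values call made inside this context also returns member 0."""
    token = _include_central.set(True)
    try:
        yield
    finally:
        _include_central.reset(token)

def grid_values(members, flavours, xgrid, qgrid):
    # the caller never passes a flag: the "context" decides whether
    # member 0 (the central member) is part of the returned grid
    if not _include_central.get():
        members = members[1:]
    return compute_grid(members, flavours, xgrid, qgrid)  # hypothetical helper
```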

To me it was the simplest solution, but I will be happy with any other ideas, as long as they are more or less concrete. I know this is not the best, but the Stats idea doesn't really solve it, because you don't have anything to fill the Stats with until you get out of the prediction.

We can discuss during the code meeting if this was not clear enough.

> Another consideration is that grid_values is there because I wasn't going to write every possible loop in C, but we could just as well have more specialized functions now.

This I didn't understand. I think grid_values is a function that makes sense, since we often do want the grid of flavours-x-q-replicas.

Personally I'm not unhappy with the LHAPDFSet class, since it contains everything that is useful and it is not very complicated. Maybe it can be simplified a bit more without the t0 part (e.g., I can drop the libNNPDF error types). It's the central value problem I'm not happy with.

@scarlehoff (Member, Author)

So the final proposal is:

  1. Port the t0 code to python.
  2. Always include replica 0; when people don't want it, they can use specific properties of the Stats class to remove it.

I will close this because it is probably easier to start a new PR. Also, I would want to do t0 in python first and the lhapdfset afterwards.

@scarlehoff scarlehoff closed this Nov 10, 2021
@siranipour (Contributor)

Wait, isn't the python t0 code here?

```python
def t0_covmat_from_systematics(
```

@scarlehoff (Member, Author)

But libNNPDF is still used, so I'm not sure how many steps are missing from python (as said before, I haven't looked at the cpp code for t0, and that is still true :P).

@siranipour (Contributor)

What part uses libNNPDF? AFAICT this is using the python CommonData objects as well as the covmat generation code in python. The only difference is that it multiplies the MULT uncertainties by the t0 central predictions, which are also computed using the python convolution.
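
Schematically, the t0 prescription being described amounts to something like the following sketch (array names and shapes are illustrative, and the statistical/uncorrelated part of the covmat is omitted):

```python
import numpy as np

def t0_covmat(mult_sys, add_sys, t0_predictions):
    """Schematic t0 covariance: the multiplicative systematics are
    rescaled by the t0 central predictions rather than by the data."""
    # mult_sys: (ndata, nsys) fractional, add_sys: (ndata, nsys) absolute
    mult_abs = mult_sys * t0_predictions[:, np.newaxis]
    sys = np.concatenate([mult_abs, add_sys], axis=1)
    return sys @ sys.T  # (ndata, ndata)
```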

@scarlehoff (Member, Author)

Here, for instance, libNNPDF is used.

@siranipour (Contributor)

But these are the old functions, right? In principle you don't need them.

@scarlehoff (Member, Author)

I don't know whether they are used somewhere, but they do exist. I can start by removing them; if nothing breaks, that is the easiest solution for me, of course.

@scarlehoff (Member, Author)

> But these are the old functions, right? In principle you don't need them.

@siranipour What do you mean by that? They are used and tested for. Maybe they are all "zombies", but I haven't used the internals of the vp covmat module enough to feel comfortable removing big chunks of code... especially if they make all the tests fail.

The "offending" functions are:

def dataset_inputs_covmat(

def internal_multiclosure_dataset_loader(

(the last one given the TODO on top I guess can be removed)

@scarlehoff scarlehoff deleted the removing_cpp_LHAPDFSet_python branch March 5, 2025 15:44

Labels

destroying c++, enhancement (New feature or request)

Linked issues

Change LHAPDFSet to a python interface