MAINT: get orsopy working in 3.12. Closes #115 #116

andyfaff · 2023-12-04T04:44:53Z

Adds a whole bunch of boilerplate in init functions, but would get orsopy working on 3.12.

aglavic · 2023-12-04T08:23:18Z

HI @andyfaff ,
coult you explain your idea behind this changes a bit. For me this looks like making the Header classes much harder to extend and adding a lot of code to all classes doing init routines that should actually be handled by the dataclass meta functionality.

What do we gain from using dataclass at all, if we need a custom init function for bascially every class?

Anyway, I'll have a look at it and will also check if it influences functionality anywhere.

andyfaff · 2023-12-04T09:32:05Z

Unfortunately the stdlib dataclasses (from lib/dataclasses.py) aren't designed to accept extra keyword arguments to create attributes that aren't listed on the class. e.g. one can't have HeaderDerivedClass(attr1, extra_attr=5), where attr1 is specified in the class definition, but extra_attr isn't. Because one of our use cases was to allow these unknown attributes we had modify how the dataclasses worked. This was done by creating the orsodataclass decorator.

orsodataclass required a knowledge of how dataclasses worked internally, together with the creation of _custom_init_fn and _create_fn, as well as several private attributes/functions from the dataclasses module. Unfortunately cp312 altered how some of these functions worked internally. Before cp312 various variables were created in our version of _init_fn using:

locals = {f"_type_{f.name}": f.type for f in fieldsarg}

However, in cp312 the stdlib _init_fn creates them like:

locals = {f'__dataclass_type_{f.name}__': f.type for f in fields}

This means that our _custom_init_fn works with cp<312, but not with cp312 because the rest of lib/dataclasses.py is incompatible with the variable naming scheme.

It may be possible to make a shim for this particular use case, but it's a fragile solution. I don't think we should be depending on our knowledge of the internals of dataclasses, or on private functions/attributes, because it's inherently subject to change outside our control. Every time a new Python version comes out we have to hope that nothing has changed (e.g. look at orsopy.dataclasses to see how one has to do various meddling for different Python versions. Speaking of which I could probably delete that file).

So, how is a solution reached? The first solution could be to vendor the lib/dataclasses.py file from stdlib. In that way we could keep our existing customised dataclass operation. However, I actually find the existing code quite hard to work with, because the monkey patching means you have to be familiar with lib/dataclasses.py as well as our overlay.

This PR takes a different route. Instead of using an orsodataclass decorator we just use the dataclass decorator directly. In order to provide the ability to provide the extra_attr mentioned above we have to write an __init__ method for each class.

e.g. for orsopy.fileio.base.Value we add:

    def __init__(self, magnitude, unit=None, error=None, *, comment=None, **kwds):
        super(Value, self).__init__()
        self.magnitude = magnitude
        self.unit = unit
        self.error = error
        for k, v in kwds.items():
            setattr(self, k, v)
        self.comment = comment
        self.__post_init__()

In this way we can add extra kwds without any issues. The stdlib does allow for dataclasses to provide their own __init__ methods, so operation is quite simple. One just has to make sure that one calls __post_init__ at the end of initialisation. IMO this approach is much simpler than the existing code. It's straightforward to extend/add extra attributes. We just add an extra parameter to the __init__ method, and specify the attribute in the class definition. The drawback is that it does add an amount of boilerplate initialisation.

What do we gain from using dataclass at all, if we need a custom init function for bascially every class?

It's basically the same __init__ method, but with different attribute names. The main reason we still benefit from using dataclasses is the validation carried out in Header.__post_init__ and Header._resolve_type. This is important when creating instances (no point in creating Value("qwerty") when it should be Value(1.0)) of all the Header classes. However, it's even more important when we de-serialise all the yaml. dataclasses can store all their types in their field information, and we use those types to figure out how to marshal the data into the different Header classes.

bmaranville · 2023-12-04T16:51:48Z

tools/header_schema.py

+"""
+** NOTE **
+This script is extremely sensitive to the version of pydantic that is installed.
+It works with pydantic-1.10.13 and lower, but not with pydantic >= 2.0


Yes, the bump to pydantic 2.0 is basically a complete rewrite of pydantic. They broke a bunch of things that used to make dataclasses similar in functionality to pydantic BaseModel. I think pydantic.dataclasses are not the top priority for them so they won't be brought back up to feature-parity with 1.x for a little while.

I spoke too soon... looks like they have made everything work really well! I hadn't tried it for a while. All we have to do is

from pydantic import TypeAdapter ... schema.update(TypeAdapter(Orso).json_schema())

bmaranville · 2023-12-04T17:35:22Z

What about this as the init function for Header, inherited by all the subclasses? It seems to work. Then you don't have to define an __init__ function for every class.

    def __init__(self, *args, **kwargs):
        self._orso_optionals = []
        own_fields = fields(self)
        own_field_names = [f.name for f in own_fields]
        field_index = 0
        for a in args:
            # field order is guaranteed
            setattr(self, own_field_names[field_index], a)
            field_index += 1
        for (k, v) in kwargs.items():
            setattr(self, k, v)
        if not hasattr(self, 'comment'):
            self.comment = kwargs.get('comment', None)

        self.__post_init__()

(I'm not sure if the special handling for comment is required)

bmaranville · 2023-12-04T20:13:29Z

Update: this __init__ function makes all the tests pass (after removing all the other init functions except class Orso and class Header:

class Header:
    """
    The super class for all of the items in the orso module.
    """
    _orso_optionals: List[str] = []

    def __init__(self, *args, **kwargs):
        self._orso_optionals = []
        own_fields = list(fields(self))
        for a in args:
            # field order is guaranteed
            field = own_fields.pop(0)
            setattr(self, field.name, a)
        # process the rest of own_fields, in order:
        for field in own_fields:
            if field.name in kwargs:
                field_value = kwargs.pop(field.name)
            elif field.default_factory is not MISSING:
                field_value = field.default_factory()
            elif field.default is not MISSING:
                field_value = field.default
            else:
                raise ValueError(f"missing positional argument \"{field.name}\" for class {self.__class__.__name__}")
            setattr(self, field.name, field_value)
        # process any remaining kwargs:
        for (k, v) in kwargs.items():
            setattr(self, k, v)

        self.__post_init__()

I am curious why extra kwargs are added as class attributes for every class except Orso? Why do we use the specially-created attribute user_data for that one class?

Attached is a diff, that if applied to this PR (in its current state) makes it pass all the tests.
inherit_init_patch.zip

andyfaff · 2023-12-04T23:18:09Z

Update: this init function makes all the tests pass

I tried the patch. There are pros and cons. On the one hand it makes all the classes derived from Header cleaner by removing all the boiler plate. However, it has an unfortunate side effect of making the help/docstring more opaque - the call signature gets replaced by *args, **kwargs instead of the actual signature that would be accepted.

>>> Person? 
Init signature: Person(*args, **kwargs)
Docstring:     
Information about a person, including name, affilation(s), and contact
information.
File:           ~/Documents/Andy/programming/orsopy/orsopy/fileio/base.py
Type:           type
Subclasses:

There would seem to be a tradeoff here. Personally I would prefer to see the correct signature than remove the boilerplate. What are your thoughts?
(I wonder if it'd be possible to write yet another decorator that would restore the function signature that would be displayed by help)

I am curious why extra kwargs are added as class attributes for every class except Orso? Why do we use the specially-created attribute user_data for that one class?

I can't remember why it's called user_data rather than anything else. It's used to include/absorb extra fields from the Header. I'd be wary about removing it though.

andyfaff · 2023-12-05T00:07:14Z

(I wonder if it'd be possible to write yet another decorator that would restore the function signature that would be displayed by help)

e.g.

def inherit_docstring_from(cls):
    pars = []
    for f in fields(cls):
        if not f.default:
            default = f.default
        else:
            default = Parameter.empty
        p = Parameter(f.name, Parameter.POSITIONAL_OR_KEYWORD, annotation=f.type, default=default)
        pars.append(p)

    doc = cls.__doc__
    cls.__doc__ = "Init signature:  " + cls.__name__ + str(Signature(pars)) + "\n" + doc
    return cls


@inherit_docstring_from
@dataclass(init=False)
class Pop(Header):
    """
    My test
    """
    _id: int
    unit: Optional[str] = field(default=None, metadata={"description": "SI unit string"})
    error: Optional[int] = None
    comment: Optional[str] = None

would give something like:

Init signature: Pop(*args, **kwargs)
Docstring:     
Init signature:  Pop(_id: int, unit: Optional[str] = None, error: Optional[int] = None, comment: Optional[str] = None)

My test
Type:           type
Subclasses:

which would be better than nothing

bmaranville · 2023-12-05T00:34:35Z

I agree, the lack of meaningful signatures is not great. This problem we're trying to solve, creating classes where core attributes are strongly typed and documented, but where extra attributes can be added (and serialized, and deserialized!), is hard.
I have to say it feels weird to tack on random attributes to a defined class here in almost 2024 - while it felt completely normal to me 10 years ago. I think the accepted practice on this is shifting with the mainstream use of type hints and dataclasses.

andyfaff · 2023-12-05T01:35:19Z

The advantage of being able to cope with extra attributes can be important in versioning orsopy. For example, we could add new attributes in a later orsopy. Files created by that newer version should still be loadable by an older version that copes with extra attributes.

I propose:

merge this PR as-is (if Artur agrees)
continue the discussion onto how/if to replace the boilerplate, and working around the docstring issue if we use the suggested patch.
make a concerted effort to complete Data order columns first #107. Artur wasn't sure about that one. I would like to get something along these lines completed so I can start contemplating ORSO files with detailed resolution information.
make a release.

aglavic · 2023-12-08T08:17:08Z

I would be against merging this PR, we should have an extended discussion about this as it is the basis for all orsopy classes. (And it might even break genx or other software the builds on orsopy by extending classes.)

I have a few main objections with this solution:

If we want people to adapt and extend the data format, it should be as simple as possible for them. They should not have to know about our library internals to be able to create a Header derived class. That is given if you can just derive from another class or use the same logic as when creating dataclasses but not the case if you need to write custom init functions whose logic you don't understand as an outsider.
The previous solution concentrated business logic to deal with these keywords in one place, now it is spread over all classes and might make as suceptible to bugs if something is missed in one of these classes. It also makes the code less readable.
Skipping the dataclass init removes some of the dataclass functionality and could again brake, if they change how dataclasses work. For me it also seems that we re-do the main functionality of dataclasses ourselfs, in which case we could directly remove that part and keep the whole logic contained in orsopy.

For me there are two alternative routs that might be more practicle:

Go for the metaclass solution that we had in discussion at the beginning of the project. This way the classes can just be derived withou any need for @DataClass etc.
Follow Andrews suggestion of just shipping orsopy with the original dataclass module from python 3.11 (maybe reneamed) and the modifications we need for orsopy. For me this is the easiest and cleanest solution, all internal code can stay the same and we probably don't want future changes in how types/dataclasses work in this project, anyway.

I'm sorry to slow you down at this point. I'm aware that this is just based on my opinion and if everyone disagrees you can obviously go ahead in this direction.
Unfortunately I didn't have much time to invest in the last couple of months and will be on vacation starting tomorrow. If you don't mind to wait so long, we could have a videocall end of January about this.

andyfaff · 2023-12-08T08:24:22Z

I hope you have a good holiday. Given that you'll be away before I can reply in full, would the genx test suite pick up any issues if we were to go down this path?

bmaranville · 2023-12-08T14:17:53Z

I think I agree with Artur that we should rethink our approach. It probably merits a longer discussion, when we're all back from the holidays. I think dataclasses provide some helpful features for defining a schema in python, but they are intentionally designed to not support arbitrary init kwargs, since it's kind of the point of a dataclass to define all your fields.

I also don't understand why downstream users couldn't inherit from Header subclasses and add their own fields, instead of tacking attrs to existing classes. For the migration problem (supporting future changes to the schema), I think we'd be much better served by developing migration functions, which is a fairly standard way to handle schema changes. The likelihood that future changes to the ORSO standard structure would be purely additive or subtractive of fields is pretty small, and I think that is the only type of schema change we could try to accomodate with arbitrary kwargs.

I know I am likely wrong about some of my concerns above, but my vote is to hold off on any refactor for now. There's not even e.g. a numba wheel for python 3.12 yet so I don't feel like there's a rush to have complete support for it at the moment.

bmaranville · 2023-12-12T14:19:25Z

What do you think of a solution like this: (found on stackoverflow)... instead of trying to force extra kwargs into dataclass init functions (which is hard and fragile), how about adding a new method "from_kwargs" for instantiating one of our classes with extra attributes? https://stackoverflow.com/a/55101438

We could use that function in the deserializer when reading .ort files, and then we wouldn't have to modify the __init__. The code for from_kwargs(**kw) would be much simpler, since we're not trying to do all the things dataclass initializers do, just run the built-in init and then add new kw.

aglavic · 2024-02-19T14:57:42Z

This could be a solution that could also be handled with inheritance for most cases without having to write a custom init for every single case. We will have to discuss how to treat the "comments" case. I'll have a look at the code a gain and will try some prototyping. I think after diving back into the code I can better weigh the various pros and cons, too.

@andyfaff @bmaranville @arm61
Can we find a date for a video call to discuss the strategy? For me the next weeks Mo/Tu look quite good.

aglavic · 2024-02-19T23:11:06Z

I've made a test implementation, have a look at #120

andyfaff · 2024-02-20T05:04:45Z

Hi Artur,
good to see you back online. I'll have a look at your suggestions in the next few days.

Alternative dataclass solution using from_dict method, would close Closes #114 #116

aglavic · 2024-02-23T08:36:36Z

Resolved by #120

andyfaff requested review from aglavic and bmaranville December 4, 2023 04:44

bmaranville reviewed Dec 4, 2023

View reviewed changes

andyfaff added 4 commits December 7, 2023 18:40

MAINT: get orsopy working in 3.12

da9ca23

MAINT: fix schema

2fc509e

MAINT: np.typing not available for cp36

f38d399

BUG: typo

b199e44

andyfaff force-pushed the redux branch from cb9a2cd to b199e44 Compare December 7, 2023 07:40

aglavic added a commit that referenced this pull request Feb 23, 2024

Merge pull request #120 from aglavic/from_dict

ef73600

Alternative dataclass solution using from_dict method, would close Closes #114 #116

aglavic closed this Feb 23, 2024

andyfaff mentioned this pull request Feb 26, 2024

Upgrade to Pydantic V2 #124

Closed

MAINT: get orsopy working in 3.12. Closes #115 #116

MAINT: get orsopy working in 3.12. Closes #115 #116

Uh oh!

Conversation

andyfaff commented Dec 4, 2023

Uh oh!

aglavic commented Dec 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andyfaff commented Dec 4, 2023

Uh oh!

bmaranville Dec 4, 2023

Choose a reason for hiding this comment

Uh oh!

bmaranville Dec 4, 2023

Choose a reason for hiding this comment

Uh oh!

bmaranville commented Dec 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bmaranville commented Dec 4, 2023

Uh oh!

andyfaff commented Dec 4, 2023

Uh oh!

andyfaff commented Dec 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bmaranville commented Dec 5, 2023

Uh oh!

andyfaff commented Dec 5, 2023

Uh oh!

aglavic commented Dec 8, 2023

Uh oh!

andyfaff commented Dec 8, 2023

Uh oh!

bmaranville commented Dec 8, 2023

Uh oh!

bmaranville commented Dec 12, 2023

Uh oh!

aglavic commented Feb 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aglavic commented Feb 19, 2024

Uh oh!

andyfaff commented Feb 20, 2024

Uh oh!

aglavic commented Feb 23, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

aglavic commented Dec 4, 2023 •

edited

Loading

bmaranville commented Dec 4, 2023 •

edited

Loading

andyfaff commented Dec 5, 2023 •

edited

Loading

aglavic commented Feb 19, 2024 •

edited

Loading