Make column names valid Python identifiers #21

@rmnldwg

Since we need a three-level header in the lyDATA columns to represent (1) diagnostic modality, (2) side, and (3) LNL in the involvement table, every other column, e.g. about age or HPV status, also has three levels. To keep it short, I thought it was a good idea to use the filler # as the second-level header for patient information.

However, that turned out to be a stupid idea, since pandas allows accessing multi-level headers using dot-notation only if the header names are valid Python identifiers. E.g., dataframe.patient.info.age looks like a valid way of accessing the column ("patient", "info", "age") (syntactically it is, although info happens to be shadowed by the DataFrame.info method, so a name like that would need care as well). But dataframe.patient.#.age is obviously NOT valid Python code, since # starts a comment.
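To make the constraint concrete, here is a minimal sketch on a made-up two-column table (the column names and values are illustrative assumptions, not actual lyDATA content). It shows that tuple indexing works for any label, that "#" fails the identifier check while "_" passes, and that even a valid identifier like `info` can be shadowed by an existing DataFrame method.

```python
import pandas as pd

# Hypothetical miniature table; names and values are illustrative only.
columns = pd.MultiIndex.from_tuples([
    ("patient", "#", "hpv_status"),
    ("patient", "info", "age"),
])
df = pd.DataFrame([[True, 62], [False, 57]], columns=columns)

# Tuple indexing works for any label, whether it is an identifier or not.
hpv_status = df[("patient", "#", "hpv_status")]

# Attribute access needs a valid Python identifier at every level.
hash_ok = "#".isidentifier()        # False
underscore_ok = "_".isidentifier()  # True

# A valid identifier can still be shadowed by a DataFrame attribute:
# `info` resolves to the DataFrame.info method, not to the column group.
info_is_method = callable(df.patient.info)
```

So a filler for the second level has to be both a valid identifier and a name that pandas does not already use.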

Similarly, I thought it might be useful to allow multiple synchronous tumors by enumerating the tumors a patient has in the second-level header. So, a patient could have info about one tumor under ("tumor", "1", "t_stage") and about another under ("tumor", "2", "t_stage"). But this too was stupid: Few patients have synchronous tumors, and most of them are somewhat extraordinary and excluded from most studies. So, all this scheme did was, again, create column headers that are NOT valid Python identifiers.

Therefore, I propose to change the second-level header for both the "patient" and the "tumor" top-level headers to a simple underscore. Then we could use dataframe.patient._.age and dataframe.tumor._.t_stage. This would also open the door for nice type hinting and auto-completion in an IDE like VS Code.
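A rough sketch of what the change could look like on an existing table, assuming a tiny made-up dataframe. The helper `use_underscore_filler`, the example columns ("CT", "ipsi", "II") and all values are hypothetical and only serve to illustrate the proposal; this is not the actual lyDATA implementation.

```python
import pandas as pd

# Hypothetical miniature table using the current naming scheme.
old_columns = pd.MultiIndex.from_tuples([
    ("CT", "ipsi", "II"),
    ("patient", "#", "age"),
    ("tumor", "1", "t_stage"),
])
df = pd.DataFrame([[True, 62, 2], [False, 57, 3]], columns=old_columns)


def use_underscore_filler(columns: pd.MultiIndex) -> pd.MultiIndex:
    """Replace the second level under "patient" and "tumor" with "_".

    Hypothetical helper: all other top-level headers are left untouched.
    """
    return pd.MultiIndex.from_tuples(
        [
            (top, "_", low) if top in ("patient", "tumor") else (top, mid, low)
            for top, mid, low in columns
        ],
        names=columns.names,
    )


df.columns = use_underscore_filler(df.columns)

# Every level is now a valid identifier, so dot-notation works throughout.
ages = df.patient._.age
t_stages = df.tumor._.t_stage
```

Note that this simple version maps every tumor index to "_", so a table that really contains ("tumor", "2", ...) columns would end up with duplicate headers and would need those columns dropped or handled separately first.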

Labels

enhancement (New feature or request)
