Integrate concept normalization component to cnlpt #124
dongfang91 wants to merge 3 commits into main from
Conversation
tmills
left a comment
A few changes I'd like to see (and work together with @dongfang91 on).
The "task name" argument is now just a way of referring to a column in a data file, and should not be hardcoded in the data processing code. We no longer use task names to map to task types (classification, tagging, etc.); we now infer the type from the file format. So let's separate the conceptnorm task (a fine name for the column) from the task type, which can be generalized to something like cossim? (IIRC, the differentiating aspect of this task is a massive one-hot output space where we use a cosine similarity layer instead of a softmax.)
We should come up with a data format that is unique to cossim and modify cnlp_processors.py to infer that format correctly. The existing formats use a label-then-text layout, and the proposed format looks to invert that -- it's probably less confusing if we switch it to match the other tasks.
I agree. We could definitely infer the task type from the file format, since the output label is always a CUI (a capital letter C followed by digits). But one thing I'm concerned about is that the number of CUIs in the output space is larger than the total number of CUIs seen in the existing training data, which means the input data should include a file covering all the CUIs. If someone wants to use our code to train concept normalization models, four inputs would be required: training data, all the CUIs, a giant embedding matrix for those CUIs, and a CUI-less threshold; if someone only uses our models for inference, then only the full CUI list is required. I assume the data format is used during training? If so, this data format should be able to cover all four inputs.
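The inference rule described above can be sketched with a regex check over a label column. The function name is hypothetical; the "CUI-less" sentinel comes from the CUI-less threshold mentioned above.

```python
import re

# A CUI is a capital letter C followed by digits (e.g. C0027051).
CUI_PATTERN = re.compile(r"^C\d+$")

def looks_like_cossim_task(labels):
    """Heuristic sketch: if every label in a column is a CUI (or the
    CUI-less sentinel for unmappable mentions), infer that the column
    belongs to a cossim-type task."""
    return all(CUI_PATTERN.match(lab) or lab == "CUI-less" for lab in labels)

print(looks_like_cossim_task(["C0027051", "C0018799", "CUI-less"]))  # True
print(looks_like_cossim_task(["positive", "negative"]))              # False
```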
Yes, I think that's why treating this as a brand-new task type is important -- we can infer what type it is from the standard files, but once we realize it's a cossim-type task we will know to look for the extra required file with the output space explicitly specified. Yes, the data format is mainly for training, but it could also be used in --do_predict or --do_eval mode.
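Once a cossim task is detected, the extra output-space file would be read in. A minimal sketch of such a loader, assuming a one-CUI-per-line file (the function name and file layout are hypothetical, not settled by this discussion):

```python
import os
import tempfile

def load_cui_output_space(path):
    """Hypothetical loader for the extra file a cossim task requires:
    one CUI per line, defining the full output label space (which is
    larger than the set of CUIs seen in the training data)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# demo with a throwaway file standing in for the real CUI list
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("C0027051\nC0018799\nCUI-less\n")
cuis = load_cui_output_space(tmp.name)
os.unlink(tmp.name)
print(cuis)  # ['C0027051', 'C0018799', 'CUI-less']
```

The loaded list would fix both the size and the row order of the concept embedding matrix, so the same file serves training and inference.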
To do for Dongfang:

Integrate the concept normalization components into this branch. A few places to check:

- How to get the labels during the test:
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/train_system.py#L702
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/train_system.py#L721
- Change event_tokens to event_mask:
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/CnlpModelForClassification.py#L477
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/CnlpModelForClassification.py#L533