Integrate concept normalization component to cnlpt #124
dongfang91 wants to merge 3 commits into main from
Conversation
tmills
left a comment
A few changes I'd like to see (and work together with @dongfang91 on).
The "task name" argument is now just a way of referring to a column in a data file, and should not be hardcoded in the data processing code. We no longer use task names to map to task types (classification, tagging, etc.); we now infer the type from the file format. So let's separate the conceptnorm task (a fine name for the column) from the task type, which can be generalized to something like cossim? (IIRC, the differentiating aspect of this task is a massive one-hot output space where we use a cosine similarity layer instead of a softmax.)
We should come up with a data format that is unique to cossim and modify cnlp_processors.py to infer that format correctly. The existing formats use a label-then-text layout, and the proposed format looks to invert that -- it's probably less confusing if we switch it to match the other tasks.
I agree. We could definitely infer the task type from the file format, since the output label is always a CUI (a capital letter C followed by digits). But one thing I'm concerned about is that the number of CUIs in the output space is larger than the total number of CUIs seen in the existing training data, which means the input data should include a file covering all the CUIs. If someone wants to use our code to train concept normalization models, four inputs would be required: training data, all the CUIs, a giant embedding matrix for those CUIs, and a CUI-less threshold; if someone only uses our models for inference, then only the full CUI list is required. I assume the data format is used during training? If so, this data format should be able to cover all four inputs.
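The inference rule described above can be sketched with a regex check over a label column. The function name is hypothetical; the "CUI-less" sentinel comes from the CUI-less threshold mentioned above.

```python
import re

# A CUI is a capital letter C followed by digits (e.g. C0027051).
CUI_PATTERN = re.compile(r"^C\d+$")

def looks_like_cossim_task(labels):
    """Heuristic sketch: if every label in a column is a CUI (or the
    CUI-less sentinel for unmappable mentions), infer that the column
    belongs to a cossim-type task."""
    return all(CUI_PATTERN.match(lab) or lab == "CUI-less" for lab in labels)

print(looks_like_cossim_task(["C0027051", "C0018799", "CUI-less"]))  # True
print(looks_like_cossim_task(["positive", "negative"]))              # False
```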
Yes, I think that's why treating this as a brand-new task type is important -- we can infer what type it is from the standard files, but once we realize it's a cossim-type task we will know to look for the extra required file with the output space explicitly specified. Yes, the data format is mainly for training, but it could also be used in --do_predict or --do_eval mode.
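Once a cossim task is detected, the extra output-space file would be read in. A minimal sketch of such a loader, assuming a one-CUI-per-line file (the function name and file layout are hypothetical, not settled by this discussion):

```python
import os
import tempfile

def load_cui_output_space(path):
    """Hypothetical loader for the extra file a cossim task requires:
    one CUI per line, defining the full output label space (which is
    larger than the set of CUIs seen in the training data)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# demo with a throwaway file standing in for the real CUI list
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("C0027051\nC0018799\nCUI-less\n")
cuis = load_cui_output_space(tmp.name)
os.unlink(tmp.name)
print(cuis)  # ['C0027051', 'C0018799', 'CUI-less']
```

The loaded list would fix both the size and the row order of the concept embedding matrix, so the same file serves training and inference.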
To do for Dongfang:

Integrate the concept normalization components into this branch. A few places to check:

- How to get the labels during the test:
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/train_system.py#L702
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/train_system.py#L721
- Change event_tokens to event_mask:
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/CnlpModelForClassification.py#L477
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/CnlpModelForClassification.py#L533