
Conversation

@wtt102 (Contributor) commented Nov 6, 2022

Completed

  • Regression now uses map-style datasets
  • Single datamodule in datamodules.py that gets instantiated within each regression lightning module
  • All unit tests run except for markov (not converging); currently troubleshooting this.
  • Same results and speed between the dataloader and datamodule methods

In progress / needed

  • Fix markov not converging
  • Get notmad to instantiate its own datamodule as well (should be straightforward given the current state of the code)
  • Possibly abstract or move the notmad and regression datamodules into a single file (though keeping them separate may be less confusing)
  • Proofread documentation

Bugfixes for easy modules' predict_params; added early-stopping kwargs to the easy modules.
@cnellington (Collaborator) left a comment

Overall looks good for this update. Minor comments on style and logic. The tests aren't passing for Markov datamodules because we're testing for convergence on a random X (i.e. we shouldn't expect generalization to held-out sets because there's no function to learn). You might concatenate X and Y features into a single matrix in the markov _quicktests to fix this. So long as we don't see terrible overfitting on the X -> X regression, the X -> Y regression should converge.
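A minimal sketch of that suggested fix, assuming the quicktest builds its data with numpy (the variable names and the fit call are illustrative, not the repo's actual API):

import numpy as np

# Hypothetical quicktest data: n samples, context C and features X, with Y a function of X.
n, c_dim, x_dim, y_dim = 100, 4, 5, 3
C = np.random.normal(size=(n, c_dim))
X = np.random.normal(size=(n, x_dim))
Y = X @ np.random.normal(size=(x_dim, y_dim))  # Y depends on X, so there is structure to learn

# Concatenate X and Y into one matrix so the markov quicktest fits structure among
# related variables rather than regressing onto an unrelated random target;
# convergence on held-out data is then meaningful.
XY = np.concatenate([X, Y], axis=1)  # shape (n, x_dim + y_dim)
# model.fit(C, XY)  # illustrative: fit the markov model on the combined matrix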

@cnellington (Collaborator) left a comment

Great work! Some overall comments:

It seems like there's a major bug in the dataloader indexing that we both overlooked in the last review. Given the tricky logic issues we've seen from trying to make the dataloaders index-based, we need some dataset tests to verify that the dataloaders actually contain all the data we think they do. I'd suggest creating tests where you try to reconstruct the input matrices from the dataloaders. This reversed construction should reveal any bugs we have with indexing.
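For example, a rough shape such a reconstruction test could take (the batch layout and the per-sample index field are assumptions about the dataset's output, not the actual interface):

import numpy as np

def check_dataloader_covers_X(X, dataloader):
    # Rebuild X row-by-row from the batches and confirm every row appears
    # and matches the original input matrix.
    X_rebuilt = np.zeros_like(X)
    seen = np.zeros(X.shape[0], dtype=bool)
    for c, x, y, idx in dataloader:  # assumed batch layout: (C, X, Y, sample index)
        i = idx.numpy()
        X_rebuilt[i] = x.numpy()
        seen[i] = True
    assert seen.all(), "some rows never appear in the dataloader"
    assert np.allclose(X_rebuilt, X), "reconstructed X does not match the input matrix"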


    def __iter__(self):
        if idx == []:  # use full dataset
            self.idx = range(self.n)
@cnellington (Collaborator):

I'm confused about idx. It's used to select from all_samples (which, in the multitask dataset is length n * y_dim, and in the univariate multitask dataset is n * y_dim * x_dim), but is set by default to a range of n. This is almost certainly causing a bug in the multitask datasets right now, where we select n entries from all_samples to place into samples.

The name of the kwarg makes me think this should select samples from C, X, and Y before the all_samples transformation.

If it is used, idx should be set to index the whole dataset as the default kwarg. Currently it is set to [] by default and reset immediately.
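To make the mismatch concrete (the numbers below are made up):

# With n = 4 rows and y_dim = 3 tasks, all_samples has n * y_dim = 12 entries,
# but the default idx = range(n) only pulls 4 of them into samples.
n, y_dim = 4, 3
all_samples = [(i, t) for i in range(n) for t in range(y_dim)]
idx = range(n)
samples = [all_samples[i] for i in idx]
print(samples)  # [(0, 0), (0, 1), (0, 2), (1, 0)] -- rows 2 and 3 are dropped entirely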

@wtt102 (Contributor, author):

Thanks for pointing this out. Before "_generate_sample_ids" was abstracted, each dataset had its own variation of the line self.sample_ids = [self.all_sample_ids[i] for i in self.idx], for instance something like self.sample_ids = [s for s in self.all_samples if s[1] in self.idx] (for the case where the total number of samples is n * y_dim, with y_dim=3). That made randomization easier at the time, because the random train/val/test indices were generated at the datamodule level, while iterating to create the additional samples (e.g., y_dim * n of them) happened at the dataset level. I forgot about this when doing the abstraction. This can easily be replaced by specifying the dataset type in the datamodule and using if statements to create the indices over the full range beforehand.

@cnellington (Collaborator):

I think the multi-task indexing logic should only live in the actual datasets (not the datamodule). That way we only have to think about sample-level indexing outside of the datasets.

I think it might make more sense to trim the input C, X, Y matrices according to the sample idx in the baseclass __init__, and then create the randomized multi-task sample list using these trimmed matrices and the subclass-specific __len__.
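A rough sketch of that structure, with illustrative names only (the real base class and its sample format will differ):

import numpy as np

class DatasetSketch:
    # Trim C, X, Y by the sample-level idx once, in the base __init__, then let each
    # subclass expand the trimmed matrices into its own (possibly multi-task) sample list.
    def __init__(self, C, X, Y, idx=None):
        if idx is None:  # default: use the whole dataset
            idx = np.arange(len(C))
        self.C, self.X, self.Y = C[idx], X[idx], Y[idx]
        self.n = len(idx)
        self.samples = self._generate_samples()

    def _generate_samples(self):
        # base case: one sample per row; multitask subclasses override this to produce
        # n * y_dim (or n * y_dim * x_dim) entries built from the trimmed matrices
        return [(self.C[i], self.X[i], self.Y[i], i) for i in range(self.n)]

    def __len__(self):
        return len(self.samples)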

@cnellington (Collaborator):

I think this still needs attention. idx still needs to select samples before self.samples is generated. Also let's use None as a default "unspecified" kwarg since [] isn't adding any important functionality.


        self._generate_samples()

    def _generate_samples(self):
@cnellington (Collaborator):

Is this used outside of __iter__? If not, this should all be contained in __iter__.

@cnellington (Collaborator):

Might need to be moved to __init__ now.

        self.val_dataset = self.dataset(self.C, self.X, self.Y, idx=self.val_idx)
        # self.pred_dataset = self.dataset(self.C, self.X, self.Y, idx=self.pred_idx)

    def setup(self, stage: str):
@cnellington (Collaborator):

Unclear what should be happening here; this method currently does nothing.

@wtt102 (Contributor, author):

In the example code that torch provides, setup is one of the functions they include so that operations specific to each of the different runs (stages) have a place to live. I'm not sure whether this will be useful for our purposes in the future.

@cnellington (Collaborator):

I see. Sounds like we should use this to control which dataloader is being referenced for loading samples, which would address some of the other comments.

@wtt102 (Contributor, author):

Sounds like a great idea. In reference to your other comment, pytorch does seem to require a predict_dataloader, so the "setup" function is now used to swap which of the three datasets predict_dataloader uses. I've so far left the "train"/"test"/"val" stages out of the setup function, since they aren't needed yet and I want to minimize holding onto excess variables. I'm currently troubleshooting some bugs, but I'll let you know when I can update the repo to reflect these changes.
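A rough sketch of that arrangement, assuming a PyTorch Lightning datamodule (the predict_on argument and the dataset attributes are illustrative, not the repo's actual names):

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class RegressionDataModuleSketch(pl.LightningDataModule):
    # Illustrative only: setup() decides which split predict_dataloader() serves.
    def __init__(self, train_dataset, val_dataset, test_dataset, batch_size=32):
        super().__init__()
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.test_dataset = test_dataset
        self.batch_size = batch_size
        self.pred_dataset = test_dataset  # default prediction split

    def setup(self, stage=None, predict_on="test"):
        # Swap the dataset that predict_dataloader() returns; the train/val/test stages
        # are left alone because their dataloaders read their datasets directly.
        if stage == "predict":
            self.pred_dataset = {
                "train": self.train_dataset,
                "val": self.val_dataset,
                "test": self.test_dataset,
            }[predict_on]

    def predict_dataloader(self):
        return DataLoader(self.pred_dataset, batch_size=self.batch_size)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size)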

        self.train_dataset = self.dataset(self.C, self.X, self.Y, idx=self.train_idx)
        self.test_dataset = self.dataset(self.C, self.X, self.Y, idx=self.test_idx)
        self.val_dataset = self.dataset(self.C, self.X, self.Y, idx=self.val_idx)
        # self.pred_dataset = self.dataset(self.C, self.X, self.Y, idx=self.pred_idx)
@cnellington (Collaborator):

Do we need a pred dataset? Seems like we would only ever need full, train, test, and val.

"""

-    def predict_params(self, model, dataloader):
+    def predict_params(self, model, dataclass):
@cnellington (Collaborator):

This might be a good place to specify what set in the datamodule we're trying to predict from.
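For instance, something along these lines, where the dataset kwarg and the predict_on hook from the datamodule sketch above are hypothetical rather than the existing trainer API:

import pytorch_lightning as pl

class RegressionTrainerSketch(pl.Trainer):
    # Hypothetical: callers name the datamodule split to predict from, instead of
    # reaching into individual datasets themselves.
    def predict_params(self, model, datamodule, dataset="test"):
        datamodule.setup(stage="predict", predict_on=dataset)  # point predict_dataloader at the split
        preds = self.predict(model, dataloaders=datamodule.predict_dataloader())
        return preds  # parameter aggregation omitted in this sketch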

@cnellington (Collaborator) left a comment

Great work! I think once these comments are addressed we'll be ready to merge.

"""
print(f"\n{type(model)} quicktest")
# get subset of data to use for y_true
def get_xy_pred(dc):
@cnellington (Collaborator) commented Feb 7, 2023:

Tests should align with our use cases, and the datamodule's main function is to hide all the individual datasets from the user. We should use the dataset kwarg in the trainer to specify prediction data, rather than adding functions like get_xy_pred and _get_model. I'd recommend deleting these and using the trainer's dataset kwarg in _quicktest.

@wtt102 (Contributor, author):

That makes sense. The main reason I created _get_model was to handle the various y_true's (which now depend on the predict type, the dataclass type (dl, dm), and the model type (correlation, etc.)). Would we want to add a get_y_true to each trainer along the lines of what you suggest?

I just made a commit that hopefully shows what I mean (my changes to datasets.py weren't syncing before, but they should be updated now too!). At least with the tests we have, everything runs.

The only thing I'm not sure about is whether the y_true values are calculated in a universal way. For instance, if I'm remembering correctly, you had suggested changing the correlation y_true to y_true = np.tile(y_dset[:, :, np.newaxis], (1, 1, self.X.shape[-1] + self.Y.shape[-1])), but would this be the general behavior?
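For reference, a small shape check of that suggested construction (the dimensions below are made up):

import numpy as np

# Made-up dimensions: n samples, x_dim features, y_dim targets.
n, x_dim, y_dim = 10, 5, 3
y_dset = np.random.normal(size=(n, y_dim))

# Tile y along a new trailing axis so y_true has shape (n, y_dim, x_dim + y_dim),
# one copy of each target per regressed dimension.
y_true = np.tile(y_dset[:, :, np.newaxis], (1, 1, x_dim + y_dim))
print(y_true.shape)  # (10, 3, 8)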


@cnellington (Collaborator) left a comment

Great catch on the differences between y_true across all the models! I found one major bug and left comments on a few features that might not be exposed to users.

- Update docs with split fitting/analysis tutorials
- Added support for 1D labels in contextualized regression
@cnellington (Collaborator)

Closing because this code now exists in a new branch, #177.

@cnellington closed this Apr 6, 2023