Integrate datamodules #141
Conversation
Bug fixes for easy modules' predict_params; added early-stopping kwargs to easy modules.
update package version
Release v0.2.4
Overall looks good for this update. Minor comments on style and logic. The tests aren't passing for Markov datamodules because we're testing for convergence on a random X (i.e. we shouldn't expect generalization to held-out sets because there's no function to learn). You might concatenate X and Y features into a single matrix in the markov _quicktests to fix this. So long as we don't see terrible overfitting on the X -> X regression, the X -> Y regression should converge.
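For illustration, a hedged sketch of the suggested concatenation (names and shapes are made up, not the actual quicktest code):

```python
import numpy as np

# Made-up shapes; the real _quicktest signature in the test suite may differ.
n, x_dim, y_dim = 100, 5, 3
C = np.random.normal(size=(n, 4))
X = np.random.normal(size=(n, x_dim))
Y = np.random.normal(size=(n, y_dim))

# Concatenate X and Y into one feature matrix so the Markov quicktest fits a
# self-regression: convergence on X -> X is checkable even when the X -> Y
# mapping is pure noise with no function to learn.
XY = np.concatenate([X, Y], axis=1)
# _quicktest(MarkovModel, C, XY)  # hypothetical call, for illustration only
```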
cnellington left a comment
Great work! Some overall comments:
It seems like there's a major bug in the dataloader indexing that we both overlooked in the last review. Given the tricky logic issues we've seen from trying to make the dataloaders index-based, we need some dataset tests to verify that the dataloaders actually contain all the data we think they do. I'd suggest creating tests where you try to reconstruct the input matrices from the dataloaders. This reversed construction should reveal any bugs we have with indexing.
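One possible shape for such a reconstruction test, sketched for the simplest (non-multitask) dataset; the batch layout (c, x, y, n_idx) and the loader method names are assumptions rather than the actual contextualized API:

```python
import numpy as np

def test_dataloaders_reconstruct_inputs(datamodule, C, X, Y):
    """Check that the train/val/test loaders together cover every input row."""
    seen = np.zeros(len(X), dtype=bool)
    loaders = [datamodule.train_dataloader(),
               datamodule.val_dataloader(),
               datamodule.test_dataloader()]
    for loader in loaders:
        for c, x, y, n_idx in loader:  # assumed batch layout
            for ci, xi, yi, i in zip(c, x, y, n_idx):
                i = int(i)
                # Each loaded sample should match the original input matrices.
                assert np.allclose(ci.numpy(), C[i])
                assert np.allclose(xi.numpy(), X[i])
                assert np.allclose(yi.numpy(), Y[i])
                seen[i] = True
    # If indexing is correct, the splits jointly reconstruct the full dataset.
    assert seen.all(), "some rows of the input matrices never appear in any loader"
```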
def __iter__(self):
    if idx == []:  # use full dataset
        self.idx = range(self.n)
I'm confused about idx. It's used to select from all_samples (which, in the multitask dataset is length n * y_dim, and in the univariate multitask dataset is n * y_dim * x_dim), but is set by default to a range of n. This is almost certainly causing a bug in the multitask datasets right now, where we select n entries from all_samples to place into samples.
The name of the kwarg makes me think this should select samples from C, X, and Y before the all_samples transformation.
If it is used, idx should be set to index the whole dataset as the default kwarg. Currently it is set to [] by default and reset immediately.
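To make the mismatch concrete, a toy illustration with made-up numbers:

```python
# With n samples and y_dim targets, the multitask dataset expands to
# n * y_dim entries in all_samples, but idx defaults to range(n), so only
# the first n expanded samples ever make it into self.samples.
n, y_dim = 100, 3
all_samples = [(i, j) for i in range(n) for j in range(y_dim)]  # 300 entries
idx = range(n)                                                  # 100 indices
samples = [all_samples[i] for i in idx]
print(len(all_samples), len(samples))  # 300 100 -> two thirds of the expanded samples are dropped
```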
Thanks for pointing this out. Before "_generate_sample_ids" was abstracted, each dataset had its own variation of the line self.sample_ids = [self.all_sample_ids[i] for i in self.idx], for instance something like self.sample_ids = [s for s in self.all_samples if s[1] in self.idx] (in the case where the total number of samples is n * y_dim and y_dim=3). At the time this made it easier to get randomization working, because generating random indices for test/train/val happened at the datamodule level, while iterating to create the additional samples (e.g., y_dim * n) happened at the dataset level. I forgot about this when doing the abstraction. I realize this can easily be replaced by specifying the dataset type in the datamodule and using if statements to create the indices over the full range beforehand.
I think the multi-task indexing logic should only live in the actual datasets (not the datamodule). That way we only have to think about sample-level indexing outside of the datasets.
I think it might make more sense to trim the input C, X, Y matrices according to the sample idx in the baseclass __init__, and then create the randomized multi-task sample list using these trimmed matrices and the subclass-specific __len__.
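A minimal sketch of that flow, assuming numpy inputs and illustrative class/method names (not the actual contextualized base class):

```python
import numpy as np

class BaseDataset:
    def __init__(self, C, X, Y, idx=None):
        if idx is None:                      # default: use every sample
            idx = np.arange(len(X))
        # Trim the raw matrices first, so idx always means "row of C/X/Y"...
        self.C, self.X, self.Y = C[idx], X[idx], Y[idx]
        self.n = len(self.X)
        # ...then build the (possibly task-expanded) sample list from the
        # trimmed matrices; each subclass defines its own expansion/__len__.
        self.samples = self._generate_samples()

    def _generate_samples(self):
        raise NotImplementedError

class MultitaskDataset(BaseDataset):
    def __len__(self):
        return self.n * self.Y.shape[-1]     # one sample per (row, target) pair

    def _generate_samples(self):
        return [(i, j) for i in range(self.n) for j in range(self.Y.shape[-1])]
```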
I think this still needs attention. idx still needs to select samples before self.samples is generated. Also let's use None as a default "unspecified" kwarg since [] isn't adding any important functionality.
self._generate_samples()

def _generate_samples(self):
Is this used outside of __iter__? If not, this should all be contained in __iter__.
Might need to be moved to __init__ now.
self.val_dataset = self.dataset(self.C, self.X, self.Y, idx=self.val_idx)
# self.pred_dataset = self.dataset(self.C, self.X, self.Y, idx=self.pred_idx)

def setup(self, stage: str):
Unclear what should be happening here; this method does nothing.
In the example code that torch provides, setup is one of the functions I think they mention as being included so that operations specific to each of the different runs can be defined. Not sure whether this would be useful for our purposes in the future.
I see. Sounds like we should use this to control which dataloader is being referenced for loading samples, which would address some of the other comments.
Sounds like a great idea. In reference to your other comment, pytorch does seem to require a predict_dataloader, so I have the setup function now being used to swap which of the three datasets is used by predict_dataloader. I've so far left the "train/test/val" stages out of the setup function, as they aren't needed yet and I want to minimize holding onto excess variables. Currently troubleshooting some bugs, but I'll let you know when I can update the repo to reflect these changes.
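Roughly, that swap might look like the following LightningDataModule sketch (class, attribute, and argument names are assumptions for illustration, not the actual implementation):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class ContextualizedDataModule(pl.LightningDataModule):
    def __init__(self, train_dataset, val_dataset, test_dataset,
                 predict_split="test", batch_size=32):
        super().__init__()
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.test_dataset = test_dataset
        self.predict_split = predict_split
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Only the predict stage needs special handling for now: pick which of
        # the three existing splits predict_dataloader() should serve.
        if stage == "predict":
            self.pred_dataset = getattr(self, f"{self.predict_split}_dataset")

    def predict_dataloader(self):
        # Lightning expects a predict_dataloader when trainer.predict is used.
        return DataLoader(self.pred_dataset, batch_size=self.batch_size)
```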
self.train_dataset = self.dataset(self.C, self.X, self.Y, idx=self.train_idx)
self.test_dataset = self.dataset(self.C, self.X, self.Y, idx=self.test_idx)
self.val_dataset = self.dataset(self.C, self.X, self.Y, idx=self.val_idx)
# self.pred_dataset = self.dataset(self.C, self.X, self.Y, idx=self.pred_idx)
Do we need a pred dataset? Seems like we would only ever need full, train, test, and val.
"""

- def predict_params(self, model, dataloader):
+ def predict_params(self, model, dataclass):
This might be a good place to specify what set in the datamodule we're trying to predict from.
cnellington left a comment
Great work! I think once these comments are addressed we'll be ready to merge.
contextualized/regression/tests.py (Outdated)
"""
print(f"\n{type(model)} quicktest")
# get subset of data to use for y_true
def get_xy_pred(dc):
Tests should align with our use cases, and the datamodule's main function is to hide all the individual datasets from the user. We should use the dataset kwarg in the trainer to specify prediction data, rather than adding functions like get_xy_pred and _get_model. I'd recommend deleting these and using the trainer dataset kwarg in _quicktest.
That makes sense. The main reason I created get_model was to handle the various y_true values (now dependent on predict type, dataclass type (dl, dm), and model type (correlation, etc.)). Would we want to add a get_y_true to each trainer along the lines of what you suggest?
I just made a commit that hopefully shows what I mean (my changes to datasets.py weren't syncing before, but this should be updated now too!). At least with the tests we have, everything runs.
The only thing I'm not sure about is whether the y_true values are calculated in a universal way. For instance, if I'm remembering correctly, you had suggested changing the y_true of correlation to y_true = np.tile(y_dset[:, :, np.newaxis], (1, 1, self.X.shape[-1] + self.Y.shape[-1])), but would this be the general behavior?
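As a quick shape check of that tiling (dimensions are made up for illustration; the snippet only confirms what shape the suggested expression produces):

```python
import numpy as np

n, x_dim, y_dim = 10, 5, 3
y_dset = np.random.normal(size=(n, y_dim))
# Repeat each target across the x_dim + y_dim entries on the new last axis,
# as in the suggestion quoted above.
y_true = np.tile(y_dset[:, :, np.newaxis], (1, 1, x_dim + y_dim))
print(y_true.shape)  # (10, 3, 8)
```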
cnellington left a comment
Great catch on the differences between y_true in all the models! I found one major bug and left comments on a few features that might not be exposed to users.
Merge Dev to Main
- Update docs with split fitting/analysis tutorials
- Added support for 1D labels in contextualized regression
Closing because this code exists in a new branch now: #177.