
Conversation

@wtt102 (Contributor) commented Nov 6, 2022

Completed

  • Regression now uses map-style datasets
  • Single datamodule in datamodules.py that gets instantiated within each regression lightning module
  • All unit tests run except for markov (not converging); currently troubleshooting this.
  • Same results and speed between the dataloader and datamodule methods

In progress / needed

  • Fix markov not converging
  • Get notmad to instantiate its own datamodule as well (should be straightforward given the current state of the code)
  • Possibly abstract or move the notmad and regression datamodules into a single file (though keeping them separate may be less confusing)
  • Proofread documentation

Bugfixes for easy modules' predict_params; added early-stopping kwargs to the easy modules.
@cnellington (Collaborator) left a comment

Overall looks good for this update. Minor comments on style and logic. The tests aren't passing for Markov datamodules because we're testing for convergence on a random X (i.e. we shouldn't expect generalization to held-out sets because there's no function to learn). You might concatenate X and Y features into a single matrix in the markov _quicktests to fix this. So long as we don't see terrible overfitting on the X -> X regression, the X -> Y regression should converge.
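A minimal sketch of that suggested fix, assuming the quicktest builds its data with numpy (the variable names and the fit call are illustrative, not the repo's actual API):

import numpy as np

# Hypothetical quicktest data: n samples, context C and features X, with Y a function of X.
n, c_dim, x_dim, y_dim = 100, 4, 5, 3
C = np.random.normal(size=(n, c_dim))
X = np.random.normal(size=(n, x_dim))
Y = X @ np.random.normal(size=(x_dim, y_dim))  # Y depends on X, so there is structure to learn

# Concatenate X and Y into one matrix so the markov quicktest fits structure among
# related variables rather than regressing onto an unrelated random target;
# convergence on held-out data is then meaningful.
XY = np.concatenate([X, Y], axis=1)  # shape (n, x_dim + y_dim)
# model.fit(C, XY)  # illustrative: fit the markov model on the combined matrix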

@cnellington (Collaborator) left a comment

Great work! Some overall comments:

It seems like there's a major bug in the dataloader indexing that we both overlooked in the last review. Given the tricky logic issues we've seen from trying to make the dataloaders index-based, we need some dataset tests to verify that the dataloaders actually contain all the data we think they do. I'd suggest creating tests where you try to reconstruct the input matrices from the dataloaders. This reversed construction should reveal any bugs we have with indexing.
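For example, a rough shape such a reconstruction test could take (the batch layout and the per-sample index field are assumptions about the dataset's output, not the actual interface):

import numpy as np

def check_dataloader_covers_X(X, dataloader):
    # Rebuild X row-by-row from the batches and confirm every row appears
    # and matches the original input matrix.
    X_rebuilt = np.zeros_like(X)
    seen = np.zeros(X.shape[0], dtype=bool)
    for c, x, y, idx in dataloader:  # assumed batch layout: (C, X, Y, sample index)
        i = idx.numpy()
        X_rebuilt[i] = x.numpy()
        seen[i] = True
    assert seen.all(), "some rows never appear in the dataloader"
    assert np.allclose(X_rebuilt, X), "reconstructed X does not match the input matrix"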


    def __iter__(self):
        if idx == []:  # use full dataset
            self.idx = range(self.n)
@cnellington (Collaborator):

I'm confused about idx. It's used to select from all_samples (which, in the multitask dataset is length n * y_dim, and in the univariate multitask dataset is n * y_dim * x_dim), but is set by default to a range of n. This is almost certainly causing a bug in the multitask datasets right now, where we select n entries from all_samples to place into samples.

The name of the kwarg makes me think this should select samples from C, X, and Y before the all_samples transformation.

If it is used, idx should be set to index the whole dataset as the default kwarg. Currently it is set to [] by default and reset immediately.
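To make the mismatch concrete (the numbers below are made up):

# With n = 4 rows and y_dim = 3 tasks, all_samples has n * y_dim = 12 entries,
# but the default idx = range(n) only pulls 4 of them into samples.
n, y_dim = 4, 3
all_samples = [(i, t) for i in range(n) for t in range(y_dim)]
idx = range(n)
samples = [all_samples[i] for i in idx]
print(samples)  # [(0, 0), (0, 1), (0, 2), (1, 0)] -- rows 2 and 3 are dropped entirely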

@wtt102 (Contributor, author):

Thanks for pointing this out. Before "_generate_sample_ids" was abstracted, each dataset had its own variation of the line self.sample_ids = [self.all_sample_ids[i] for i in self.idx], for instance something like self.sample_ids = [s for s in self.all_samples if s[1] in self.idx] (for the case where the total number of samples is n * y_dim, with y_dim=3). That made randomization easier at the time, because the random train/val/test indices were generated at the datamodule level, while iterating to create the additional samples (e.g., y_dim * n of them) happened at the dataset level. I forgot about this when doing the abstraction. This can easily be replaced by specifying the dataset type in the datamodule and using if statements to create the indices over the full range beforehand.

@cnellington (Collaborator):

I think the multi-task indexing logic should only live in the actual datasets (not the datamodule). That way we only have to think about sample-level indexing outside of the datasets.

I think it might make more sense to trim the input C, X, Y matrices according to the sample idx in the baseclass __init__, and then create the randomized multi-task sample list using these trimmed matrices and the subclass-specific __len__.
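A rough sketch of that structure, with illustrative names only (the real base class and its sample format will differ):

import numpy as np

class DatasetSketch:
    # Trim C, X, Y by the sample-level idx once, in the base __init__, then let each
    # subclass expand the trimmed matrices into its own (possibly multi-task) sample list.
    def __init__(self, C, X, Y, idx=None):
        if idx is None:  # default: use the whole dataset
            idx = np.arange(len(C))
        self.C, self.X, self.Y = C[idx], X[idx], Y[idx]
        self.n = len(idx)
        self.samples = self._generate_samples()

    def _generate_samples(self):
        # base case: one sample per row; multitask subclasses override this to produce
        # n * y_dim (or n * y_dim * x_dim) entries built from the trimmed matrices
        return [(self.C[i], self.X[i], self.Y[i], i) for i in range(self.n)]

    def __len__(self):
        return len(self.samples)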

@cnellington (Collaborator):

I think this still needs attention. idx still needs to select samples before self.samples is generated. Also let's use None as a default "unspecified" kwarg since [] isn't adding any important functionality.


        self._generate_samples()

    def _generate_samples(self):
@cnellington (Collaborator):

Is this used outside of __iter__? If not, this should all be contained in __iter__.

@cnellington (Collaborator):

Might need to be moved to __init__ now.

        self.val_dataset = self.dataset(self.C, self.X, self.Y, idx=self.val_idx)
        # self.pred_dataset = self.dataset(self.C, self.X, self.Y, idx=self.pred_idx)

    def setup(self, stage: str):
@cnellington (Collaborator):

Unclear what should be happening here; this method currently does nothing.

@wtt102 (Contributor, author):

In the example code that torch provides, setup is one of the functions they include so that operations specific to each of the different runs (stages) have a place to live. I'm not sure whether this will be useful for our purposes in the future.

@cnellington (Collaborator):

I see. Sounds like we should use this to control which dataloader is being referenced for loading samples, which would address some of the other comments.

@wtt102 (Contributor, author):

Sounds like a great idea. In reference to your other comment, pytorch does seem to require a predict_dataloader, so the "setup" function is now used to swap which of the three datasets predict_dataloader uses. I've so far left the "train"/"test"/"val" stages out of the setup function, since they aren't needed yet and I want to minimize holding onto excess variables. I'm currently troubleshooting some bugs, but I'll let you know when I can update the repo to reflect these changes.
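A rough sketch of that arrangement, assuming a PyTorch Lightning datamodule (the predict_on argument and the dataset attributes are illustrative, not the repo's actual names):

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class RegressionDataModuleSketch(pl.LightningDataModule):
    # Illustrative only: setup() decides which split predict_dataloader() serves.
    def __init__(self, train_dataset, val_dataset, test_dataset, batch_size=32):
        super().__init__()
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.test_dataset = test_dataset
        self.batch_size = batch_size
        self.pred_dataset = test_dataset  # default prediction split

    def setup(self, stage=None, predict_on="test"):
        # Swap the dataset that predict_dataloader() returns; the train/val/test stages
        # are left alone because their dataloaders read their datasets directly.
        if stage == "predict":
            self.pred_dataset = {
                "train": self.train_dataset,
                "val": self.val_dataset,
                "test": self.test_dataset,
            }[predict_on]

    def predict_dataloader(self):
        return DataLoader(self.pred_dataset, batch_size=self.batch_size)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size)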

        self.train_dataset = self.dataset(self.C, self.X, self.Y, idx=self.train_idx)
        self.test_dataset = self.dataset(self.C, self.X, self.Y, idx=self.test_idx)
        self.val_dataset = self.dataset(self.C, self.X, self.Y, idx=self.val_idx)
        # self.pred_dataset = self.dataset(self.C, self.X, self.Y, idx=self.pred_idx)
@cnellington (Collaborator):

Do we need a pred dataset? Seems like we would only ever need full, train, test, and val.

"""

-    def predict_params(self, model, dataloader):
+    def predict_params(self, model, dataclass):
@cnellington (Collaborator):

This might be a good place to specify what set in the datamodule we're trying to predict from.
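For instance, something along these lines, where the dataset kwarg and the predict_on hook from the datamodule sketch above are hypothetical rather than the existing trainer API:

import pytorch_lightning as pl

class RegressionTrainerSketch(pl.Trainer):
    # Hypothetical: callers name the datamodule split to predict from, instead of
    # reaching into individual datasets themselves.
    def predict_params(self, model, datamodule, dataset="test"):
        datamodule.setup(stage="predict", predict_on=dataset)  # point predict_dataloader at the split
        preds = self.predict(model, dataloaders=datamodule.predict_dataloader())
        return preds  # parameter aggregation omitted in this sketch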

@cnellington (Collaborator) left a comment

Great work! I think once these comments are addressed we'll be ready to merge.

"""
print(f"\n{type(model)} quicktest")
# get subset of data to use for y_true
def get_xy_pred(dc):
@cnellington (Collaborator) commented Feb 7, 2023:

Tests should align with our use cases, and the datamodule's main function is to hide all the individual datasets from the user. We should use the dataset kwarg in the trainer to specify prediction data, rather than adding functions like get_xy_pred and _get_model. I'd recommend deleting these and using the trainer's dataset kwarg in _quicktest.

@wtt102 (Contributor, author):

That makes sense. The main reason I created _get_model was to handle the various y_true's (which now depend on the predict type, the dataclass type (dl, dm), and the model type (correlation, etc.)). Would we want to add a get_y_true to each trainer along the lines of what you suggest?

I just made a commit that hopefully shows what I mean (my changes to datasets.py weren't syncing before, but they should be updated now too!). At least with the tests we have, everything runs.

The only thing I'm not sure about is whether the y_true values are calculated in a universal way. For instance, if I'm remembering correctly, you had suggested changing the correlation y_true to y_true = np.tile(y_dset[:, :, np.newaxis], (1, 1, self.X.shape[-1] + self.Y.shape[-1])), but would this be the general behavior?
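For reference, a small shape check of that suggested construction (the dimensions below are made up):

import numpy as np

# Made-up dimensions: n samples, x_dim features, y_dim targets.
n, x_dim, y_dim = 10, 5, 3
y_dset = np.random.normal(size=(n, y_dim))

# Tile y along a new trailing axis so y_true has shape (n, y_dim, x_dim + y_dim),
# one copy of each target per regressed dimension.
y_true = np.tile(y_dset[:, :, np.newaxis], (1, 1, x_dim + y_dim))
print(y_true.shape)  # (10, 3, 8)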


@cnellington (Collaborator) left a comment

Great catch on the differences between y_true across all the models! I found one major bug and left comments on a few features that might not be exposed to users.

- Update docs with split fitting/analysis tutorials
- Added support for 1D labels in contextualized regression
@cnellington (Collaborator)

Closing because this code now exists in a new branch, #177.

@cnellington closed this Apr 6, 2023