Member tarasio-mirror left a comment
Thanks for the PR @Priyansi!
I've added some suggestions related to coding style.
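The style the suggestions below follow (one argument per line, a dedented closing parenthesis, trailing commas) is the convention produced by auto-formatters such as black — an assumption here, since the review doesn't name a tool. A minimal sketch of the pattern with a plain stand-in function:

```python
def make_loader(dataset, batch_size, shuffle, num_workers):
    # Stand-in for calls like idist.auto_dataloader in the notebook.
    return (dataset, batch_size, shuffle, num_workers)

# Wrapped call: one argument per line, closing parenthesis on its own line.
loader = make_loader(
    "train_subset",
    batch_size=16,
    shuffle=True,
    num_workers=8,
)
print(loader)  # → ('train_subset', 16, True, 8)
```

The same shape applies to every multi-argument call flagged below.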
Comment on lines +132 to +138
| " trainset = CIFAR10(\n", | ||
| " root=data_dir, train=True, download=True, transform=transform)\n", | ||
| "\n", | ||
| " testset = CIFAR10(\n", | ||
| " root=data_dir, train=False, download=True, transform=transform)\n", | ||
| "\n", | ||
| " return trainset, testset" |
Member
Suggested change
| " trainset = CIFAR10(\n", | |
| " root=data_dir, train=True, download=True, transform=transform)\n", | |
| "\n", | |
| " testset = CIFAR10(\n", | |
| " root=data_dir, train=False, download=True, transform=transform)\n", | |
| "\n", | |
| " return trainset, testset" | |
| " trainset = CIFAR10(\n", | |
| " root=data_dir, train=True, download=True, transform=transform\n", | |
| " )\n", | |
| " testset = CIFAR10(\n", | |
| " root=data_dir, train=False, download=True, transform=transform\n", | |
| " )\n", | |
| " return trainset, testset" |
Comment on lines +150 to +164
| " train_subset, val_subset = random_split(\n", | ||
| " trainset, [test_abs, len(trainset) - test_abs])\n", | ||
| "\n", | ||
| " trainloader = idist.auto_dataloader(\n", | ||
| " train_subset,\n", | ||
| " batch_size=int(config[\"batch_size\"]),\n", | ||
| " shuffle=True,\n", | ||
| " num_workers=8)\n", | ||
| " valloader = idist.auto_dataloader(\n", | ||
| " val_subset,\n", | ||
| " batch_size=int(config[\"batch_size\"]),\n", | ||
| " shuffle=True,\n", | ||
| " num_workers=8)\n", | ||
| " \n", | ||
| " return trainloader, valloader" |
Member
Suggested change
| " train_subset, val_subset = random_split(\n", | |
| " trainset, [test_abs, len(trainset) - test_abs])\n", | |
| "\n", | |
| " trainloader = idist.auto_dataloader(\n", | |
| " train_subset,\n", | |
| " batch_size=int(config[\"batch_size\"]),\n", | |
| " shuffle=True,\n", | |
| " num_workers=8)\n", | |
| " valloader = idist.auto_dataloader(\n", | |
| " val_subset,\n", | |
| " batch_size=int(config[\"batch_size\"]),\n", | |
| " shuffle=True,\n", | |
| " num_workers=8)\n", | |
| " \n", | |
| " return trainloader, valloader" | |
| " train_subset, val_subset = random_split(\n", | |
| " trainset, [test_abs, len(trainset) - test_abs]\n", | |
| " )\n", | |
| " trainloader = idist.auto_dataloader(\n", | |
| " train_subset,\n", | |
| " batch_size=int(config[\"batch_size\"]),\n", | |
| " shuffle=True,\n", | |
| " num_workers=8\n", | |
| " )\n", | |
| " valloader = idist.auto_dataloader(\n", | |
| " val_subset,\n", | |
| " batch_size=int(config[\"batch_size\"]),\n", | |
| " shuffle=True,\n", | |
| " num_workers=8\n", | |
| " )\n", | |
| " return trainloader, valloader" |
Comment on lines +217 to +231
| "def initialize(config, checkpoint_dir):\n", | ||
| " model = idist.auto_model(Net(config[\"l1\"], config[\"l2\"]))\n", | ||
| "\n", | ||
| " device = idist.device()\n", | ||
| "\n", | ||
| " criterion = nn.CrossEntropyLoss()\n", | ||
| " optimizer = idist.auto_optim(optim.SGD(model.parameters(), lr=config[\"lr\"], momentum=0.9))\n", | ||
| "\n", | ||
| " if checkpoint_dir:\n", | ||
| " model_state, optimizer_state = torch.load(\n", | ||
| " os.path.join(checkpoint_dir, \"checkpoint\"))\n", | ||
| " model.load_state_dict(model_state)\n", | ||
| " optimizer.load_state_dict(optimizer_state)\n", | ||
| " \n", | ||
| " return model, device, criterion, optimizer" |
Member
Suggested change
| "def initialize(config, checkpoint_dir):\n", | |
| " model = idist.auto_model(Net(config[\"l1\"], config[\"l2\"]))\n", | |
| "\n", | |
| " device = idist.device()\n", | |
| "\n", | |
| " criterion = nn.CrossEntropyLoss()\n", | |
| " optimizer = idist.auto_optim(optim.SGD(model.parameters(), lr=config[\"lr\"], momentum=0.9))\n", | |
| "\n", | |
| " if checkpoint_dir:\n", | |
| " model_state, optimizer_state = torch.load(\n", | |
| " os.path.join(checkpoint_dir, \"checkpoint\"))\n", | |
| " model.load_state_dict(model_state)\n", | |
| " optimizer.load_state_dict(optimizer_state)\n", | |
| " \n", | |
| " return model, device, criterion, optimizer" | |
| "def initialize(config, checkpoint_dir):\n", | |
| " model = idist.auto_model(Net(config[\"l1\"], config[\"l2\"]))\n", | |
| "\n", | |
| " device = idist.device()\n", | |
| "\n", | |
| " criterion = nn.CrossEntropyLoss()\n", | |
| " optimizer = idist.auto_optim(\n", | |
| " optim.SGD(model.parameters(), lr=config[\"lr\"], momentum=0.9)\n", | |
| " )\n", | |
| "\n", | |
| " if checkpoint_dir:\n", | |
| " model_state, optimizer_state = torch.load(\n", | |
| " os.path.join(checkpoint_dir, \"checkpoint\")\n", | |
| " )\n", | |
| " model.load_state_dict(model_state)\n", | |
| " optimizer.load_state_dict(optimizer_state)\n", | |
| "\n", | |
| " return model, device, criterion, optimizer" |
Comment on lines +265 to +293
| "def train_cifar(config, data_dir=None, checkpoint_dir=None):\n", | ||
| " trainloader, valloader = get_train_val_loaders(config, data_dir)\n", | ||
| " model, device, criterion, optimizer = initialize(config, checkpoint_dir)\n", | ||
| " \n", | ||
| " trainer = create_supervised_trainer(model, optimizer, criterion, device=device, non_blocking=True)\n", | ||
| " \n", | ||
| " avg_output = RunningAverage(output_transform=lambda x: x)\n", | ||
| " avg_output.attach(trainer, 'running_avg_loss')\n", | ||
| " \n", | ||
| " val_evaluator = create_supervised_evaluator(model, metrics={ \"accuracy\": Accuracy(), \"loss\": Loss(criterion)}, device=device, non_blocking=True)\n", | ||
| " \n", | ||
| " @trainer.on(Events.ITERATION_COMPLETED(every=2000))\n", | ||
| " def log_training_loss(engine):\n", | ||
| " print(f\"Epoch[{engine.state.epoch}], Iter[{engine.state.iteration}] Loss: {engine.state.output:.2f} Running Avg Loss: {engine.state.metrics['running_avg_loss']:.2f}\")\n", | ||
| "\n", | ||
| "\n", | ||
| " @trainer.on(Events.EPOCH_COMPLETED)\n", | ||
| " def log_validation_results(trainer):\n", | ||
| " val_evaluator.run(valloader)\n", | ||
| " metrics = val_evaluator.state.metrics\n", | ||
| " print(f\"Validation Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}\")\n", | ||
| "\n", | ||
| " with tune.checkpoint_dir(trainer.state.epoch) as checkpoint_dir:\n", | ||
| " path = os.path.join(checkpoint_dir, \"checkpoint\")\n", | ||
| " torch.save((model.state_dict(), optimizer.state_dict()), path)\n", | ||
| " \n", | ||
| " tune.report(loss=metrics['loss'], accuracy=metrics['accuracy']) \n", | ||
| "\n", | ||
| " trainer.run(trainloader, max_epochs=10) " |
Member
Suggested change
| "def train_cifar(config, data_dir=None, checkpoint_dir=None):\n", | |
| " trainloader, valloader = get_train_val_loaders(config, data_dir)\n", | |
| " model, device, criterion, optimizer = initialize(config, checkpoint_dir)\n", | |
| " \n", | |
| " trainer = create_supervised_trainer(model, optimizer, criterion, device=device, non_blocking=True)\n", | |
| " \n", | |
| " avg_output = RunningAverage(output_transform=lambda x: x)\n", | |
| " avg_output.attach(trainer, 'running_avg_loss')\n", | |
| " \n", | |
| " val_evaluator = create_supervised_evaluator(model, metrics={ \"accuracy\": Accuracy(), \"loss\": Loss(criterion)}, device=device, non_blocking=True)\n", | |
| " \n", | |
| " @trainer.on(Events.ITERATION_COMPLETED(every=2000))\n", | |
| " def log_training_loss(engine):\n", | |
| " print(f\"Epoch[{engine.state.epoch}], Iter[{engine.state.iteration}] Loss: {engine.state.output:.2f} Running Avg Loss: {engine.state.metrics['running_avg_loss']:.2f}\")\n", | |
| "\n", | |
| "\n", | |
| " @trainer.on(Events.EPOCH_COMPLETED)\n", | |
| " def log_validation_results(trainer):\n", | |
| " val_evaluator.run(valloader)\n", | |
| " metrics = val_evaluator.state.metrics\n", | |
| " print(f\"Validation Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}\")\n", | |
| "\n", | |
| " with tune.checkpoint_dir(trainer.state.epoch) as checkpoint_dir:\n", | |
| " path = os.path.join(checkpoint_dir, \"checkpoint\")\n", | |
| " torch.save((model.state_dict(), optimizer.state_dict()), path)\n", | |
| " \n", | |
| " tune.report(loss=metrics['loss'], accuracy=metrics['accuracy']) \n", | |
| "\n", | |
| " trainer.run(trainloader, max_epochs=10) " | |
| "def train_cifar(config, data_dir=None, checkpoint_dir=None):\n", | |
| " trainloader, valloader = get_train_val_loaders(config, data_dir)\n", | |
| " model, device, criterion, optimizer = initialize(config, checkpoint_dir)\n", | |
| "\n", | |
| " trainer = create_supervised_trainer(\n", | |
| " model, optimizer, criterion, device=device, non_blocking=True\n", | |
| " )\n", | |
| "\n", | |
| " avg_output = RunningAverage(output_transform=lambda x: x)\n", | |
| " avg_output.attach(trainer, \"running_avg_loss\")\n", | |
| "\n", | |
| " val_evaluator = create_supervised_evaluator(\n", | |
| " model,\n", | |
| " metrics={\"accuracy\": Accuracy(), \"loss\": Loss(criterion)},\n", | |
| " device=device,\n", | |
| " non_blocking=True,\n", | |
| " )\n", | |
| "\n", | |
| " @trainer.on(Events.ITERATION_COMPLETED(every=2000))\n", | |
| " def log_training_loss(engine):\n", | |
| " print(\n", | |
| " f\"Epoch[{engine.state.epoch}], Iter[{engine.state.iteration}] Loss: {engine.state.output:.2f} Running Avg Loss: {engine.state.metrics['running_avg_loss']:.2f}\"\n", | |
| " )\n", | |
| "\n", | |
| " @trainer.on(Events.EPOCH_COMPLETED)\n", | |
| " def log_validation_results(trainer):\n", | |
| " val_evaluator.run(valloader)\n", | |
| " metrics = val_evaluator.state.metrics\n", | |
| " print(\n", | |
| " f\"Validation Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}\"\n", | |
| " )\n", | |
| "\n", | |
| " with tune.checkpoint_dir(trainer.state.epoch) as checkpoint_dir:\n", | |
| " path = os.path.join(checkpoint_dir, \"checkpoint\")\n", | |
| " torch.save((model.state_dict(), optimizer.state_dict()), path)\n", | |
| "\n", | |
| " tune.report(loss=metrics[\"loss\"], accuracy=metrics[\"accuracy\"])\n", | |
| "\n", | |
| " trainer.run(trainloader, max_epochs=10)" |
Comment on lines +311 to +327
| "def test_best_model(best_trial, data_dir=None):\n", | ||
| " _, testset = load_data(data_dir)\n", | ||
| " \n", | ||
| " best_trained_model = idist.auto_model(Net(best_trial.config[\"l1\"], best_trial.config[\"l2\"]))\n", | ||
| " device = idist.device()\n", | ||
| "\n", | ||
| " best_checkpoint_dir = best_trial.checkpoint.value\n", | ||
| " model_state, optimizer_state = torch.load(os.path.join(\n", | ||
| " best_checkpoint_dir, \"checkpoint\"))\n", | ||
| " best_trained_model.load_state_dict(model_state)\n", | ||
| "\n", | ||
| " test_evaluator = create_supervised_evaluator(best_trained_model, metrics={\"Accuracy\": Accuracy()}, device=device, non_blocking=True)\n", | ||
| "\n", | ||
| " testloader = idist.auto_dataloader(testset, batch_size=4, shuffle=False, num_workers=2)\n", | ||
| "\n", | ||
| " test_evaluator.run(testloader)\n", | ||
| " print(f\"Best trial test set accuracy: {test_evaluator.state.metrics}\")" |
Member
Suggested change
| "def test_best_model(best_trial, data_dir=None):\n", | |
| " _, testset = load_data(data_dir)\n", | |
| " \n", | |
| " best_trained_model = idist.auto_model(Net(best_trial.config[\"l1\"], best_trial.config[\"l2\"]))\n", | |
| " device = idist.device()\n", | |
| "\n", | |
| " best_checkpoint_dir = best_trial.checkpoint.value\n", | |
| " model_state, optimizer_state = torch.load(os.path.join(\n", | |
| " best_checkpoint_dir, \"checkpoint\"))\n", | |
| " best_trained_model.load_state_dict(model_state)\n", | |
| "\n", | |
| " test_evaluator = create_supervised_evaluator(best_trained_model, metrics={\"Accuracy\": Accuracy()}, device=device, non_blocking=True)\n", | |
| "\n", | |
| " testloader = idist.auto_dataloader(testset, batch_size=4, shuffle=False, num_workers=2)\n", | |
| "\n", | |
| " test_evaluator.run(testloader)\n", | |
| " print(f\"Best trial test set accuracy: {test_evaluator.state.metrics}\")" | |
| "def test_best_model(best_trial, data_dir=None):\n", | |
| " _, testset = load_data(data_dir)\n", | |
| "\n", | |
| " best_trained_model = idist.auto_model(\n", | |
| " Net(best_trial.config[\"l1\"], best_trial.config[\"l2\"])\n", | |
| " )\n", | |
| " device = idist.device()\n", | |
| "\n", | |
| " best_checkpoint_dir = best_trial.checkpoint.value\n", | |
| " model_state, optimizer_state = torch.load(\n", | |
| " os.path.join(best_checkpoint_dir, \"checkpoint\")\n", | |
| " )\n", | |
| " best_trained_model.load_state_dict(model_state)\n", | |
| "\n", | |
| " test_evaluator = create_supervised_evaluator(\n", | |
| " best_trained_model,\n", | |
| " metrics={\"Accuracy\": Accuracy()},\n", | |
| " device=device,\n", | |
| " non_blocking=True,\n", | |
| " )\n", | |
| "\n", | |
| " testloader = idist.auto_dataloader(\n", | |
| " testset, batch_size=4, shuffle=False, num_workers=2\n", | |
| " )\n", | |
| "\n", | |
| " test_evaluator.run(testloader)\n", | |
| " print(f\"Best trial test set accuracy: {test_evaluator.state.metrics}\")" |
Comment on lines +358 to +389
| "def main(num_samples=10, max_num_epochs=10, gpus_per_trial=1):\n", | ||
| " data_dir = os.path.abspath(\"./data\")\n", | ||
| " load_data(data_dir)\n", | ||
| " \n", | ||
| " config = {\n", | ||
| " \"l1\": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),\n", | ||
| " \"l2\": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),\n", | ||
| " \"lr\": tune.loguniform(1e-4, 1e-1),\n", | ||
| " \"batch_size\": tune.choice([2, 4, 8, 16])\n", | ||
| " }\n", | ||
| " scheduler = ASHAScheduler(\n", | ||
| " metric=\"loss\",\n", | ||
| " mode=\"min\",\n", | ||
| " max_t=max_num_epochs,\n", | ||
| " grace_period=1,\n", | ||
| " reduction_factor=2)\n", | ||
| " reporter = CLIReporter(\n", | ||
| " metric_columns=[\"loss\", \"accuracy\", \"training_iteration\"])\n", | ||
| " result = tune.run(\n", | ||
| " partial(train_cifar, data_dir=data_dir),\n", | ||
| " resources_per_trial={\"cpu\": 2, \"gpu\": gpus_per_trial},\n", | ||
| " config=config,\n", | ||
| " num_samples=num_samples,\n", | ||
| " scheduler=scheduler,\n", | ||
| " progress_reporter=reporter)\n", | ||
| "\n", | ||
| " best_trial = result.get_best_trial(\"loss\", \"min\", \"last\")\n", | ||
| " print(f\"Best trial config: {best_trial.config}\")\n", | ||
| " print(f\"Best trial final validation loss: {best_trial.last_result['loss']}\")\n", | ||
| " print(f\"Best trial final validation accuracy: {best_trial.last_result['accuracy']}\")\n", | ||
| " \n", | ||
| " test_best_model(best_trial, data_dir)" |
Member
Suggested change
| "def main(num_samples=10, max_num_epochs=10, gpus_per_trial=1):\n", | |
| " data_dir = os.path.abspath(\"./data\")\n", | |
| " load_data(data_dir)\n", | |
| " \n", | |
| " config = {\n", | |
| " \"l1\": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),\n", | |
| " \"l2\": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),\n", | |
| " \"lr\": tune.loguniform(1e-4, 1e-1),\n", | |
| " \"batch_size\": tune.choice([2, 4, 8, 16])\n", | |
| " }\n", | |
| " scheduler = ASHAScheduler(\n", | |
| " metric=\"loss\",\n", | |
| " mode=\"min\",\n", | |
| " max_t=max_num_epochs,\n", | |
| " grace_period=1,\n", | |
| " reduction_factor=2)\n", | |
| " reporter = CLIReporter(\n", | |
| " metric_columns=[\"loss\", \"accuracy\", \"training_iteration\"])\n", | |
| " result = tune.run(\n", | |
| " partial(train_cifar, data_dir=data_dir),\n", | |
| " resources_per_trial={\"cpu\": 2, \"gpu\": gpus_per_trial},\n", | |
| " config=config,\n", | |
| " num_samples=num_samples,\n", | |
| " scheduler=scheduler,\n", | |
| " progress_reporter=reporter)\n", | |
| "\n", | |
| " best_trial = result.get_best_trial(\"loss\", \"min\", \"last\")\n", | |
| " print(f\"Best trial config: {best_trial.config}\")\n", | |
| " print(f\"Best trial final validation loss: {best_trial.last_result['loss']}\")\n", | |
| " print(f\"Best trial final validation accuracy: {best_trial.last_result['accuracy']}\")\n", | |
| " \n", | |
| " test_best_model(best_trial, data_dir)" | |
| "def main(num_samples=10, max_num_epochs=10, gpus_per_trial=1):\n", | |
| " data_dir = os.path.abspath(\"./data\")\n", | |
| " load_data(data_dir)\n", | |
| "\n", | |
| " config = {\n", | |
| " \"l1\": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),\n", | |
| " \"l2\": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),\n", | |
| " \"lr\": tune.loguniform(1e-4, 1e-1),\n", | |
| " \"batch_size\": tune.choice([2, 4, 8, 16]),\n", | |
| " }\n", | |
| " scheduler = ASHAScheduler(\n", | |
| " metric=\"loss\",\n", | |
| " mode=\"min\",\n", | |
| " max_t=max_num_epochs,\n", | |
| " grace_period=1,\n", | |
| " reduction_factor=2,\n", | |
| " )\n", | |
| " reporter = CLIReporter(metric_columns=[\"loss\", \"accuracy\", \"training_iteration\"])\n", | |
| " result = tune.run(\n", | |
| " partial(train_cifar, data_dir=data_dir),\n", | |
| " resources_per_trial={\"cpu\": 2, \"gpu\": gpus_per_trial},\n", | |
| " config=config,\n", | |
| " num_samples=num_samples,\n", | |
| " scheduler=scheduler,\n", | |
| " progress_reporter=reporter,\n", | |
| " )\n", | |
| "\n", | |
| " best_trial = result.get_best_trial(\"loss\", \"min\", \"last\")\n", | |
| " print(f\"Best trial config: {best_trial.config}\")\n", | |
| " print(f\"Best trial final validation loss: {best_trial.last_result['loss']}\")\n", | |
| " print(f\"Best trial final validation accuracy: {best_trial.last_result['accuracy']}\")\n", | |
| "\n", | |
| " test_best_model(best_trial, data_dir)" |
Member
What about adding some summarizing sentences about the best trial and how to interpret the results?
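One possible shape for such a summary sentence, sketched here as an illustration: the field names mirror the keys the notebook passes to `tune.report`, but the values and the `last_result` dict itself are invented for the example.

```python
# Hypothetical best_trial.last_result contents; the keys match what the
# notebook reports via tune.report(...), the values are made up.
last_result = {"loss": 1.23, "accuracy": 0.55, "training_iteration": 10}

summary = (
    f"Best trial reached validation loss {last_result['loss']:.2f} "
    f"and accuracy {last_result['accuracy']:.0%} "
    f"after {last_result['training_iteration']} training iterations."
)
print(summary)
```

A sentence like this after the `main` cell would tell readers which numbers matter and how to read them.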
| "For every trial, Ray Tune will randomly sample a combination of parameters from these search spaces. It will then train a number of models in parallel and find the best performing one among these. \n", | ||
| "We also use the `ASHAScheduler()` which is one of the trial schedulers that aggressively terminate low-performing trials.\n", | ||
| "Apart from that, we leverage the `CLIReporter()` to prettify our outputs.\n", | ||
| "And then, we wrap `train_cifar` in functools.partial and pass it to `tune.run` along with other resources like the CPUs and GPUs available to use, the configurable parameters, the number of trials, scheduler and reporter.\n", |
Member
nit
Suggested change
| "And then, we wrap `train_cifar` in functools.partial and pass it to `tune.run` along with other resources like the CPUs and GPUs available to use, the configurable parameters, the number of trials, scheduler and reporter.\n", | |
| "And then, we wrap `train_cifar` in `functools.partial` and pass it to `tune.run` along with other resources like the CPUs and GPUs available to use, the configurable parameters, the number of trials, scheduler and reporter.\n", |
| "id": "vJgTaKWU8Doq" | ||
| }, | ||
| "source": [ | ||
| "In this tutorial, we will see how [Ray Tune](https://docs.ray.io/en/stable/tune.html) can be used with Ignite for hyperparameter tuning. We will also compare it with other frameworks like [Optuna](https://optuna.org/) and [Ax](https://ax.dev/) for hyperparameter optimization.\n", |
Member
Are we going to add comparisons with Ax and Optuna?
Fixes #29