2 changes: 1 addition & 1 deletion examples/dlns.ipynb
Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@
"source": [
"# Deep Linear Networks\n",
"\n",
"Warning: This notebook is currently functional, but does not show actually reproduce the results it's attempting to. Use only as inspiration, not as gospel. For more well-calibrated LLC estimates of a simple linear network, see tests/slt/rrr_test.py.\n",
"Warning: This notebook is currently functional, but does not actually reproduce the results it's attempting to. Use only as inspiration, not as gospel. For more well-calibrated LLC estimates of a simple linear network, see tests/slt/rrr_test.py.\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timaeus-research/devinterp/blob/main/examples/dlns.ipynb)\n",
"\n",
2 changes: 1 addition & 1 deletion examples/epsilon_beta.ipynb
@@ -9,7 +9,7 @@
"\n",
"This notebook aims to visualize $\\hat{\\lambda}_n^\\beta$ for various values of $\\beta$ (inverse temperature) and $\\epsilon$ (step size). \n",
"\n",
"Roughly (per Adrian Xu's master thesis):\n",
"Roughly (per Adrian Xu's master's thesis):\n",
"- $\\beta$ can be tuned via graphing $\\hat{\\lambda}_n^\\beta$ for a sweep of $\\beta$, and using $\\beta$ in a range around the critical points on the graph.\n",
"- $\\epsilon$ should be the greatest possible value that doesn't cause excessive numerical instability or cause the SGLD chains to fail to converge. An MALA proposal acceptance rate (see `sgld_calibration.ipynb`) between 0.9 and 0.95 is roughly optimal."
]
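The quantity $\hat{\lambda}_n^\beta$ being swept in this notebook is the mean sampled loss in excess of the initial loss, scaled by the effective inverse temperature $n\beta$. A minimal sketch with made-up loss values (`llc_estimate` is a hypothetical helper for illustration, not a devinterp API):

```python
import numpy as np

def llc_estimate(sampled_losses, init_loss, n, beta):
    # WBIC-style estimator: scale the excess of the mean sampled loss
    # over the initial loss by the effective inverse temperature n * beta.
    return float(n * beta * (np.mean(sampled_losses) - init_loss))

# Toy numbers: a dataset of n = 1000 points, beta = 1 / log(n) (the
# standard choice), and a few losses observed along an SGLD chain.
n = 1000
beta = 1.0 / np.log(n)
chain_losses = [0.52, 0.55, 0.54, 0.53]
print(llc_estimate(chain_losses, init_loss=0.50, n=n, beta=beta))  # ≈ 5.07
```

Sweeping $\beta$ and $\epsilon$ then amounts to recomputing this quantity for each hyperparameter pair and looking at how it moves.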
6 changes: 3 additions & 3 deletions examples/grokking.ipynb
@@ -19,7 +19,7 @@
"source": [
"This notebook aims to show how LLC estimation is calibrated in a simple modular addition grokking example, showing a moderately interesting result at the end.\n",
"\n",
"We'll starting off with some standard grokking code, adapted loosely from Nina Panickssery and Dmitry Vaintrob's [modular addition learning coefficient post](https://www.alignmentforum.org/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition) and [github code repo](https://github.com/nrimsky/devinterp). (Thank you for your help!)"
"We'll start off with some standard grokking code, adapted loosely from Nina Panickssery and Dmitry Vaintrob's [modular addition learning coefficient post](https://www.alignmentforum.org/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition) and [github code repo](https://github.com/nrimsky/devinterp). (Thank you for your help!)"
]
},
{
@@ -426,7 +426,7 @@
"id": "bP4h1UJEeP94"
},
"source": [
"In order to get LLC estimates for this simple grokking model over training, we first need to choose hyperparameters. The most important ones to calibrate are epsilon (the SGLD learning rate / step size) and n\\*beta (the effective inverse temperature). Let's run a quick sweep over a wide range of epsilon and n\\*beta, and look for a range of values within this which shows little change in LLC change in LLC values when we change epsilon and nbeta. We can use `devinterp.vis_utils.EpsilonBetaAnalyzer` for this."
"In order to get LLC estimates for this simple grokking model over training, we first need to choose hyperparameters. The most important ones to calibrate are epsilon (the SGLD learning rate / step size) and n\\*beta (the effective inverse temperature). Let's run a quick sweep over a wide range of epsilon and n\\*beta, and look for a range of values within this which shows little change in LLC values when we change epsilon and nbeta. We can use `devinterp.vis_utils.EpsilonBetaAnalyzer` for this."
]
},
{
@@ -650,7 +650,7 @@
"id": "DlGfy4ZAeP-C"
},
"source": [
"From this, we can see that the effective sampled loss for low-ish nbetas (<100) shows very little dependence on the exact choice of nbeta. So let's a point in this flat region (~1), and a high-but-still-in-the-flat-region epsilon (0.03), so we don't need to run many draws, but still have little dependence of our samples on epsilon.\n",
"From this, we can see that the effective sampled loss for low-ish nbetas (<100) shows very little dependence on the exact choice of nbeta. So let's pick a point in this flat region (~1), and a high-but-still-in-the-flat-region epsilon (0.003), so we don't need to run many draws, but still have little dependence of our samples on epsilon.\n",
"\n",
"Let's check that the loss chain for these hyperparams looks decent, and then run LLC estimation on all trained checkpoints if it does."
]
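The "flat region" selection described in this notebook can be mimicked numerically. A minimal sketch with made-up sweep numbers (in practice the grid would come from `EpsilonBetaAnalyzer`; the flatness threshold of 0.5 is an arbitrary choice for these toy values):

```python
import numpy as np

# Hypothetical sweep output: llc[i, j] is the LLC estimate obtained with
# step size epsilons[i] and effective inverse temperature nbetas[j].
epsilons = np.array([3e-4, 1e-3, 3e-3, 1e-2])
nbetas = np.array([0.3, 1.0, 3.0, 10.0])
llc = np.array([
    [4.1, 4.2, 4.3, 4.9],
    [4.0, 4.1, 4.2, 5.0],
    [4.1, 4.2, 4.3, 5.2],
    [9.7, 11.3, 14.8, 30.1],  # step size too large: estimates blow up
])

def spread(i, j):
    # Max minus min of the estimate over the 3x3 neighborhood; small
    # spread means the estimate barely moves when eps or nbeta changes.
    neigh = llc[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
    return float(neigh.max() - neigh.min())

scores = np.array([[spread(i, j) for j in range(llc.shape[1])]
                   for i in range(llc.shape[0])])
flat = scores < 0.5  # threshold chosen by eye for these toy numbers

# Prefer the largest epsilon that still sits in the flat region, so that
# fewer draws are needed per chain.
i, j = max(zip(*np.nonzero(flat)), key=lambda ij: epsilons[ij[0]])
print(f"pick eps={epsilons[i]:.0e}, nbeta={nbetas[j]}")
```

With these toy numbers the pick lands at the largest step size whose neighborhood of estimates is still stable, which is the same logic applied by eye in the notebook.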
10 changes: 5 additions & 5 deletions examples/introduction.ipynb
@@ -121,9 +121,9 @@
"\n",
"## Local Learning Coefficients\n",
"\n",
"The first method we have have online is local learning coefficient estimation ([Lau et al. 2023](https://arxiv.org/abs/2308.12108)). \n",
"The first method we have online is local learning coefficient estimation ([Lau et al. 2023](https://arxiv.org/abs/2308.12108)). \n",
"\n",
"For an in-depth explaination, see [this post](https://www.lesswrong.com/posts/6g8cAftfQufLmFDYT/you-re-counting-your-parameters-wrong). The short version is that: \n",
"For an in-depth explanation, see [this post](https://www.lesswrong.com/posts/6g8cAftfQufLmFDYT/you-re-counting-your-parameters-wrong). The short version is that: \n",
"- The (local) learning coefficient $\\hat\\lambda$ is the \"correct\" measure of model complexity. Besides the loss, it's the most principled high-level way to compare models.\n",
"- We can cheaply estimate the learning coefficient associated to a choice of weights $\\hat w^*$ by using the following formula:\n",
"\n",
@@ -354,7 +354,7 @@
"\n",
"Below you'll see what's actually happening when you run `local_learning_coefficients`.\n",
"\n",
"We sample 10 different chains, with the same starting positions but different batch schedules and noise realizations at each step. For each of these chains, we take 200 steps using SGLD. We observe the loss at each of these points. At the end, we average the loss across chains, compare it to the initial loss, and apply a correction that depends on the dataset size to get the local learning coefficient. \n",
"We sample 3 different chains, with the same starting positions but different batch schedules and noise realizations at each step. For each of these chains, we take 100 steps using SGLD. We observe the loss at each of these points. At the end, we average the loss across chains, compare it to the initial loss, and apply a correction that depends on the dataset size to get the local learning coefficient. \n",
"\n",
"For a healthy chain, the Loss Trace should increase rapidly at first and then level off."
]
@@ -406,7 +406,7 @@
"- [`mnist.ipynb`](../examples/mnist.ipynb) showing how we can use LLC estimation to assess relative LLCs of MNIST models trained with different optimizers.\n",
"- [`sgld_calibration.ipynb`](../examples/sgld_calibration.ipynb) shows how to gain confidence in applying SGLD-based LLC estimation to a model with unknown LLC.\n",
"- [`diagnostics.ipynb`](../examples/diagnostics.ipynb) shows how to use callbacks to diagnose if your sampling is going well.\n",
"- [`epsilon_beta.ipynb`](../examples/epsilon_beta.ipynb) shows how to use use a callback to calibrate SGLD hyperparameters.\n",
"- [`epsilon_beta.ipynb`](../examples/epsilon_beta.ipynb) shows how to use a callback to calibrate SGLD hyperparameters.\n",
"\n",
"For a small demo of how to use the library to study grokking, see [`grokking.ipynb`](../examples/grokking.ipynb)."
]
@@ -432,7 +432,7 @@
"- **Progress measures**. If you have an understanding of some structure at the end of training, you can roll that understanding backwards to track how that structure develops over time. \n",
"- **Probes**. Similarly, you can train a linear probe from activations onto features, then roll that probe back to previous checkpoints to measure how those features are learned. \n",
"- **Gradients**. Just look at the gradients! \n",
"- **Evals**. You can measure performance on a targeted benchmarks to track when the model learns the associated capabilities. \n",
"- **Evals**. You can measure performance on targeted benchmarks to track when the model learns the associated capabilities. \n",
"- **Covariance estimators**. That's a secret for now. More coming soon!"
]
}
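The procedure described in the changed cell — several SGLD chains with the same start but different noise, losses averaged and compared to the initial loss, scaled by a dataset-size-dependent factor — can be sketched end to end on a toy problem. Everything here is illustrative: the quadratic stands in for a real model's loss, and `sgld_llc` is a hypothetical helper, not the library's `local_learning_coefficients`:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Toy quadratic loss standing in for a trained model's loss landscape.
    return 0.5 * float(w @ w)

def grad(w):
    return w  # gradient of the quadratic above

def sgld_llc(w_star, n=1000, chains=3, steps=100, eps=1e-4, gamma=100.0):
    beta = 1.0 / np.log(n)  # standard effective inverse temperature
    init = loss(w_star)
    sampled = []
    for _ in range(chains):
        w = w_star.copy()  # same start; noise differs chain to chain
        for _ in range(steps):
            noise = rng.normal(size=w.shape)
            # SGLD step with a localization term gamma * (w - w*)
            w = (w - (eps / 2) * (n * beta * grad(w) + gamma * (w - w_star))
                 + np.sqrt(eps) * noise)
            sampled.append(loss(w))
    # Average sampled loss vs. initial loss, scaled by n * beta.
    return n * beta * (np.mean(sampled) - init)

print(sgld_llc(np.zeros(5)))
```

For a healthy chain, `loss(w)` along the inner loop rises quickly from `init` and then levels off, matching the loss-trace behavior the notebook describes.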
4 changes: 2 additions & 2 deletions examples/mnist.ipynb
@@ -182,7 +182,7 @@
"metadata": {},
"outputs": [],
"source": [
"def emtpy_func():\n",
"def empty_func():\n",
" return (), ()\n",
"\n",
"\n",
@@ -198,7 +198,7 @@
" if model_key == \"sgd\":\n",
" optimizer.step()\n",
" else:\n",
" optimizer.step(emtpy_func, model, criterion)\n",
" optimizer.step(empty_func, model, criterion)\n",
" return train_loss / len(train_loader)\n",
"\n",
"\n",
16 changes: 8 additions & 8 deletions examples/sgld_calibration.ipynb
@@ -818,7 +818,7 @@
"id": "9d7e0ef8",
"metadata": {},
"source": [
"Judging by this, $\\epsilon = 1e^{-2}$ is out (did not converge), and a $\\gamma$ of $1$ is too low. A higher MALA acceptance prob would be better (ideally we'd aim for $.9$) but that might not be possible for this model. The higher learning rate is generally preferred, but we have to be careful not to get a thermalization peak at the start of sampling. Let's take a look at the loss curves next to check if our sampling works as expected."
"Judging by this, $\\epsilon = 10^{-2}$ is out (did not converge), and a $\\gamma$ of $1$ is too low. A higher MALA acceptance prob would be better (ideally we'd aim for $.9$) but that might not be possible for this model. The higher learning rate is generally preferred, but we have to be careful not to get a thermalization peak at the start of sampling. Let's take a look at the loss curves next to check if our sampling works as expected."
]
},
{
@@ -926,7 +926,7 @@
"id": "cf876f0b-ab5f-4601-8b21-c2127882b09f",
"metadata": {},
"source": [
"Let's try running more samples on the $\\epsilon=1e^{-4},\\ \\gamma=100$ case to see if it flattens out."
"Let's try running more samples on the $\\epsilon=10^{-4},\\ \\gamma=100$ case to see if it flattens out."
]
},
{
@@ -1003,7 +1003,7 @@
"id": "1b0fc954-8101-46b9-9d7d-a30f7dd93884",
"metadata": {},
"source": [
"### 5. Heuristics for selecting $\\epsilon$ and $\\gamma$"
"### 4. Heuristics for selecting $\\epsilon$ and $\\gamma$"
]
},
{
@@ -1028,18 +1028,18 @@
"id": "20177ce1-96f0-4b7c-ae5b-60aa579e0457",
"metadata": {},
"source": [
"### 6. Selecting $\\epsilon$ and $\\gamma$ in this MNIST example"
"### 5. Selecting $\\epsilon$ and $\\gamma$ in this MNIST example"
]
},
{
"cell_type": "markdown",
"id": "7c6b07e8-83be-4a14-a66e-7ae27d5ea902",
"metadata": {},
"source": [
"- $\\epsilon=1e^{-2}$ is definitely a no-go, since the values quickly diverge to NaN.\n",
"- $\\epsilon=1e^{-3}$ causes a big spike in the initial LLC estimation. This should be avoided.\n",
"- With enough draws, the LLC estimation for $\\epsilon=1e^{-4}, \\ \\gamma=1$ looks like it will converge nicely. The loss traces are basically flattened out as well, which is another indication that the LLC estimation should continue to converge without issue. In this first sweep, this would be my recommendation for hyperparameters.\n",
"- If more refinement is needed (e.g. it's necessary for LLC estimation to converge in a fewer number of draws), then another option would be to sweep with more granular values (say, a half order of magnitude) around $\\epsilon=1e^{-4}, \\ \\gamma=1$"
"- $\\epsilon=10^{-2}$ is definitely a no-go, since the values quickly diverge to NaN.\n",
"- $\\epsilon=10^{-3}$ causes a big spike in the initial LLC estimation. This should be avoided.\n",
"- With enough draws, the LLC estimation for $\\epsilon=10^{-4}, \\ \\gamma=1$ looks like it will converge nicely. The loss traces are basically flattened out as well, which is another indication that the LLC estimation should continue to converge without issue. In this first sweep, this would be my recommendation for hyperparameters.\n",
"- If more refinement is needed (e.g. it's necessary for LLC estimation to converge in fewer draws), then another option would be to sweep with more granular values (say, a half order of magnitude) around $\epsilon=10^{-4}, \ \gamma=1$."
]
}
],
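The suggested refinement sweep — half-order-of-magnitude steps around $\epsilon=10^{-4}$, $\gamma=1$ — amounts to a small log-spaced grid. A minimal sketch (the grid bounds are one illustrative choice):

```python
import itertools

import numpy as np

# Half-order-of-magnitude grid centered on eps = 1e-4, gamma = 1.
epsilons = [10.0 ** e for e in np.arange(-4.5, -3.4, 0.5)]  # 10^-4.5, 10^-4, 10^-3.5
gammas = [10.0 ** g for g in np.arange(-0.5, 0.6, 0.5)]     # 10^-0.5, 1, 10^0.5
grid = list(itertools.product(epsilons, gammas))

for eps, gamma in grid:
    # Each pair would get its own short SGLD run, with loss traces and
    # MALA acceptance rates inspected as in the sweep above.
    print(f"eps={eps:.2e}  gamma={gamma:.2f}")
```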