2 changes: 1 addition & 1 deletion examples/dlns.ipynb
Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@
"source": [
"# Deep Linear Networks\n",
"\n",
"Warning: This notebook is currently functional, but does not show actually reproduce the results it's attempting to. Use only as inspiration, not as gospel. For more well-calibrated LLC estimates of a simple linear network, see tests/slt/rrr_test.py.\n",
"Warning: This notebook is currently functional, but does not actually reproduce the results it's attempting to. Use only as inspiration, not as gospel. For more well-calibrated LLC estimates of a simple linear network, see tests/slt/rrr_test.py.\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timaeus-research/devinterp/blob/main/examples/dlns.ipynb)\n",
"\n",
2 changes: 1 addition & 1 deletion examples/epsilon_beta.ipynb
@@ -9,7 +9,7 @@
"\n",
"This notebook aims to visualize $\\hat{\\lambda}_n^\\beta$ for various values of $\\beta$ (inverse temperature) and $\\epsilon$ (step size). \n",
"\n",
"Roughly (per Adrian Xu's master thesis):\n",
"Roughly (per Adrian Xu's master's thesis):\n",
"- $\\beta$ can be tuned via graphing $\\hat{\\lambda}_n^\\beta$ for a sweep of $\\beta$, and using $\\beta$ in a range around the critical points on the graph.\n",
"- $\\epsilon$ should be the greatest possible value that doesn't cause excessive numerical instability or cause the SGLD chains to fail to converge. An MALA proposal acceptance rate (see `sgld_calibration.ipynb`) between 0.9 and 0.95 is roughly optimal."
]
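The quantity $\hat{\lambda}_n^\beta$ being swept in this notebook is the mean sampled loss in excess of the initial loss, scaled by the effective inverse temperature $n\beta$. A minimal sketch with made-up loss values (`llc_estimate` is a hypothetical helper for illustration, not a devinterp API):

```python
import numpy as np

def llc_estimate(sampled_losses, init_loss, n, beta):
    # WBIC-style estimator: scale the excess of the mean sampled loss
    # over the initial loss by the effective inverse temperature n * beta.
    return float(n * beta * (np.mean(sampled_losses) - init_loss))

# Toy numbers: a dataset of n = 1000 points, beta = 1 / log(n) (the
# standard choice), and a few losses observed along an SGLD chain.
n = 1000
beta = 1.0 / np.log(n)
chain_losses = [0.52, 0.55, 0.54, 0.53]
print(llc_estimate(chain_losses, init_loss=0.50, n=n, beta=beta))  # ≈ 5.07
```

Sweeping $\beta$ and $\epsilon$ then amounts to recomputing this quantity for each hyperparameter pair and looking at how it moves.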
6 changes: 3 additions & 3 deletions examples/grokking.ipynb
@@ -19,7 +19,7 @@
"source": [
"This notebook aims to show how LLC estimation is calibrated in a simple modular addition grokking example, showing a moderately interesting result at the end.\n",
"\n",
"We'll starting off with some standard grokking code, adapted loosely from Nina Panickssery and Dmitry Vaintrob's [modular addition learning coefficient post](https://www.alignmentforum.org/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition) and [github code repo](https://github.com/nrimsky/devinterp). (Thank you for your help!)"
"We'll start off with some standard grokking code, adapted loosely from Nina Panickssery and Dmitry Vaintrob's [modular addition learning coefficient post](https://www.alignmentforum.org/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition) and [github code repo](https://github.com/nrimsky/devinterp). (Thank you for your help!)"
]
},
{
@@ -426,7 +426,7 @@
"id": "bP4h1UJEeP94"
},
"source": [
"In order to get LLC estimates for this simple grokking model over training, we first need to choose hyperparameters. The most important ones to calibrate are epsilon (the SGLD learning rate / step size) and n\\*beta (the effective inverse temperature). Let's run a quick sweep over a wide range of epsilon and n\\*beta, and look for a range of values within this which shows little change in LLC change in LLC values when we change epsilon and nbeta. We can use `devinterp.vis_utils.EpsilonBetaAnalyzer` for this."
"In order to get LLC estimates for this simple grokking model over training, we first need to choose hyperparameters. The most important ones to calibrate are epsilon (the SGLD learning rate / step size) and n\\*beta (the effective inverse temperature). Let's run a quick sweep over a wide range of epsilon and n\\*beta, and look for a range of values within this which shows little change in LLC values when we change epsilon and nbeta. We can use `devinterp.vis_utils.EpsilonBetaAnalyzer` for this."
]
},
{
@@ -650,7 +650,7 @@
"id": "DlGfy4ZAeP-C"
},
"source": [
"From this, we can see that the effective sampled loss for low-ish nbetas (<100) shows very little dependence on the exact choice of nbeta. So let's a point in this flat region (~1), and a high-but-still-in-the-flat-region epsilon (0.03), so we don't need to run many draws, but still have little dependence of our samples on epsilon.\n",
"From this, we can see that the effective sampled loss for low-ish nbetas (<100) shows very little dependence on the exact choice of nbeta. So let's pick a point in this flat region (~1), and a high-but-still-in-the-flat-region epsilon (0.003), so we don't need to run many draws, but still have little dependence of our samples on epsilon.\n",
"\n",
"Let's check that the loss chain for these hyperparams looks decent, and then run LLC estimation on all trained checkpoints if it does."
]
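The "flat region" selection described in this notebook can be mimicked numerically. A minimal sketch with made-up sweep numbers (in practice the grid would come from `EpsilonBetaAnalyzer`; the flatness threshold of 0.5 is an arbitrary choice for these toy values):

```python
import numpy as np

# Hypothetical sweep output: llc[i, j] is the LLC estimate obtained with
# step size epsilons[i] and effective inverse temperature nbetas[j].
epsilons = np.array([3e-4, 1e-3, 3e-3, 1e-2])
nbetas = np.array([0.3, 1.0, 3.0, 10.0])
llc = np.array([
    [4.1, 4.2, 4.3, 4.9],
    [4.0, 4.1, 4.2, 5.0],
    [4.1, 4.2, 4.3, 5.2],
    [9.7, 11.3, 14.8, 30.1],  # step size too large: estimates blow up
])

def spread(i, j):
    # Max minus min of the estimate over the 3x3 neighborhood; small
    # spread means the estimate barely moves when eps or nbeta changes.
    neigh = llc[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
    return float(neigh.max() - neigh.min())

scores = np.array([[spread(i, j) for j in range(llc.shape[1])]
                   for i in range(llc.shape[0])])
flat = scores < 0.5  # threshold chosen by eye for these toy numbers

# Prefer the largest epsilon that still sits in the flat region, so that
# fewer draws are needed per chain.
i, j = max(zip(*np.nonzero(flat)), key=lambda ij: epsilons[ij[0]])
print(f"pick eps={epsilons[i]:.0e}, nbeta={nbetas[j]}")
```

With these toy numbers the pick lands at the largest step size whose neighborhood of estimates is still stable, which is the same logic applied by eye in the notebook.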
10 changes: 5 additions & 5 deletions examples/introduction.ipynb
@@ -121,9 +121,9 @@
"\n",
"## Local Learning Coefficients\n",
"\n",
"The first method we have have online is local learning coefficient estimation ([Lau et al. 2023](https://arxiv.org/abs/2308.12108)). \n",
"The first method we have online is local learning coefficient estimation ([Lau et al. 2023](https://arxiv.org/abs/2308.12108)). \n",
"\n",
"For an in-depth explaination, see [this post](https://www.lesswrong.com/posts/6g8cAftfQufLmFDYT/you-re-counting-your-parameters-wrong). The short version is that: \n",
"For an in-depth explanation, see [this post](https://www.lesswrong.com/posts/6g8cAftfQufLmFDYT/you-re-counting-your-parameters-wrong). The short version is that: \n",
"- The (local) learning coefficient $\\hat\\lambda$ is the \"correct\" measure of model complexity. Besides the loss, it's the most principled high-level way to compare models.\n",
"- We can cheaply estimate the learning coefficient associated to a choice of weights $\\hat w^*$ by using the following formula:\n",
"\n",
@@ -354,7 +354,7 @@
"\n",
"Below you'll see what's actually happening when you run `local_learning_coefficients`.\n",
"\n",
"We sample 10 different chains, with the same starting positions but different batch schedules and noise realizations at each step. For each of these chains, we take 200 steps using SGLD. We observe the loss at each of these points. At the end, we average the loss across chains, compare it to the initial loss, and apply a correction that depends on the dataset size to get the local learning coefficient. \n",
"We sample 3 different chains, with the same starting positions but different batch schedules and noise realizations at each step. For each of these chains, we take 100 steps using SGLD. We observe the loss at each of these points. At the end, we average the loss across chains, compare it to the initial loss, and apply a correction that depends on the dataset size to get the local learning coefficient. \n",
"\n",
"For a healthy chain, the Loss Trace should increase rapidly at first and then level off."
]
@@ -406,7 +406,7 @@
"- [`mnist.ipynb`](../examples/mnist.ipynb) showing how we can use LLC estimation to assess relative LLCs of MNIST models trained with different optimizers.\n",
"- [`sgld_calibration.ipynb`](../examples/sgld_calibration.ipynb) shows how to gain confidence in applying SGLD-based LLC estimation to a model with unknown LLC.\n",
"- [`diagnostics.ipynb`](../examples/diagnostics.ipynb) shows how to use callbacks to diagnose if your sampling is going well.\n",
"- [`epsilon_beta.ipynb`](../examples/epsilon_beta.ipynb) shows how to use use a callback to calibrate SGLD hyperparameters.\n",
"- [`epsilon_beta.ipynb`](../examples/epsilon_beta.ipynb) shows how to use a callback to calibrate SGLD hyperparameters.\n",
"\n",
"For a small demo of how to use the library to study grokking, see [`grokking.ipynb`](../examples/grokking.ipynb)."
]
@@ -432,7 +432,7 @@
"- **Progress measures**. If you have an understanding of some structure at the end of training, you can roll that understanding backwards to track how that structure develops over time. \n",
"- **Probes**. Similarly, you can train a linear probe from activations onto features, then roll that probe back to previous checkpoints to measure how those features are learned. \n",
"- **Gradients**. Just look at the gradients! \n",
"- **Evals**. You can measure performance on a targeted benchmarks to track when the model learns the associated capabilities. \n",
"- **Evals**. You can measure performance on targeted benchmarks to track when the model learns the associated capabilities. \n",
"- **Covariance estimators**. That's a secret for now. More coming soon!"
]
}
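The procedure described in the changed cell — several SGLD chains with the same start but different noise, losses averaged and compared to the initial loss, scaled by a dataset-size-dependent factor — can be sketched end to end on a toy problem. Everything here is illustrative: the quadratic stands in for a real model's loss, and `sgld_llc` is a hypothetical helper, not the library's `local_learning_coefficients`:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Toy quadratic loss standing in for a trained model's loss landscape.
    return 0.5 * float(w @ w)

def grad(w):
    return w  # gradient of the quadratic above

def sgld_llc(w_star, n=1000, chains=3, steps=100, eps=1e-4, gamma=100.0):
    beta = 1.0 / np.log(n)  # standard effective inverse temperature
    init = loss(w_star)
    sampled = []
    for _ in range(chains):
        w = w_star.copy()  # same start; noise differs chain to chain
        for _ in range(steps):
            noise = rng.normal(size=w.shape)
            # SGLD step with a localization term gamma * (w - w*)
            w = (w - (eps / 2) * (n * beta * grad(w) + gamma * (w - w_star))
                 + np.sqrt(eps) * noise)
            sampled.append(loss(w))
    # Average sampled loss vs. initial loss, scaled by n * beta.
    return n * beta * (np.mean(sampled) - init)

print(sgld_llc(np.zeros(5)))
```

For a healthy chain, `loss(w)` along the inner loop rises quickly from `init` and then levels off, matching the loss-trace behavior the notebook describes.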
4 changes: 2 additions & 2 deletions examples/mnist.ipynb
@@ -182,7 +182,7 @@
"metadata": {},
"outputs": [],
"source": [
"def emtpy_func():\n",
"def empty_func():\n",
" return (), ()\n",
"\n",
"\n",
@@ -198,7 +198,7 @@
" if model_key == \"sgd\":\n",
" optimizer.step()\n",
" else:\n",
" optimizer.step(emtpy_func, model, criterion)\n",
" optimizer.step(empty_func, model, criterion)\n",
" return train_loss / len(train_loader)\n",
"\n",
"\n",
16 changes: 8 additions & 8 deletions examples/sgld_calibration.ipynb
@@ -818,7 +818,7 @@
"id": "9d7e0ef8",
"metadata": {},
"source": [
"Judging by this, $\\epsilon = 1e^{-2}$ is out (did not converge), and a $\\gamma$ of $1$ is too low. A higher MALA acceptance prob would be better (ideally we'd aim for $.9$) but that might not be possible for this model. The higher learning rate is generally preferred, but we have to be careful not to get a thermalization peak at the start of sampling. Let's take a look at the loss curves next to check if our sampling works as expected."
"Judging by this, $\\epsilon = 10^{-2}$ is out (did not converge), and a $\\gamma$ of $1$ is too low. A higher MALA acceptance prob would be better (ideally we'd aim for $.9$) but that might not be possible for this model. The higher learning rate is generally preferred, but we have to be careful not to get a thermalization peak at the start of sampling. Let's take a look at the loss curves next to check if our sampling works as expected."
]
},
{
@@ -926,7 +926,7 @@
"id": "cf876f0b-ab5f-4601-8b21-c2127882b09f",
"metadata": {},
"source": [
"Let's try running more samples on the $\\epsilon=1e^{-4},\\ \\gamma=100$ case to see if it flattens out."
"Let's try running more samples on the $\\epsilon=10^{-4},\\ \\gamma=100$ case to see if it flattens out."
]
},
{
@@ -1003,7 +1003,7 @@
"id": "1b0fc954-8101-46b9-9d7d-a30f7dd93884",
"metadata": {},
"source": [
"### 5. Heuristics for selecting $\\epsilon$ and $\\gamma$"
"### 4. Heuristics for selecting $\\epsilon$ and $\\gamma$"
]
},
{
@@ -1028,18 +1028,18 @@
"id": "20177ce1-96f0-4b7c-ae5b-60aa579e0457",
"metadata": {},
"source": [
"### 6. Selecting $\\epsilon$ and $\\gamma$ in this MNIST example"
"### 5. Selecting $\\epsilon$ and $\\gamma$ in this MNIST example"
]
},
{
"cell_type": "markdown",
"id": "7c6b07e8-83be-4a14-a66e-7ae27d5ea902",
"metadata": {},
"source": [
"- $\\epsilon=1e^{-2}$ is definitely a no-go, since the values quickly diverge to NaN.\n",
"- $\\epsilon=1e^{-3}$ causes a big spike in the initial LLC estimation. This should be avoided.\n",
"- With enough draws, the LLC estimation for $\\epsilon=1e^{-4}, \\ \\gamma=1$ looks like it will converge nicely. The loss traces are basically flattened out as well, which is another indication that the LLC estimation should continue to converge without issue. In this first sweep, this would be my recommendation for hyperparameters.\n",
"- If more refinement is needed (e.g. it's necessary for LLC estimation to converge in a fewer number of draws), then another option would be to sweep with more granular values (say, a half order of magnitude) around $\\epsilon=1e^{-4}, \\ \\gamma=1$"
"- $\\epsilon=10^{-2}$ is definitely a no-go, since the values quickly diverge to NaN.\n",
"- $\\epsilon=10^{-3}$ causes a big spike in the initial LLC estimation. This should be avoided.\n",
"- With enough draws, the LLC estimation for $\\epsilon=10^{-4}, \\ \\gamma=1$ looks like it will converge nicely. The loss traces are basically flattened out as well, which is another indication that the LLC estimation should continue to converge without issue. In this first sweep, this would be my recommendation for hyperparameters.\n",
"- If more refinement is needed (e.g. it's necessary for LLC estimation to converge in fewer draws), then another option would be to sweep with more granular values (say, a half order of magnitude) around $\epsilon=10^{-4}, \ \gamma=1$."
]
}
],
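The suggested refinement sweep — half-order-of-magnitude steps around $\epsilon=10^{-4}$, $\gamma=1$ — amounts to a small log-spaced grid. A minimal sketch (the grid bounds are one illustrative choice):

```python
import itertools

import numpy as np

# Half-order-of-magnitude grid centered on eps = 1e-4, gamma = 1.
epsilons = [10.0 ** e for e in np.arange(-4.5, -3.4, 0.5)]  # 10^-4.5, 10^-4, 10^-3.5
gammas = [10.0 ** g for g in np.arange(-0.5, 0.6, 0.5)]     # 10^-0.5, 1, 10^0.5
grid = list(itertools.product(epsilons, gammas))

for eps, gamma in grid:
    # Each pair would get its own short SGLD run, with loss traces and
    # MALA acceptance rates inspected as in the sweep above.
    print(f"eps={eps:.2e}  gamma={gamma:.2f}")
```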