Quantco · Matthias Schmidtblaicher (MatthiasSchmidtblaicherQC) · Mar 18, 2026 · Mar 16, 2026 · Mar 16, 2026
@@ -18,7 +18,7 @@
     "\n",
     "This tutorial reimplements and extends the combined frequency-severity model from Chapter 4 of the [GLM tutorial](tutorials/glm_french_motor_tutorial/glm_french_motor.html). If you would like to know more about the setting, the data, or GLM modeling in general, please check that out first.\n",
     "\n",
-    "**Sneak Peak**\n",
+    "**Sneak Peek**\n",
     "\n",
     "Formulas can provide a concise and convenient way to specify many of the usual pre-processing steps, such as converting to categorical types, creating interactions, applying transformations, or even spline interpolation. As an example, consider the following formula:\n",
     "\n",
@@ -52,13 +52,19 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:11.302339Z",
+     "iopub.status.busy": "2026-03-16T16:10:11.302010Z",
+     "iopub.status.idle": "2026-03-16T16:10:12.926504Z",
+     "shell.execute_reply": "2026-03-16T16:10:12.926009Z"
+    }
+   },
    "outputs": [],
    "source": [
     "import matplotlib.pyplot as plt\n",
     "import numpy as np\n",
     "import pandas as pd\n",
-    "import pytest\n",
     "import scipy.optimize as optimize\n",
     "import scipy.stats\n",
     "from dask_ml.preprocessing import Categorizer\n",
@@ -83,7 +89,14 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:12.928178Z",
+     "iopub.status.busy": "2026-03-16T16:10:12.927994Z",
+     "iopub.status.idle": "2026-03-16T16:10:15.244947Z",
+     "shell.execute_reply": "2026-03-16T16:10:15.244591Z"
+    }
+   },
    "outputs": [
     {
      "data": {
@@ -365,7 +378,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 2. Reproducing the Model From the GLM Turorial<a class=\"anchor\"></a>\n",
+    "## 2. Reproducing the Model From the GLM Tutorial<a class=\"anchor\"></a>\n",
     "\n",
     "Now, let us start by fitting a very simple model. As usual, let's divide our samples into a training and a test set so that we get valid out-of-sample goodness-of-fit measures. Perhaps less usually, we do not create separate `y` and `X` data frames for our label and features – the formula will take care of that for us.\n",
     "\n",
@@ -379,7 +392,14 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:15.259806Z",
+     "iopub.status.busy": "2026-03-16T16:10:15.259694Z",
+     "iopub.status.idle": "2026-03-16T16:10:15.471707Z",
+     "shell.execute_reply": "2026-03-16T16:10:15.470808Z"
+    }
+   },
    "outputs": [],
    "source": [
     "ss = ShuffleSplit(n_splits=1, test_size=0.1, random_state=42)\n",
@@ -398,13 +418,20 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (`~`), and different prefictors are separated by plus signs (`+`). Thus, formulas provide a concise way of specifying a model without the need to create dataframes by hand."
+    "This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (`~`), and different predictors are separated by plus signs (`+`). Thus, formulas provide a concise way of specifying a model without the need to create dataframes by hand."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 4,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:15.473697Z",
+     "iopub.status.busy": "2026-03-16T16:10:15.473584Z",
+     "iopub.status.idle": "2026-03-16T16:10:22.769294Z",
+     "shell.execute_reply": "2026-03-16T16:10:22.768802Z"
+    }
+   },
    "outputs": [
     {
      "data": {
@@ -539,22 +566,29 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:22.770437Z",
+     "iopub.status.busy": "2026-03-16T16:10:22.770347Z",
+     "iopub.status.idle": "2026-03-16T16:10:22.819876Z",
+     "shell.execute_reply": "2026-03-16T16:10:22.819390Z"
+    }
+   },
    "outputs": [
     {
      "data": {
       "text/plain": [
        "ClaimNb             int64\n",
        "Exposure          float64\n",
-       "Area               object\n",
+       "Area                  str\n",
        "VehPower            int64\n",
        "VehAge              int64\n",
        "DrivAge             int64\n",
        "BonusMalus          int64\n",
-       "VehBrand           object\n",
-       "VehGas             object\n",
+       "VehBrand              str\n",
+       "VehGas                str\n",
        "Density             int64\n",
-       "Region             object\n",
+       "Region                str\n",
        "ClaimAmount       float64\n",
        "ClaimAmountCut    float64\n",
        "PurePremium       float64\n",
@@ -577,13 +611,20 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the `C()` function. Strings, such as `VehBrand` or `VehGas` would have been handled as categorical by default anyway, but using the `C()` function never hurts: if applied to something that is already a caetgorical variable, it does not have any effect outside of the feature name."
+    "Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the `C()` function. Strings, such as `VehBrand` or `VehGas`, would have been handled as categorical by default anyway, but using the `C()` function never hurts: if applied to something that is already a categorical variable, it has no effect other than on the feature name."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 6,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:22.821622Z",
+     "iopub.status.busy": "2026-03-16T16:10:22.821528Z",
+     "iopub.status.idle": "2026-03-16T16:10:30.355393Z",
+     "shell.execute_reply": "2026-03-16T16:10:30.355061Z"
+    }
+   },
    "outputs": [
     {
      "data": {
@@ -719,13 +760,20 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:30.356740Z",
+     "iopub.status.busy": "2026-03-16T16:10:30.356657Z",
+     "iopub.status.idle": "2026-03-16T16:10:30.401890Z",
+     "shell.execute_reply": "2026-03-16T16:10:30.401512Z"
+    }
+   },
    "outputs": [
     {
      "data": {
       "text/plain": [
        "array([303.77443311, 548.47789523, 244.34438579, ..., 109.81572865,\n",
-       "        67.98332028, 297.21717383])"
+       "        67.98332028, 297.21717383], shape=(67802,))"
       ]
      },
      "execution_count": 7,
@@ -743,15 +791,22 @@
    "source": [
     "## 4. Interactions and Structural Full-Rankness<a class=\"anchor\"></a>\n",
     "\n",
-    "One of the biggest strengths of Wilkinson-formuals lie in their ability of concisely specifying interactions between terms. `glum` implements this as well, and in a very efficient way: the interactions of categorical features are encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. If this is not possible, because, for example, a categorical is interacted with a numeric variable, sparse representations are used when appropriate. In general, just as with `glum`'s categorical handling in general, you can be assured that `glum` you don't have to worry too much about the actual implementation, and can expect that `glum` will do the most efficient thing behind the scenes.\n",
+    "One of the biggest strengths of Wilkinson formulas lies in their ability to concisely specify interactions between terms. `glum` implements this as well, and in a very efficient way: the interactions of categorical features are encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. If this is not possible, because, for example, a categorical is interacted with a numeric variable, sparse representations are used when appropriate. Just as with `glum`'s categorical handling in general, you don't have to worry too much about the actual implementation, and can expect that `glum` will do the most efficient thing behind the scenes.\n",
     "\n",
     "Let's see how that looks like on the insurance example! Suppose that we expect `VehPower` to have a different effect depending on `DrivAge` (e.g. performance cars might not be great for new drivers, but may be less problematic for more experienced ones). We can include the interaction of these variables as follows."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 8,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:30.403259Z",
+     "iopub.status.busy": "2026-03-16T16:10:30.403170Z",
+     "iopub.status.idle": "2026-03-16T16:10:38.273339Z",
+     "shell.execute_reply": "2026-03-16T16:10:38.272842Z"
+    }
+   },
    "outputs": [
     {
      "data": {
@@ -891,7 +946,14 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:38.274479Z",
+     "iopub.status.busy": "2026-03-16T16:10:38.274413Z",
+     "iopub.status.idle": "2026-03-16T16:10:38.276592Z",
+     "shell.execute_reply": "2026-03-16T16:10:38.276208Z"
+    }
+   },
    "outputs": [
     {
      "data": {
@@ -976,15 +1038,22 @@
     " 2. Include the logarithm of a certain variable in the model.\n",
     " 3. Include a basis spline interpolation of a variable to capture non-linearities in its effect.\n",
     "\n",
-    "1\\. works because because formulas can contain [Python operations](https://matthewwardrop.github.io/formulaic/guides/grammar/). 2. and 3. work because formulas are evaluated within a context that is aware of a number of [transforms](https://matthewwardrop.github.io/formulaic/guides/transforms/). To be precise, 2. is a regular transform and 3. is a stateful transform.\n",
+    "1\\. works because formulas can contain [Python operations](https://matthewwardrop.github.io/formulaic/guides/grammar/). 2. and 3. work because formulas are evaluated within a context that is aware of a number of [transforms](https://matthewwardrop.github.io/formulaic/guides/transforms/). To be precise, 2. is a regular transform and 3. is a stateful transform.\n",
     "\n",
     "Let's try it out!"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 10,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:38.277724Z",
+     "iopub.status.busy": "2026-03-16T16:10:38.277661Z",
+     "iopub.status.idle": "2026-03-16T16:10:47.196866Z",
+     "shell.execute_reply": "2026-03-16T16:10:47.196455Z"
+    }
+   },
    "outputs": [
     {
      "data": {
@@ -1118,7 +1187,14 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:47.197980Z",
+     "iopub.status.busy": "2026-03-16T16:10:47.197903Z",
+     "iopub.status.idle": "2026-03-16T16:10:47.783163Z",
+     "shell.execute_reply": "2026-03-16T16:10:47.782765Z"
+    }
+   },
    "outputs": [
     {
      "data": {
@@ -1201,13 +1277,20 @@
    "source": [
     "### Variable Names\n",
     "\n",
-    "`glum`'s formula interface provides a lot of control over how the resulting features are named. By default, it follows `formulaic`'s standards, but it can be customized by setting the `interaction_separator` and `categorical_format` paremeters."
+    "`glum`'s formula interface provides a lot of control over how the resulting features are named. By default, it follows `formulaic`'s standards, but it can be customized by setting the `interaction_separator` and `categorical_format` parameters."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 12,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:47.784411Z",
+     "iopub.status.busy": "2026-03-16T16:10:47.784343Z",
+     "iopub.status.idle": "2026-03-16T16:10:50.289682Z",
+     "shell.execute_reply": "2026-03-16T16:10:50.289219Z"
+    }
+   },
    "outputs": [
     {
      "data": {
@@ -1345,12 +1428,27 @@
   {
    "cell_type": "code",
    "execution_count": 13,
-   "metadata": {},
-   "outputs": [],
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:50.290873Z",
+     "iopub.status.busy": "2026-03-16T16:10:50.290798Z",
+     "iopub.status.idle": "2026-03-16T16:10:50.305162Z",
+     "shell.execute_reply": "2026-03-16T16:10:50.304698Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Caught expected ValueError: The formula sets the intercept to False, contradicting fit_intercept=True. You should use fit_intercept to specify the intercept.\n"
+     ]
+    }
+   ],
    "source": [
     "formula_noint = \"PurePremium ~ DrivAge * VehPower - 1\"\n",
     "\n",
-    "with pytest.raises(ValueError, match=\"The formula sets the intercept to False\"):\n",
+    "try:\n",
     "    t_glm8 = GeneralizedLinearRegressor(\n",
     "        family=TweedieDist,\n",
     "        alpha_search=True,\n",
@@ -1359,7 +1457,11 @@
     "        formula=formula_noint,\n",
     "        interaction_separator=\"__x__\",\n",
     "        categorical_format=\"{name}__{category}\",\n",
-    "    ).fit(df_train, sample_weight=df[\"Exposure\"].values[train])"
+    "    ).fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n",
+    "    raise AssertionError(\"Expected ValueError was not raised\")\n",
+    "except ValueError as e:\n",
+    "    assert \"The formula sets the intercept to False\" in str(e)\n",
+    "    print(f\"Caught expected ValueError: {e}\")"
    ]
   },
   {
@@ -1374,7 +1476,14 @@
   {
    "cell_type": "code",
    "execution_count": 14,
-   "metadata": {},
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:50.306335Z",
+     "iopub.status.busy": "2026-03-16T16:10:50.306261Z",
+     "iopub.status.idle": "2026-03-16T16:10:52.806289Z",
+     "shell.execute_reply": "2026-03-16T16:10:52.805883Z"
+    }
+   },
    "outputs": [
     {
      "data": {
@@ -1515,8 +1624,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {},
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-03-16T16:10:52.807428Z",
+     "iopub.status.busy": "2026-03-16T16:10:52.807354Z",
+     "iopub.status.idle": "2026-03-16T16:10:54.748684Z",
+     "shell.execute_reply": "2026-03-16T16:10:54.748153Z"
+    }
+   },
    "outputs": [
     {
      "data": {
@@ -1628,6 +1744,7 @@
     "\n",
     "t_glm9 = GeneralizedLinearRegressor(\n",
     "    family=TweedieDist,\n",
+    "    \n",
     "    alpha_search=True,\n",
     "    l1_ratio=1,\n",
     "    fit_intercept=False,\n",
@@ -1661,9 +1778,8 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.12.2"
-  },
-  "orig_nbformat": 4
+   "version": "3.14.3"
+  }
  },
  "nbformat": 4,
  "nbformat_minor": 2