Merged
186 changes: 151 additions & 35 deletions docs/tutorials/formula_interface/formula_interface.ipynb
@@ -18,7 +18,7 @@
"\n",
"This tutorial reimplements and extends the combined frequency-severity model from Chapter 4 of the [GLM tutorial](tutorials/glm_french_motor_tutorial/glm_french_motor.html). If you would like to know more about the setting, the data, or GLM modeling in general, please check that out first.\n",
"\n",
"**Sneak Peak**\n",
"**Sneak Peek**\n",
"\n",
"Formulas can provide a concise and convenient way to specify many of the usual pre-processing steps, such as converting to categorical types, creating interactions, applying transformations, or even spline interpolation. As an example, consider the following formula:\n",
"\n",
@@ -52,13 +52,19 @@
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:11.302339Z",
"iopub.status.busy": "2026-03-16T16:10:11.302010Z",
"iopub.status.idle": "2026-03-16T16:10:12.926504Z",
"shell.execute_reply": "2026-03-16T16:10:12.926009Z"
}
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import pytest\n",
"import scipy.optimize as optimize\n",
"import scipy.stats\n",
"from dask_ml.preprocessing import Categorizer\n",
@@ -83,7 +89,14 @@
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:12.928178Z",
"iopub.status.busy": "2026-03-16T16:10:12.927994Z",
"iopub.status.idle": "2026-03-16T16:10:15.244947Z",
"shell.execute_reply": "2026-03-16T16:10:15.244591Z"
}
},
"outputs": [
{
"data": {
@@ -365,7 +378,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Reproducing the Model From the GLM Turorial<a class=\"anchor\"></a>\n",
"## 2. Reproducing the Model From the GLM Tutorial<a class=\"anchor\"></a>\n",
"\n",
"Now, let us start by fitting a very simple model. As usual, let's divide our samples into a training and a test set so that we get valid out-of-sample goodness-of-fit measures. Perhaps less usually, we do not create separate `y` and `X` data frames for our label and features – the formula will take care of that for us.\n",
"\n",
@@ -379,7 +392,14 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:15.259806Z",
"iopub.status.busy": "2026-03-16T16:10:15.259694Z",
"iopub.status.idle": "2026-03-16T16:10:15.471707Z",
"shell.execute_reply": "2026-03-16T16:10:15.470808Z"
}
},
"outputs": [],
"source": [
"ss = ShuffleSplit(n_splits=1, test_size=0.1, random_state=42)\n",
@@ -398,13 +418,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (`~`), and different prefictors are separated by plus signs (`+`). Thus, formulas provide a concise way of specifying a model without the need to create dataframes by hand."
"This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (`~`), and different predictors are separated by plus signs (`+`). Thus, formulas provide a concise way of specifying a model without the need to create dataframes by hand."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:15.473697Z",
"iopub.status.busy": "2026-03-16T16:10:15.473584Z",
"iopub.status.idle": "2026-03-16T16:10:22.769294Z",
"shell.execute_reply": "2026-03-16T16:10:22.768802Z"
}
},
"outputs": [
{
"data": {
@@ -539,22 +566,29 @@
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:22.770437Z",
"iopub.status.busy": "2026-03-16T16:10:22.770347Z",
"iopub.status.idle": "2026-03-16T16:10:22.819876Z",
"shell.execute_reply": "2026-03-16T16:10:22.819390Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"ClaimNb int64\n",
"Exposure float64\n",
"Area object\n",
"Area str\n",
"VehPower int64\n",
"VehAge int64\n",
"DrivAge int64\n",
"BonusMalus int64\n",
"VehBrand object\n",
"VehGas object\n",
"VehBrand str\n",
"VehGas str\n",
"Density int64\n",
"Region object\n",
"Region str\n",
"ClaimAmount float64\n",
"ClaimAmountCut float64\n",
"PurePremium float64\n",
@@ -577,13 +611,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the `C()` function. Strings, such as `VehBrand` or `VehGas` would have been handled as categorical by default anyway, but using the `C()` function never hurts: if applied to something that is already a caetgorical variable, it does not have any effect outside of the feature name."
"Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the `C()` function. Strings, such as `VehBrand` or `VehGas` would have been handled as categorical by default anyway, but using the `C()` function never hurts: if applied to something that is already a categorical variable, it does not have any effect outside of the feature name."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:22.821622Z",
"iopub.status.busy": "2026-03-16T16:10:22.821528Z",
"iopub.status.idle": "2026-03-16T16:10:30.355393Z",
"shell.execute_reply": "2026-03-16T16:10:30.355061Z"
}
},
"outputs": [
{
"data": {
@@ -719,13 +760,20 @@
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:30.356740Z",
"iopub.status.busy": "2026-03-16T16:10:30.356657Z",
"iopub.status.idle": "2026-03-16T16:10:30.401890Z",
"shell.execute_reply": "2026-03-16T16:10:30.401512Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([303.77443311, 548.47789523, 244.34438579, ..., 109.81572865,\n",
" 67.98332028, 297.21717383])"
" 67.98332028, 297.21717383], shape=(67802,))"
]
},
"execution_count": 7,
@@ -743,15 +791,22 @@
"source": [
"## 4. Interactions and Structural Full-Rankness<a class=\"anchor\"></a>\n",
"\n",
"One of the biggest strengths of Wilkinson-formuals lie in their ability of concisely specifying interactions between terms. `glum` implements this as well, and in a very efficient way: the interactions of categorical features are encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. If this is not possible, because, for example, a categorical is interacted with a numeric variable, sparse representations are used when appropriate. In general, just as with `glum`'s categorical handling in general, you can be assured that `glum` you don't have to worry too much about the actual implementation, and can expect that `glum` will do the most efficient thing behind the scenes.\n",
"One of the biggest strengths of Wilkinson-formulas lie in their ability of concisely specifying interactions between terms. `glum` implements this as well, and in a very efficient way: the interactions of categorical features are encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. If this is not possible, because, for example, a categorical is interacted with a numeric variable, sparse representations are used when appropriate. In general, just as with `glum`'s categorical handling in general, you can be assured that you don't have to worry too much about the actual implementation, and can expect that `glum` will do the most efficient thing behind the scenes.\n",
"\n",
"Let's see how that looks like on the insurance example! Suppose that we expect `VehPower` to have a different effect depending on `DrivAge` (e.g. performance cars might not be great for new drivers, but may be less problematic for more experienced ones). We can include the interaction of these variables as follows."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:30.403259Z",
"iopub.status.busy": "2026-03-16T16:10:30.403170Z",
"iopub.status.idle": "2026-03-16T16:10:38.273339Z",
"shell.execute_reply": "2026-03-16T16:10:38.272842Z"
}
},
"outputs": [
{
"data": {
@@ -891,7 +946,14 @@
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:38.274479Z",
"iopub.status.busy": "2026-03-16T16:10:38.274413Z",
"iopub.status.idle": "2026-03-16T16:10:38.276592Z",
"shell.execute_reply": "2026-03-16T16:10:38.276208Z"
}
},
"outputs": [
{
"data": {
@@ -976,15 +1038,22 @@
" 2. Include the logarithm of a certain variable in the model.\n",
" 3. Include a basis spline interpolation of a variable to capture non-linearities in its effect.\n",
"\n",
"1\\. works because because formulas can contain [Python operations](https://matthewwardrop.github.io/formulaic/guides/grammar/). 2. and 3. work because formulas are evaluated within a context that is aware of a number of [transforms](https://matthewwardrop.github.io/formulaic/guides/transforms/). To be precise, 2. is a regular transform and 3. is a stateful transform.\n",
"1\\. works because formulas can contain [Python operations](https://matthewwardrop.github.io/formulaic/guides/grammar/). 2. and 3. work because formulas are evaluated within a context that is aware of a number of [transforms](https://matthewwardrop.github.io/formulaic/guides/transforms/). To be precise, 2. is a regular transform and 3. is a stateful transform.\n",
"\n",
"Let's try it out!"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:38.277724Z",
"iopub.status.busy": "2026-03-16T16:10:38.277661Z",
"iopub.status.idle": "2026-03-16T16:10:47.196866Z",
"shell.execute_reply": "2026-03-16T16:10:47.196455Z"
}
},
"outputs": [
{
"data": {
@@ -1118,7 +1187,14 @@
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:47.197980Z",
"iopub.status.busy": "2026-03-16T16:10:47.197903Z",
"iopub.status.idle": "2026-03-16T16:10:47.783163Z",
"shell.execute_reply": "2026-03-16T16:10:47.782765Z"
}
},
"outputs": [
{
"data": {
@@ -1201,13 +1277,20 @@
"source": [
"### Variable Names\n",
"\n",
"`glum`'s formula interface provides a lot of control over how the resulting features are named. By default, it follows `formulaic`'s standards, but it can be customized by setting the `interaction_separator` and `categorical_format` paremeters."
"`glum`'s formula interface provides a lot of control over how the resulting features are named. By default, it follows `formulaic`'s standards, but it can be customized by setting the `interaction_separator` and `categorical_format` parameters."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:47.784411Z",
"iopub.status.busy": "2026-03-16T16:10:47.784343Z",
"iopub.status.idle": "2026-03-16T16:10:50.289682Z",
"shell.execute_reply": "2026-03-16T16:10:50.289219Z"
}
},
"outputs": [
{
"data": {
@@ -1345,12 +1428,27 @@
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:50.290873Z",
"iopub.status.busy": "2026-03-16T16:10:50.290798Z",
"iopub.status.idle": "2026-03-16T16:10:50.305162Z",
"shell.execute_reply": "2026-03-16T16:10:50.304698Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Caught expected ValueError: The formula sets the intercept to False, contradicting fit_intercept=True. You should use fit_intercept to specify the intercept.\n"
]
}
],
"source": [
"formula_noint = \"PurePremium ~ DrivAge * VehPower - 1\"\n",
"\n",
"with pytest.raises(ValueError, match=\"The formula sets the intercept to False\"):\n",
"try:\n",
" t_glm8 = GeneralizedLinearRegressor(\n",
" family=TweedieDist,\n",
" alpha_search=True,\n",
@@ -1359,7 +1457,11 @@
" formula=formula_noint,\n",
" interaction_separator=\"__x__\",\n",
" categorical_format=\"{name}__{category}\",\n",
" ).fit(df_train, sample_weight=df[\"Exposure\"].values[train])"
" ).fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n",
" raise AssertionError(\"Expected ValueError was not raised\")\n",
"except ValueError as e:\n",
" assert \"The formula sets the intercept to False\" in str(e)\n",
" print(f\"Caught expected ValueError: {e}\")"
]
},
{
@@ -1374,7 +1476,14 @@
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:50.306335Z",
"iopub.status.busy": "2026-03-16T16:10:50.306261Z",
"iopub.status.idle": "2026-03-16T16:10:52.806289Z",
"shell.execute_reply": "2026-03-16T16:10:52.805883Z"
}
},
"outputs": [
{
"data": {
@@ -1515,8 +1624,15 @@
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2026-03-16T16:10:52.807428Z",
"iopub.status.busy": "2026-03-16T16:10:52.807354Z",
"iopub.status.idle": "2026-03-16T16:10:54.748684Z",
"shell.execute_reply": "2026-03-16T16:10:54.748153Z"
}
},
"outputs": [
{
"data": {
@@ -1628,6 +1744,7 @@
"\n",
"t_glm9 = GeneralizedLinearRegressor(\n",
" family=TweedieDist,\n",
" \n",
" alpha_search=True,\n",
" l1_ratio=1,\n",
" fit_intercept=False,\n",
@@ -1661,9 +1778,8 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
},
"orig_nbformat": 4
"version": "3.14.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
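
The hunk at `@@ -1345,12 +1428,27 @@` drops the `pytest` import and replaces `pytest.raises(ValueError, match=...)` with a plain `try`/`except` that asserts on the message. A minimal, dependency-free sketch of that pattern follows. Note that `fit_model` and `check_raises_value_error` are hypothetical stand-ins written for illustration; they are not part of `glum` or of the notebook, and the real check in `glum` may be implemented differently.

```python
def fit_model(formula: str, fit_intercept: bool = True) -> str:
    """Hypothetical stand-in for the notebook's model fit (not glum's API)."""
    # Treat a trailing "- 1" as "the formula removes the intercept".
    removes_intercept = formula.replace(" ", "").endswith("-1")
    if removes_intercept and fit_intercept:
        raise ValueError(
            "The formula sets the intercept to False, "
            "contradicting fit_intercept=True."
        )
    return "fitted"


def check_raises_value_error(func, expected_fragment: str, *args, **kwargs) -> str:
    """Call func and return the message if it raises the expected ValueError."""
    try:
        func(*args, **kwargs)
    except ValueError as e:
        # Equivalent in spirit to pytest's `match=`, but a plain substring test.
        assert expected_fragment in str(e)
        return str(e)
    # Reached only when no exception was raised at all.
    raise AssertionError("Expected ValueError was not raised")


message = check_raises_value_error(
    fit_model,
    "The formula sets the intercept to False",
    "PurePremium ~ DrivAge * VehPower - 1",
)
print(f"Caught expected ValueError: {message}")
```

One difference worth keeping in mind: `pytest.raises(match=...)` applies a regular-expression search, whereas the substring check here (and in the updated cell) compares literal text, so any regex metacharacters in the expected message need no escaping.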