Clustered standard errors differ between `DuckRegression` and `statsmodels`

When running OLS with clustered data, the standard errors from `DuckRegression` and `statsmodels` differ by a non-trivial amount.

Expected behavior:
- Point estimates match (or differ by a tiny amount)
- Standard errors match (or differ by a tiny amount)
 
Actual behavior:
- Point estimates match (or differ by a tiny amount)
- Standard errors differ by a non-trivial amount. In my run with the [example dataset](https://economictheoryblog.com/wp-content/uploads/2016/12/data.xls), `statsmodels` reports SE ≈ 0.032 for the slope and `duckreg` reports SE ≈ 0.0251.

<img width="298" height="69" alt="Image" src="https://github.com/user-attachments/assets/b20ef401-664d-44f4-9374-06cc4b9eaf04" />

What I tried: 
- Reproduced on various real data ([one example dataset](https://economictheoryblog.com/wp-content/uploads/2016/12/data.xls)).
- Reproduced on synthetic data (script below).
- Varied `n_bootstraps`.

Replication with simulated data:

```python
# synthetic_repro.py
import numpy as np
import pandas as pd
import duckdb
from duckreg.estimators import DuckRegression
import statsmodels.formula.api as smf

np.random.seed(123)
n_clusters = 50
cluster_size = 10
N = n_clusters * cluster_size

cluster_ids = np.repeat(np.arange(n_clusters), cluster_size)
class_size = np.random.normal(30, 5, size=N)
# cluster random intercept
u = np.repeat(np.random.normal(0, 2, size=n_clusters), cluster_size)
eps = np.random.normal(0, 5, size=N)
# outcome
id_score = 50 + 0.5 * class_size + u + eps

data = pd.DataFrame({
    "id_score": id_score,
    "class_size": class_size,
    "class_id": cluster_ids
})

# statsmodels
model = smf.ols("id_score ~ class_size", data=data).fit()
ms = model.get_robustcov_results(cov_type='cluster', groups=data['class_id'])
print("statsmodels cluster SE:\n", ms.summary())

# duckreg
conn = duckdb.connect("database.db")
conn.execute("DROP TABLE IF EXISTS test_data")
conn.execute("CREATE TABLE test_data AS SELECT * FROM data")
m = DuckRegression(
    db_name="database.db",
    table_name="test_data",
    formula="id_score ~ class_size",
    cluster_col="class_id",
    n_bootstraps=200,
    seed=21,
)
m.fit()
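print("duckreg summary:")
print(m.summary())  # print() instead of display(), which only exists in notebooks

# side-by-side comparison of the slope estimate and its standard error
comparison = pd.DataFrame(
    [
        [ms.params[1], ms.bse[1]],
        [m.summary()["point_estimate"][1], m.summary()["standard_error"][1]],
    ],
    columns=["Point estimate", "Standard error"],
    index=["Statsmodels", "DuckRegression"],
)
print(comparison)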

```
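To get a reference point that does not depend on either library, the cluster-robust standard errors can also be computed by hand on the same synthetic data. The block below is a minimal NumPy-only sketch, not code from `duckreg` or `statsmodels`. It rests on two assumptions: that the small-sample factor `G/(G-1) * (N-1)/(N-K)` is the one `statsmodels` applies by default for `cov_type='cluster'`, and that a pairs cluster bootstrap (resampling whole clusters with replacement) is a fair stand-in for what `n_bootstraps` does in `DuckRegression`. The helper names `cluster_se` and `pairs_cluster_bootstrap_se` are made up for this illustration.

```python
import numpy as np


def cluster_se(X, y, groups, correction=True):
    """OLS coefficients and sandwich cluster-robust standard errors."""
    n, k = X.shape
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    labels = np.unique(groups)
    meat = np.zeros((k, k))
    for g in labels:
        mask = groups == g
        score = X[mask].T @ resid[mask]  # sum of x_i * e_i within one cluster
        meat += np.outer(score, score)
    cov = bread @ meat @ bread
    if correction:
        # Assumed small-sample factor: G/(G-1) * (N-1)/(N-K)
        n_groups = len(labels)
        cov = cov * (n_groups / (n_groups - 1)) * ((n - 1) / (n - k))
    return beta, np.sqrt(np.diag(cov))


def pairs_cluster_bootstrap_se(X, y, groups, n_boot=500, seed=0):
    """Bootstrap SEs from resampling whole clusters with replacement."""
    rng = np.random.default_rng(seed)
    labels = np.unique(groups)
    rows_by_cluster = [np.flatnonzero(groups == g) for g in labels]
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        picked = rng.integers(0, len(labels), size=len(labels))
        rows = np.concatenate([rows_by_cluster[j] for j in picked])
        Xb, yb = X[rows], y[rows]
        draws[b] = np.linalg.solve(Xb.T @ Xb, Xb.T @ yb)
    return draws.std(axis=0, ddof=1)


# Same data-generating process and seed as synthetic_repro.py
np.random.seed(123)
n_clusters, cluster_size = 50, 10
N = n_clusters * cluster_size
groups = np.repeat(np.arange(n_clusters), cluster_size)
class_size = np.random.normal(30, 5, size=N)
u = np.repeat(np.random.normal(0, 2, size=n_clusters), cluster_size)
eps = np.random.normal(0, 5, size=N)
y = 50 + 0.5 * class_size + u + eps
X = np.column_stack([np.ones(N), class_size])  # intercept + slope

beta, se_cr1 = cluster_se(X, y, groups, correction=True)
_, se_cr0 = cluster_se(X, y, groups, correction=False)
se_boot = pairs_cluster_bootstrap_se(X, y, groups)

print("slope estimate:          ", beta[1])
print("sandwich SE (corrected): ", se_cr1[1])
print("sandwich SE (uncorrected):", se_cr0[1])
print("pairs cluster bootstrap: ", se_boot[1])
```

If the hand-computed sandwich SE lines up with `statsmodels` and the hand-rolled bootstrap SE lands near it too, the gap is more likely specific to how `DuckRegression` resamples or aggregates than to bootstrap-versus-analytic differences in general. With 50 clusters the assumed correction factor changes the SE by only about 1%, so on its own it would not account for the gap reported above.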


