# Darbro

A Python library for exploring simple and multiple linear regression using matrix algebra and ordinary least squares (OLS) estimation.

Darbro is a lightweight educational library designed to help users understand the mathematical foundations of linear regression. It implements regression analysis with explicit matrix operations, providing insight into the underlying statistical calculations.
## Features

- **Simple and Multiple Linear Regression**: Supports both single and multiple predictor variables
- **Matrix-Based Calculations**: Uses NumPy for efficient matrix operations
- **Comprehensive Statistics**: Calculates coefficients, residuals, MSE, SSE, the hat matrix, and the variance-covariance matrix
- **Hypothesis Testing**: t-statistics and p-values for coefficients, and an F-test for overall model significance
- **Model Fit Metrics**: R² and adjusted R² for evaluating model performance
- **Statistical Summary**: Formatted regression results table similar to R or statsmodels output
- **Prediction**: Make predictions on new data points
- **Confidence Intervals**: Confidence intervals for the mean response and prediction intervals for individual observations
- **CSV Integration**: Easy data loading from CSV files
- **Educational Focus**: Clear implementation that exposes the underlying mathematical operations
## Installation

Ensure you have the required dependencies:

```bash
pip install numpy pandas scipy
```

Then clone or download the repository and import the Darbro class.
## Quick Start

```python
import numpy as np
from darbro import Darbro
# Load data from CSV
df = Darbro.read_csv('bodyfat.csv', 'bodyfat', ['triceps', 'thigh', 'midarm'])
# Create Darbro instance
model = Darbro(df)
# Calculate analytical information
model.calculate_analytical_information()
# View coefficients
print("Coefficients:", model.coefficients)
# Make a prediction
new_data = np.array([25.0, 50.0, 28.0])
prediction = model.predict(new_data)
print("Prediction:", prediction)Darbro(dataframe)Parameters:
- `dataframe` (pandas.DataFrame): A DataFrame where the first column is the dependent variable (y) and the remaining columns are predictor variables (X)
**Automatically calculated:**

- `coefficients`: Regression coefficients (β), including the intercept
- `X`: Design matrix with an intercept column added
- `y`: Dependent variable vector
After calling `calculate_analytical_information()`, all statistical measures and hypothesis tests are computed automatically.
**Example:**

```python
import pandas as pd
from darbro import Darbro
# Create DataFrame (first column is y, rest are X variables)
data = pd.DataFrame({
    'y': [1, 2, 3, 4, 5],
    'x1': [2, 4, 6, 8, 10],
    'x2': [1, 3, 5, 7, 9]
})
model = Darbro(data)
```

### Attributes

After calling `calculate_analytical_information()`, the following attributes are available:
**Basic statistics:**

- `coefficients` (numpy.ndarray): Regression coefficients [β₀, β₁, β₂, ...]
- `residuals` (numpy.ndarray): Residuals (y - ŷ)
- `fitted` (numpy.ndarray): Fitted values (ŷ)
- `hat_matrix` (numpy.ndarray): Hat matrix (H = X(X'X)⁻¹X')
- `mse` (float): Mean Squared Error (SSE / (n - p))
- `sse` (float): Sum of Squared Errors
- `variance_covariance` (numpy.ndarray): Variance-covariance matrix of the coefficients
**Hypothesis testing:**

- `t_statistics` (numpy.ndarray): t-statistics for each coefficient
- `p_values` (numpy.ndarray): Two-tailed p-values for each coefficient
- `standard_errors` (numpy.ndarray): Standard errors of the coefficients
- `f_statistic` (float): F-statistic for overall model significance
- `f_p_value` (float): p-value for the F-statistic
**Model fit:**

- `r_squared` (float): R² (coefficient of determination)
- `adjusted_r_squared` (float): Adjusted R² (penalized for the number of predictors)
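For reference, these follow the standard textbook definitions (assuming Darbro uses the usual convention): R² = 1 - SSE/SSTO, where SSTO = Σ(yᵢ - ȳ)², and adjusted R² = 1 - (1 - R²)(n - 1)/(n - p).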
### Methods

#### Coefficient estimation

Calculates the regression coefficients using the OLS formula β = (X'X)⁻¹X'y.

**Returns:** numpy.ndarray of coefficients

Called automatically in the constructor.
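As a sanity check, the same numbers can be reproduced with raw NumPy. A minimal sketch, assuming a fitted `model` and the `X` and `y` attributes documented above (with `y` a pandas Series, as in the hat-matrix example further below):

```python
import numpy as np

# Solve (X'X) beta = X'y directly; np.linalg.solve is numerically
# more stable than explicitly inverting X'X.
X = model.X
y = model.y.values
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta, model.coefficients))  # expected: True
```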
#### Hat matrix

Calculates the hat matrix H = X(X'X)⁻¹X'.

**Returns:** numpy.ndarray (hat matrix)

The hat matrix projects y onto the fitted values: ŷ = Hy.
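Two properties of H are worth verifying numerically: it is symmetric and idempotent (projecting twice changes nothing), and its trace equals p, the number of model parameters. A quick sketch using the documented `hat_matrix` attribute of a fitted `model`:

```python
import numpy as np

H = model.hat_matrix
print(np.allclose(H, H.T))    # symmetric
print(np.allclose(H @ H, H))  # idempotent: H @ H == H
print(np.trace(H))            # equals p, the number of columns of X
```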
#### calculate_analytical_information()

Calculates all analytical statistics, including:
- Hat matrix
- Residuals and fitted values
- Sum of Squared Errors (SSE) and Mean Squared Error (MSE)
- Variance-covariance matrix
- R² and adjusted R²
- t-statistics, p-values, and standard errors for coefficients
- F-statistic and p-value for overall model significance
Must be called before accessing statistical measures.
**Example:**

```python
model = Darbro(df)
model.calculate_analytical_information()
print("R²:", model.r_squared)
print("F-statistic:", model.f_statistic)
print("P-values:", model.p_values)Returns a formatted summary of regression results similar to R or statsmodels output.
**Returns:** str (formatted summary table)

Must be called after `calculate_analytical_information()`.
**Example:**

```python
model = Darbro(df)
model.calculate_analytical_information()
print(model.get_summary())
```

Output:

```
======================================================================
Regression Results
======================================================================
R-squared: 0.8014
Adjusted R-squared: 0.7641
F-statistic: 21.5157
Prob (F-statistic): 7.3433e-06
Mean Squared Error: 6.1503
Sum Squared Error: 98.4049
No. Observations: 20
Df Residuals: 16
Df Model: 3
======================================================================
Coefficients:
----------------------------------------------------------------------
Variable Coef Std Err t P>|t|
----------------------------------------------------------------------
Intercept 117.0847 99.7824 1.173 0.2578
triceps 4.3341 3.0155 1.437 0.1699
thigh -2.8568 2.5820 -1.106 0.2849
midarm -2.1861 1.5955 -1.370 0.1896
======================================================================
```

#### predict(X_new)

Makes a prediction for new data.
**Parameters:**

- `X_new` (numpy.ndarray): New predictor values (without intercept)
**Returns:** float (predicted value)
**Example:**

```python
# For a model with 3 predictors
prediction = model.predict(np.array([25.0, 50.0, 28.0]))
```

#### confidence_interval_mean(X_new, alpha=0.05)

Calculates a confidence interval for the mean response E[Y|X].
This interval estimates where the average value of Y is likely to fall for a given X, with a specified confidence level.
**Parameters:**

- `X_new` (numpy.ndarray): New predictor values (without intercept)
- `alpha` (float): Significance level (default=0.05 for a 95% confidence level)
**Returns:** tuple (lower_bound, point_estimate, upper_bound)

Must be called after `calculate_analytical_information()`.
**Example:**

```python
model = Darbro(df)
model.calculate_analytical_information()
# Get 95% confidence interval for mean response
lower, prediction, upper = model.confidence_interval_mean(np.array([25.0, 50.0, 28.0]))
print(f"95% CI for mean: [{lower:.2f}, {upper:.2f}]")Calculates a prediction interval for a new individual observation Y|X.
This interval estimates where a single new observation of Y is likely to fall for a given X. It is wider than the confidence interval for the mean because it accounts for both the uncertainty in estimating the mean and the random variation of individual observations.
**Parameters:**

- `X_new` (numpy.ndarray): New predictor values (without intercept)
- `alpha` (float): Significance level (default=0.05 for a 95% confidence level)
**Returns:** tuple (lower_bound, point_estimate, upper_bound)

Must be called after `calculate_analytical_information()`.
**Example:**

```python
model = Darbro(df)
model.calculate_analytical_information()
# Get 95% prediction interval for new observation
lower, prediction, upper = model.prediction_interval(np.array([25.0, 50.0, 28.0]))
print(f"95% PI for new observation: [{lower:.2f}, {upper:.2f}]")Makes a prediction with both confidence and prediction intervals in one call.
**Parameters:**

- `X_new` (numpy.ndarray): New predictor values (without intercept)
- `alpha` (float): Significance level (default=0.05 for a 95% confidence level)
**Returns:** dict containing:

- `prediction`: Point estimate
- `confidence_interval`: Tuple (lower, upper) for the mean response
- `prediction_interval`: Tuple (lower, upper) for a new observation
- `confidence_level`: Confidence level as a percentage
Must be called after `calculate_analytical_information()`.
**Example:**

```python
model = Darbro(df)
model.calculate_analytical_information()
results = model.predict_with_intervals(np.array([25.0, 50.0, 28.0]))
print(f"Prediction: {results['prediction']:.2f}")
print(f"95% CI: {results['confidence_interval']}")
print(f"95% PI: {results['prediction_interval']}")Reads a CSV file and extracts specified columns.
**Parameters:**

- `csv_path` (str): Path to the CSV file
- `y_column` (str): Name of the dependent variable column
- `predictor_columns` (list): List of predictor variable column names
**Returns:** pandas.DataFrame with y as the first column and the predictors as the remaining columns
**Example:**

```python
df = Darbro.read_csv('data.csv', 'price', ['sqft', 'bedrooms', 'age'])
```

## Examples

### Simple Linear Regression

```python
import pandas as pd
import numpy as np
from darbro import Darbro
# Create simple dataset
data = pd.DataFrame({
    'sales': [10, 15, 20, 25, 30],
    'advertising': [1, 2, 3, 4, 5]
})
# Fit model
model = Darbro(data)
model.calculate_analytical_information()
# Display results
print("Intercept (β₀):", model.coefficients[0])
print("Slope (β₁):", model.coefficients[1])
print("R²:", model.r_squared)
print("F-statistic p-value:", model.f_p_value)
# Make predictions
new_advertising = np.array([6.0])
predicted_sales = model.predict(new_advertising)
print(f"Predicted sales for advertising=6: {predicted_sales}")import numpy as np
from darbro import Darbro
# Load body fat data
df = Darbro.read_csv('bodyfat.csv', 'bodyfat', ['triceps', 'thigh', 'midarm'])
# Create model
model = Darbro(df)
model.calculate_analytical_information()
# Display comprehensive results
print("=== Regression Coefficients ===")
print(f"Intercept: {model.coefficients[0]:.3f}")
print(f"Triceps: {model.coefficients[1]:.3f}")
print(f"Thigh: {model.coefficients[2]:.3f}")
print(f"Midarm: {model.coefficients[3]:.3f}")
print("\n=== Model Statistics ===")
print(f"Sum of Squared Errors (SSE): {model.sse:.3f}")
print(f"Mean Squared Error (MSE): {model.mse:.3f}")
print(f"Root MSE: {np.sqrt(model.mse):.3f}")
print("\n=== Hypothesis Tests ===")
print(f"R²: {model.r_squared:.4f}")
print(f"Adjusted R²: {model.adjusted_r_squared:.4f}")
print(f"F-statistic: {model.f_statistic:.3f}")
print(f"F p-value: {model.f_p_value:.6f}")
print("\n=== Coefficient Tests ===")
var_names = ['Intercept', 'Triceps', 'Thigh', 'Midarm']
for i, name in enumerate(var_names):
    print(f"{name}: coef={model.coefficients[i]:.3f}, "
          f"SE={model.standard_errors[i]:.3f}, "
          f"t={model.t_statistics[i]:.3f}, "
          f"p={model.p_values[i]:.4f}")
# Make prediction for new individual
new_person = np.array([25.0, 50.0, 28.0]) # triceps, thigh, midarm
predicted_bodyfat = model.predict(new_person)
print(f"\n=== Prediction ===")
print(f"Predicted body fat: {predicted_bodyfat:.2f}%")
# Examine residuals
print("\n=== Residual Analysis ===")
print(f"Sum of residuals: {np.sum(model.residuals):.6f} (should be ~0)")
print(f"Min residual: {np.min(model.residuals):.3f}")
print(f"Max residual: {np.max(model.residuals):.3f}")import numpy as np
from darbro import Darbro
df = Darbro.read_csv('bodyfat.csv', 'bodyfat', ['triceps', 'thigh', 'midarm'])
model = Darbro(df)
model.calculate_analytical_information()
# The hat matrix "puts the hat on y" to get ŷ
print("Hat matrix shape:", model.hat_matrix.shape)
print("Hat matrix diagonal (leverage values):")
print(np.diag(model.hat_matrix))
# Verify: H * y = ŷ
fitted_via_hat = model.hat_matrix @ model.y.values
print("\nVerification - Fitted values match:")
print(np.allclose(fitted_via_hat, model.fitted))
```

### Variance-Covariance Matrix

```python
from darbro import Darbro
import numpy as np
df = Darbro.read_csv('bodyfat.csv', 'bodyfat', ['triceps', 'thigh', 'midarm'])
model = Darbro(df)
model.calculate_analytical_information()
print("Variance-Covariance Matrix:")
print(model.variance_covariance)
print("\nStandard Errors of Coefficients:")
std_errors = np.sqrt(np.diag(model.variance_covariance))
for i, se in enumerate(std_errors):
    var_name = ['Intercept', 'Triceps', 'Thigh', 'Midarm'][i] if i < 4 else f'β{i}'
    print(f"{var_name}: {se:.4f}")
print("\nCorrelation between coefficient estimates:")
corr_matrix = np.corrcoef(model.variance_covariance)
print(corr_matrix)from darbro import Darbro
import numpy as np
# Load data and fit model
df = Darbro.read_csv('bodyfat.csv', 'bodyfat', ['triceps', 'thigh', 'midarm'])
model = Darbro(df)
model.calculate_analytical_information()
# New observation to predict
new_person = np.array([25.0, 50.0, 28.0]) # triceps, thigh, midarm
# Method 1: Get both intervals at once
results = model.predict_with_intervals(new_person, alpha=0.05)
print("=== Prediction with Intervals ===")
print(f"Point Estimate: {results['prediction']:.2f}%")
print(f"95% Confidence Interval for mean: [{results['confidence_interval'][0]:.2f}, {results['confidence_interval'][1]:.2f}]")
print(f"95% Prediction Interval for individual: [{results['prediction_interval'][0]:.2f}, {results['prediction_interval'][1]:.2f}]")
# Method 2: Get intervals separately
print("\n=== Individual Method Calls ===")
# Confidence interval for the mean response
ci_lower, prediction, ci_upper = model.confidence_interval_mean(new_person)
print(f"\nConfidence Interval (Mean Response):")
print(f" We are 95% confident that the average body fat percentage")
print(f" for people with these measurements is between {ci_lower:.2f}% and {ci_upper:.2f}%")
# Prediction interval for a new observation
pi_lower, prediction, pi_upper = model.prediction_interval(new_person)
print(f"\nPrediction Interval (New Observation):")
print(f" We are 95% confident that a single person with these measurements")
print(f" will have a body fat percentage between {pi_lower:.2f}% and {pi_upper:.2f}%")
# Different confidence levels
print("\n=== Different Confidence Levels ===")
for conf_level in [0.90, 0.95, 0.99]:
    alpha = 1 - conf_level
    ci_lower, pred, ci_upper = model.confidence_interval_mean(new_person, alpha=alpha)
    pi_lower, pred, pi_upper = model.prediction_interval(new_person, alpha=alpha)
    print(f"\n{int(conf_level*100)}% Intervals:")
    print(f"  Confidence Interval: [{ci_lower:.2f}, {ci_upper:.2f}]")
    print(f"  Prediction Interval: [{pi_lower:.2f}, {pi_upper:.2f}]")
```

**Key differences:**
- Confidence Interval: Estimates the average response for a given X (narrower)
- Prediction Interval: Estimates a single new observation for a given X (wider)
- The prediction interval is always wider because it accounts for individual variation
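Under the standard OLS assumptions, both intervals use a t critical value with n - p degrees of freedom and differ only in the standard error; the extra 1 under the square root is what widens the prediction interval. A sketch of the textbook formulas, assuming a fitted `model` (after `calculate_analytical_information()`) whose `X` attribute is the documented design matrix; this reimplements the intervals by hand rather than calling Darbro's methods:

```python
import numpy as np
from scipy import stats

x_new = np.insert(np.array([25.0, 50.0, 28.0]), 0, 1.0)  # prepend the intercept term
XtX_inv = np.linalg.inv(model.X.T @ model.X)
leverage = x_new @ XtX_inv @ x_new                        # x'(X'X)⁻¹x

n, p = model.X.shape
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - p)              # two-tailed, alpha=0.05

y_hat = x_new @ model.coefficients
se_mean = np.sqrt(model.mse * leverage)        # CI: uncertainty in the estimated mean only
se_pred = np.sqrt(model.mse * (1 + leverage))  # PI: adds the variance of a new observation
print(f"95% CI: [{y_hat - t_crit*se_mean:.2f}, {y_hat + t_crit*se_mean:.2f}]")
print(f"95% PI: [{y_hat - t_crit*se_pred:.2f}, {y_hat + t_crit*se_pred:.2f}]")
```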
## Mathematical Background

The regression model is y = Xβ + ε, where:
- y is the n×1 vector of dependent variable observations
- X is the n×p design matrix (including intercept column)
- β is the p×1 vector of coefficients
- ε is the n×1 vector of errors
Key formulas:

- Coefficients: β = (X'X)⁻¹X'y
- Fitted Values: ŷ = Xβ = Hy, where H is the hat matrix
- Hat Matrix: H = X(X'X)⁻¹X'
- Residuals: e = y - ŷ
- Sum of Squared Errors: SSE = Σ(yᵢ - ŷᵢ)² = e'e
- Mean Squared Error: MSE = SSE / (n - p)
- Variance-Covariance Matrix: Var(β) = MSE × (X'X)⁻¹
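The entire chain of formulas fits in a few lines of NumPy. A self-contained sketch on synthetic data, independent of Darbro (all names below are local variables, not library API):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3  # n observations, p parameters (including the intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(scale=0.3, size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)   # β = (X'X)⁻¹X'y
H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix H = X(X'X)⁻¹X'
y_hat = H @ y                              # fitted values ŷ = Hy (equivalently Xβ)
e = y - y_hat                              # residuals
sse = e @ e                                # SSE = e'e
mse = sse / (n - p)                        # MSE
var_beta = mse * np.linalg.inv(X.T @ X)    # Var(β) = MSE × (X'X)⁻¹

print("beta:", np.round(beta, 3))          # close to [2.0, 1.5, -0.5]
print("SE(beta):", np.round(np.sqrt(np.diag(var_beta)), 3))
```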
## Use Cases

- Educational: Understanding the matrix algebra behind regression
- Statistical Learning: Exploring relationships between variables
- Prediction: Forecasting continuous outcomes
- Feature Analysis: Understanding which predictors are important
- Research: Quick regression analysis for small to medium datasets
## Limitations

- No regularization (Ridge, Lasso)
- No residual diagnostic plots
- Assumes no severe multicollinearity (X'X must be invertible)
- No automatic handling of categorical variables (see the preprocessing sketch after this list)
- Requires manual data preprocessing
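For example, a categorical column must be dummy-coded before the DataFrame is passed in. A sketch using pandas (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical raw data with a categorical 'region' column.
raw = pd.DataFrame({
    'price': [200, 340, 150, 410],
    'sqft': [900, 1600, 700, 2100],
    'region': ['north', 'south', 'north', 'west'],
})

# One-hot encode 'region', dropping one level to avoid perfect
# collinearity with the intercept that Darbro adds automatically.
df = pd.get_dummies(raw, columns=['region'], drop_first=True, dtype=float)
print(df.columns.tolist())  # ['price', 'sqft', 'region_south', 'region_west']
```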
## Data Format

The input DataFrame must have:
- First column: Dependent variable (y)
- Remaining columns: Predictor variables (X₁, X₂, ..., Xₚ)
The intercept is added automatically.
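If the dependent variable is not already the first column, a one-line reorder fixes it (a sketch with a hypothetical 'price' column as y):

```python
# Move the y column ('price', hypothetical) to the front.
df = df[['price'] + [c for c in df.columns if c != 'price']]
```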
## Contributing

This is an educational library. Suggestions for improvements are welcome!
## Roadmap

Planned features for future versions:

- Hypothesis testing (t-statistics, p-values, F-test) ✅ Implemented
- R² and adjusted R² ✅ Implemented
- Confidence intervals for predictions ✅ Implemented
- Residual diagnostics and plots
- Support for polynomial regression
- Cross-validation utilities
- Logistic regression
- Other statistical tests and methods
## License

Open source - free to use and modify for educational purposes.
## Sample Data

The included bodyfat.csv contains data on body fat percentage and body measurements (triceps, thigh, and midarm skinfold measurements) for 20 subjects, a dataset commonly used in regression textbooks.
## Further Reading

- Linear regression theory and OLS estimation
- Matrix formulation of regression analysis
- Statistical inference in linear models
**Darbro** - Explore the mathematics of linear regression!