The riskdiff package provides robust methods for calculating risk differences (also known as prevalence differences in cross-sectional studies) using generalized linear models with automatic link function selection and boundary detection.
riskdiff v0.2.1 is now published on CRAN with cutting-edge boundary detection capabilities that identify when maximum likelihood estimates lie at the edge of the parameter space - a common issue with identity link models that other packages ignore.
- π― Smart boundary detection: Automatically detects when GLMs hit parameter constraints
- π§ Robust model fitting: Tries identity, log, and logit links with graceful fallback\
- π Stratified analysis: Support for multi-level stratification
- π Publication-ready output: Formatted tables and confidence intervals
- π‘οΈ Missing data handling: Graceful handling of incomplete cases
- βοΈ Flexible confidence intervals: Robust methods for boundary cases
- π Multiple link functions: Automatic selection with boundary-aware switching
- π Transparent diagnostics: Clear reporting of model methods and boundary issues
John D. Murphy, MPH, PhD ORCID: 0000-0002-7714-9976
Install the latest stable version from CRAN:
install.packages("riskdiff")You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("jackmurphy2351/riskdiff")library(riskdiff)
# Load example data
data(cachar_sample)
# Simple risk difference with boundary detection
result <- calc_risk_diff(
data = cachar_sample,
outcome = "abnormal_screen",
exposure = "smoking"
)
#> Waiting for profiling to be done...
print(result)
#> Risk Difference Analysis Results (v0.2.1)
#> =========================================
#>
#> Confidence level: 95%
#> Number of comparisons: 1
#>
#> Exposure Risk Difference 95% CI P-value Model N Boundary
#> smoking 10.68% (5.95%, 15.75%) <0.001 identity 2500
#> Quality
#> Good# Create data that challenges standard GLM methods
set.seed(123)
challenging_data <- data.frame(
outcome = c(rep(0, 40), rep(1, 60)), # High baseline risk
exposure = factor(c(rep("No", 50), rep("Yes", 50))),
age = rnorm(100, 45, 10)
)
# riskdiff handles this gracefully with boundary detection
result <- calc_risk_diff(
data = challenging_data,
outcome = "outcome",
exposure = "exposure",
adjust_vars = "age",
verbose = TRUE # Shows diagnostic information
)
#> Warning in doTryCatch(return(expr), name, parentenv, handler): Possible
#> separation detected. Risk difference estimates may be unstable.
#> Formula: outcome ~ exposure + age
#> Sample size: 100
#> Trying identity link...
#> Using starting values: 0.2, 0.8, 0.004
#> Identity link error: cannot find valid starting values: please specify some
#> Trying log link...
#> log link error: no valid set of coefficients has been found: please supply starting values
#> Trying logit link...
#> Boundary detection results:
#> Boundary detected: TRUE
#> Boundary type: upper_boundary_near
#> Probability range: [0.17876, 1]
#> Note: Model converged but MLE is on parameter space boundary.
#> Boundary type: upper_boundary_near
#> βlogit link converged
#> β Perfect or quasi-perfect separation detected. Results may be unreliable.
#> β Boundary case detected: upper_boundary_near
#> Using robust CI method: bootstrap
print(result)
#> Risk Difference Analysis Results (v0.2.1)
#> =========================================
#>
#> Confidence level: 95%
#> Number of comparisons: 1
#> Boundary cases detected: 1 of 1
#> Boundary CI method: auto
#>
#> Exposure Risk Difference 95% CI P-value Model N
#> exposure 80.06% (0.00%, 0.00%) β logit 100
#> Boundary Quality
#> β upper_boundary_near Boundary
#>
#> β Boundary Case Details:
#> =========================
#> Row 1 ( exposure ): Type: upper_boundary_near | CI method: bootstrap
#>
#> Boundary Type Guide:
#> β’ upper_bound: Fitted probabilities near 1 (risk saturation)
#> β’ lower_bound: Fitted probabilities near 0 (very rare outcomes)
#> β’ separation: Complete/quasi-separation detected
#> β’ both_bounds: Mixed boundary issues across observations
#> β indicates robust confidence intervals were used
#>
#> Statistical Note:
#> ================
#> Standard asymptotic theory may not apply for boundary cases.
#> Confidence intervals use robust methods when boundary detected.
#> For failed analyses, consider alternative estimation approaches.
# Check if boundary cases were detected
if (any(result$on_boundary)) {
cat("\nπ¨ Boundary case detected! Using robust inference methods.\n")
cat("Boundary type:", unique(result$boundary_type[result$on_boundary]), "\n")
cat("CI method:", unique(result$ci_method[result$on_boundary]), "\n")
}
#>
#> π¨ Boundary case detected! Using robust inference methods.
#> Boundary type: upper_boundary_near
#> CI method: bootstrap# Age-adjusted risk difference with boundary detection
rd_adjusted <- calc_risk_diff(
data = cachar_sample,
outcome = "abnormal_screen",
exposure = "smoking",
adjust_vars = "age",
boundary_method = "auto" # Automatic robust method selection
)
print(rd_adjusted)
#> Risk Difference Analysis Results (v0.2.1)
#> =========================================
#>
#> Confidence level: 95%
#> Number of comparisons: 1
#>
#> Exposure Risk Difference 95% CI P-value Model N Boundary Quality
#> smoking 10.94% (0.00%, 0.00%) β logit 2500 Good# Stratified by residence with boundary detection
rd_stratified <- calc_risk_diff(
data = cachar_sample,
outcome = "abnormal_screen",
exposure = "smoking",
adjust_vars = "age",
strata = "residence"
)
#> Waiting for profiling to be done...
print(rd_stratified)
#> Risk Difference Analysis Results (v0.2.1)
#> =========================================
#>
#> Confidence level: 95%
#> Number of comparisons: 3
#>
#> Exposure Risk Difference 95% CI P-value Model N Boundary
#> smoking 11.63% (0.00%, 0.00%) β logit 2158
#> smoking 9.99% (-5.89%, 25.87%) 0.218 identity 251
#> smoking -3.86% (0.00%, 0.00%) β log 91
#> Quality
#> Good
#> Wide CI
#> Good
# Summary of boundary cases across strata
boundary_summary <- rd_stratified[rd_stratified$on_boundary,
c("residence", "boundary_type", "ci_method")]
if (nrow(boundary_summary) > 0) {
cat("\nBoundary cases by stratum:\n")
print(boundary_summary)
}# Create a simple text table with boundary information
cat(create_simple_table(rd_stratified, "Risk by Smoking Status and Residence"))
#> Risk by Smoking Status and Residence
#> ====================================================================================
#> Exposure Risk Diff 95% CI P-value Model
#> ====================================================================================
#> smoking 11.63% (0.00%, 0.00%) NA logit
#> smoking 9.99% (-5.89%, 25.87%) 0.218 identity
#> smoking -3.86% (0.00%, 0.00%) NA log
#> ====================================================================================# Create publication-ready table (requires kableExtra)
library(kableExtra)
create_rd_table(rd_stratified,
caption = "Risk of Abnormal Screening Result by Smoking Status",
include_model_type = TRUE)The package uses generalized linear models with different link functions:
- Identity link (preferred): Directly estimates risk differences
- Log link: Estimates relative risks, transforms to risk differences\
- Logit link: Estimates odds ratios, transforms to risk differences
Key Innovation: When models hit parameter space boundaries (common with identity links), the package: - π Detects boundary cases automatically -
- π‘οΈ Uses robust confidence intervals when appropriate - π Reports methodology transparently
- Upper bound: Fitted probabilities near 1 (risk saturation)
- Lower bound: Fitted probabilities near 0 (risk floor)
- Separation: Complete/quasi-separation in logistic models
- Both bounds: Multiple boundary issues detected
# Force specific boundary handling
rd_conservative <- calc_risk_diff(
cachar_sample,
"abnormal_screen",
"smoking",
boundary_method = "auto" # Options: "auto", "profile", "wald"
)
#> Waiting for profiling to be done...
# Check which methods were used
table(rd_conservative$ci_method)
#>
#> profile
#> 1# Force a specific link function
rd_logit <- calc_risk_diff(
cachar_sample,
"abnormal_screen",
"smoking",
link = "logit"
)
# Check which model was used and if boundaries detected
cat("Model used:", rd_logit$model_type, "\n")
#> Model used: logit
cat("Boundary detected:", rd_logit$on_boundary, "\n")
#> Boundary detected: FALSE# 90% confidence intervals with boundary detection
rd_90 <- calc_risk_diff(
cachar_sample,
"abnormal_screen",
"smoking",
alpha = 0.10 # 1 - 0.10 = 90% CI
)
#> Waiting for profiling to be done...
print(rd_90)
#> Risk Difference Analysis Results (v0.2.1)
#> =========================================
#>
#> Confidence level: 90%
#> Number of comparisons: 1
#>
#> Exposure Risk Difference 95% CI P-value Model N Boundary
#> smoking 10.68% (6.68%, 14.91%) <0.001 identity 2500
#> Quality
#> Good
# The package automatically uses appropriate CI methods for boundary cases# Examine the enhanced result structure
data(cachar_sample)
result <- calc_risk_diff(cachar_sample, "abnormal_screen", "smoking")
#> Waiting for profiling to be done...
names(result)
#> [1] "exposure_var" "rd" "ci_lower" "ci_upper"
#> [5] "p_value" "model_type" "on_boundary" "boundary_type"
#> [9] "boundary_warning" "ci_method" "n_obs"
# Key columns:
# - on_boundary: Was a boundary case detected?
# - boundary_type: What type of boundary?
# - boundary_warning: Detailed diagnostic message
# - ci_method: Which CI method was used?The package includes a realistic simulated cancer screening dataset:
data(cachar_sample)
str(cachar_sample)
#> 'data.frame': 2500 obs. of 12 variables:
#> $ id : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ age : int 53 25 18 28 51 25 56 20 58 18 ...
#> $ sex : Factor w/ 2 levels "male","female": 2 1 2 2 1 2 1 1 1 1 ...
#> $ residence : Factor w/ 3 levels "rural","urban",..: 3 1 1 1 1 1 1 1 1 1 ...
#> $ smoking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
#> $ tobacco_chewing : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 1 2 1 2 2 ...
#> $ areca_nut : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 1 2 1 2 2 ...
#> $ alcohol : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 2 ...
#> $ abnormal_screen : int 0 0 0 0 0 0 1 0 1 0 ...
#> $ head_neck_abnormal: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ age_group : Factor w/ 3 levels "Under 40","40-60",..: 2 1 1 1 2 1 2 1 2 1 ...
#> $ tobacco_areca_both: Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 1 2 2 ...
# Summary statistics showing realistic associations
table(cachar_sample$smoking, cachar_sample$abnormal_screen)
#>
#> 0 1
#> No 1851 317
#> Yes 248 84
# Risk difference analysis
rd_analysis <- calc_risk_diff(cachar_sample, "abnormal_screen", "smoking")
#> Waiting for profiling to be done...
cat("Smoking increases risk of abnormal screening result by",
sprintf("%.1f", rd_analysis$rd * 100), "percentage points\n")
#> Smoking increases risk of abnormal screening result by 10.7 percentage pointsRisk differences are particularly valuable when:
- Policy decisions: You need the absolute impact size
- Clinical practice: Communicating real-world effect sizes
- Common outcomes: When outcome prevalence > 10%
- Causal inference: For intervention planning
- Public health: When relative measures can mislead
| Measure | Interpretation | Best When | riskdiff Advantage |
|---|---|---|---|
| Risk Difference | Absolute change in risk | Common outcomes, policy | Boundary detection |
| Risk Ratio | Relative change in risk | Rare outcomes | Standard methods only |
| Odds Ratio | Change in odds | Case-control studies | Standard methods only |
This package implements methods based on:
- Donoghoe & Marschner (2018) - Robust GLM fitting methods for log-binomial models
- Marschner & Gillett (2012) - Boundary detection for log-binomial models
- Rothman, Greenland & Lash (2008) - Modern epidemiological methods
- Austin (2011) - Propensity score methods for causal inference
- HernΓ‘n & Robins (2020) - Causal inference methodology
- π Vignettes:
browseVignettes("riskdiff") - π Bug reports: GitHub Issues
- π‘ Feature requests: GitHub Issues
- π§ Questions: Use GitHub Discussions
- π CRAN page: https://CRAN.R-project.org/package=riskdiff
If you use this package in your research, please cite:
citation("riskdiff")- epitools: Basic epidemiological calculations (no boundary detection)
- epi: Extended epidemiological functions (no boundary detection)
- fmsb: Medical statistics and epidemiology (no boundary detection)
- Epi: Statistical analysis in epidemiology (no boundary detection)
riskdiff uniquely provides boundary detection for robust inference!
Please note that the riskdiff project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
