Getting Started with Risk Differences

 library(riskdiff)
 library(dplyr)
 #> 
 #> Attaching package: 'dplyr'
 #> The following objects are masked from 'package:stats':
 #> 
 #> filter, lag
 #> The following objects are masked from 'package:base':
 #> 
 #> intersect, setdiff, setequal, union
 library(ggplot2)

Why Risk Differences Matter in Public Health

When communicating health risks to policymakers, patients, or the public, absolute measures like risk differences are often more meaningful than relative measures like odds ratios or risk ratios.

Consider these two statements about betel nut (areca nut) chewing and cancer risk:

"Betel nut chewing increases the odds of cancer by 5.2 times"
"Betel nut chewing increases cancer risk by 12 percentage points"

The second statement is immediately actionable: in a screening group of 1,000 people, you would expect to find approximately 120 additional cancer cases among betel nut chewers compared to non-chewers. This directly informs:

Resource allocation for screening programs
Expected yield from targeted interventions
Number needed to screen calculations
Public health messaging priorities

Key Concept: Risk differences represent the additional burden of disease attributable to an exposure. They are on the same scale as the outcome, making them intuitive for non-statistical audiences.

Understanding the Example Data

The riskdiff package includes example data from a cancer screening program in Northeast India:

 data(cachar_sample)
 
 # Basic structure
 glimpse(cachar_sample)
 #> Rows: 2,500
 #> Columns: 12
 #> $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
 #> $ age <int> 53, 25, 18, 28, 51, 25, 56, 20, 58, 18, 30, 64, 54,...
 #> $ sex <fct> female, male, female, female, male, female, male, m...
 #> $ residence <fct> urban slum, rural, rural, rural, rural, rural, rura...
 #> $ smoking <fct> No, No, No, No, No, No, Yes, No, No, No, No, Yes, N...
 #> $ tobacco_chewing <fct> Yes, No, No, Yes, Yes, No, Yes, No, Yes, Yes, Yes, ...
 #> $ areca_nut <fct> Yes, Yes, Yes, Yes, No, No, Yes, No, Yes, Yes, No, ...
 #> $ alcohol <fct> No, No, No, No, No, No, No, Yes, No, Yes, No, No, N...
 #> $ abnormal_screen <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
 #> $ head_neck_abnormal <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
 #> $ age_group <fct> 40-60, Under 40, Under 40, Under 40, 40-60, Under 4...
 #> $ tobacco_areca_both <fct> Yes, No, No, Yes, No, No, Yes, No, Yes, Yes, No, Ye...
 
 # Key variables for our analysis
cachar_sample %>%
 select(abnormal_screen, areca_nut, smoking, alcohol, age, sex, residence) %>%
 summary()
 #> abnormal_screen areca_nut smoking alcohol age 
 #> Min. :0.0000 No : 778 No :2168 No :1962 Min. :18.00 
 #> 1st Qu.:0.0000 Yes:1722 Yes: 332 Yes: 538 1st Qu.:23.00 
 #> Median :0.0000 Median :32.00 
 #> Mean :0.1604 Mean :36.06 
 #> 3rd Qu.:0.0000 3rd Qu.:49.00 
 #> Max. :1.0000 Max. :84.00 
 #> sex residence 
 #> male :1880 rural :2158 
 #> female: 620 urban : 251 
 #> urban slum: 91 
 #> 
 #> 
 #>

This dataset represents a cross-sectional screening study where:

abnormal_screen = 1 indicates an abnormal cancer screening result
areca_nut indicates areca (or betel) nut chewing status
Other variables capture demographics and risk factors

Your First Risk Difference Analysis

Let’s calculate the risk difference for cancer associated with betel nut chewing:

 # Simple unadjusted analysis
rd_simple <- calc_risk_diff(
 data = cachar_sample,
 outcome = "abnormal_screen",
 exposure = "areca_nut"
)
 #> Waiting for profiling to be done...
 #> Warning: Unknown or uninitialised column: `on_boundary`.
 
 print(rd_simple)
 #> Risk Difference Analysis Results (v0.1.0) 
 #> ========================================= 
 #> 
 #> Confidence level: 95% 
 #> Number of comparisons: 1 
 #> 
 #> Exposure Risk Difference 95% CI P-value Model N
 #> areca_nut 18.25% (15.89%, 20.57%) <0.001 identity 2500

Understanding the Output

The output shows:

rd: The risk difference (e.g., 0.082 = 8.2 percentage points)
ci_lower, ci_upper: 95% confidence interval bounds
p_value: Test of whether the risk difference equals zero
model_type: Which GLM link function successfully converged
n_obs: Number of observations used in the analysis

Interpreting Risk Differences

 # Let's visualize what this risk difference means
exposure_summary <- cachar_sample %>%
 group_by(areca_nut) %>%
 summarise(
 n = n(),
 cases = sum(abnormal_screen),
 risk = mean(abnormal_screen),
 se = sqrt(risk * (1 - risk) / n)
 ) %>%
 mutate(
 risk_percent = risk * 100,
 se_percent = se * 100
 )
 
 print(exposure_summary)
 #> # A tibble: 2 ×ばつ 7
 #> areca_nut n cases risk se risk_percent se_percent
 #> <fct> <int> <int> <dbl> <dbl> <dbl> <dbl>
 #> 1 No 778 27 0.0347 0.00656 3.47 0.656
 #> 2 Yes 1722 374 0.217 0.00994 21.7 0.994
 
 # The risk difference is simply:
rd_value <- diff(exposure_summary$risk)
 cat("Risk difference:", round(rd_value * 100, 1), "percentage points\n")
 #> Risk difference: 18.2 percentage points

Adjusted Analyses

Real-world associations are often confounded. Let’s adjust for age and sex:

 # Age and sex adjusted analysis
rd_adjusted <- calc_risk_diff(
 data = cachar_sample,
 outcome = "abnormal_screen",
 exposure = "areca_nut",
 adjust_vars = c("age", "sex")
)
 #> Warning: Unknown or uninitialised column: `on_boundary`.
 print(rd_adjusted)
 #> Risk Difference Analysis Results (v0.1.0) 
 #> ========================================= 
 #> 
 #> Confidence level: 95% 
 #> Number of comparisons: 1 
 #> 
 #> Exposure Risk Difference 95% CI P-value Model N
 #> areca_nut 17.02% (2.66%, 31.38%) — log 2500
 
 # Show comparison in a readable format
 cat("\n=== COMPARISON: UNADJUSTED vs ADJUSTED ===\n")
 #> 
 #> === COMPARISON: UNADJUSTED vs ADJUSTED ===
 cat("Unadjusted Risk Difference:", sprintf("%.2f%%", rd_simple$rd * 100), 
 sprintf("(%.2f%%, %.2f%%)", rd_simple$ci_lower * 100, rd_simple$ci_upper * 100), "\n")
 #> Unadjusted Risk Difference: 18.25% (15.89%, 20.57%)
 cat("Adjusted Risk Difference: ", sprintf("%.2f%%", rd_adjusted$rd * 100), 
 sprintf("(%.2f%%, %.2f%%)", rd_adjusted$ci_lower * 100, rd_adjusted$ci_upper * 100), "\n")
 #> Adjusted Risk Difference: 17.02% (2.66%, 31.38%)
 cat("Difference in estimates: ", sprintf("%.2f%%", (rd_adjusted$rd - rd_simple$rd) * 100), "percentage points\n")
 #> Difference in estimates: -1.23% percentage points

Note: Adjustment often changes the risk difference estimate. This indicates that age and/or sex were confounders of the chewing-cancer association.

Stratified Analysis

Sometimes we want to know if effects differ across subgroups:

 # Stratified by residence (urban vs rural)
rd_stratified <- calc_risk_diff(
 data = cachar_sample,
 outcome = "abnormal_screen",
 exposure = "areca_nut",
 strata = "residence"
)
 #> Waiting for profiling to be done...
 #> Warning in .transform_to_rd_robust(model, data, exposure, alpha, n_boot = 500,
 #> : Confidence interval capped at ±50 percentage points due to extreme width.
 #> Waiting for profiling to be done...
 #> Warning: Unknown or uninitialised column: `on_boundary`.
 #> Warning in calc_risk_diff(data = cachar_sample, outcome = "abnormal_screen", :
 #> 1 of 3 estimates have very wide CIs (>50 pp). Consider larger sample sizes.
 
 print(rd_stratified)
 #> Risk Difference Analysis Results (v0.1.0) 
 #> ========================================= 
 #> 
 #> Confidence level: 95% 
 #> Number of comparisons: 3 
 #> 
 #> Exposure Risk Difference 95% CI P-value Model N
 #> areca_nut 18.89% (16.28%, 21.46%) <0.001 identity 2158
 #> areca_nut 12.50% (-37.50%, 62.50%) — log 251
 #> areca_nut 16.96% (3.07%, 30.85%) 0.017 identity 91

This shows separate risk differences for urban and rural areas, which might reflect:

Different exposure patterns
Access to healthcare
Environmental co-factors

Visualizing Your Results

A picture is worth a thousand p-values:

 
 # Stratified analysis by residence
rd_stratified <- calc_risk_diff(
 data = cachar_sample,
 outcome = "abnormal_screen",
 exposure = "areca_nut",
 strata = "residence"
)
 #> Waiting for profiling to be done...
 #> Warning in .transform_to_rd_robust(model, data, exposure, alpha, n_boot = 500,
 #> : Confidence interval capped at ±50 percentage points due to extreme width.
 #> Waiting for profiling to be done...
 #> Warning: Unknown or uninitialised column: `on_boundary`.
 #> Warning in calc_risk_diff(data = cachar_sample, outcome = "abnormal_screen", :
 #> 1 of 3 estimates have very wide CIs (>50 pp). Consider larger sample sizes.
 
 print(rd_stratified)
 #> Risk Difference Analysis Results (v0.1.0) 
 #> ========================================= 
 #> 
 #> Confidence level: 95% 
 #> Number of comparisons: 3 
 #> 
 #> Exposure Risk Difference 95% CI P-value Model N
 #> areca_nut 18.89% (16.28%, 21.46%) <0.001 identity 2158
 #> areca_nut 12.50% (-37.50%, 62.50%) — log 251
 #> areca_nut 16.96% (3.07%, 30.85%) 0.017 identity 91
 
 # Check if estimates are stable enough for visualization
rd_summary <- rd_stratified %>%
 summarise(
 n_valid = sum(!is.na(rd)),
 max_ci_width = max((ci_upper - ci_lower) * 100, na.rm = TRUE),
 .groups = "drop"
 )
 
 # Only create plot if we have reasonable estimates
 if (rd_summary$n_valid > 0 && rd_summary$max_ci_width < 200) {
 # Use the fixed forest plot function
 plot_obj <- create_forest_plot(
 rd_stratified, 
 title = "Risk Difference for Cancer by Areca Nut Chewing",
 max_ci_width = 30
 )
 
 if (!is.null(plot_obj)) {
 print(plot_obj)
 } else {
 cat(paste0(riskdiff:::.safe_warning()), "Could not create plot due to unstable estimates.\n")
 }
} else {
 cat(paste0(riskdiff:::.safe_warning()), "Estimates too unstable for meaningful visualization.\n")
 cat("Consider:\n")
 cat(paste0(riskdiff:::.safe_bullet()), "Larger sample sizes\n") 
 cat(paste0(riskdiff:::.safe_bullet()), "Different stratification variables\n")
 cat(paste0(riskdiff:::.safe_bullet()), "Pooled analysis instead of stratification\n")
 
 # Show tabular results instead using the robust summary function
 formatted_results <- create_summary_table(
 rd_stratified, 
 caption = "Risk Differences by Residence"
 )
 
 knitr::kable(formatted_results, caption = "Risk Differences by Residence")
}

Common Pitfalls and Solutions

1. Model Convergence Issues

The identity link GLM (which directly estimates risk differences) often fails to converge. The riskdiff package handles this automatically:

 # Force different link functions
rd_identity <- calc_risk_diff(
 cachar_sample, "abnormal_screen", "areca_nut", 
 link = "identity"
)
 #> Waiting for profiling to be done...
 #> Warning: Unknown or uninitialised column: `on_boundary`.
 
rd_log <- calc_risk_diff(
 cachar_sample, "abnormal_screen", "areca_nut", 
 link = "log"
)
 #> Warning: Unknown or uninitialised column: `on_boundary`.
 
 # Compare model types used
 cat("Identity link model type:", rd_identity$model_type, "\n")
 #> Identity link model type: identity
 cat("Log link model type:", rd_log$model_type, "\n")
 #> Log link model type: log

2. Very Rare Outcomes

With rare outcomes, risk differences become very small:

 # Create a rare outcome (1% prevalence)
cachar_sample$rare_outcome <- rbinom(nrow(cachar_sample), 1, 0.01)
 
rd_rare <- calc_risk_diff(
 cachar_sample, 
 "rare_outcome", 
 "areca_nut"
)
 #> Waiting for profiling to be done...
 #> Warning: Unknown or uninitialised column: `on_boundary`.
 
 print(rd_rare)
 #> Risk Difference Analysis Results (v0.1.0) 
 #> ========================================= 
 #> 
 #> Confidence level: 95% 
 #> Number of comparisons: 1 
 #> 
 #> Exposure Risk Difference 95% CI P-value Model N
 #> areca_nut 0.55% (-0.42%, 1.39%) 0.214 identity 2500

Tip: For very rare outcomes (<1%), consider whether risk ratios might be more appropriate for your research question.

3. Missing Data

The package automatically handles missing data by complete case analysis:

 # Create a copy with some missing data for demonstration
 set.seed(123) # For reproducibility
cachar_with_missing <- cachar_sample %>%
 mutate(
 # Introduce more modest missing data (~3% in age, ~2% in alcohol)
 age_with_missing = ifelse(runif(n()) < 0.03, NA, age),
 alcohol_with_missing = ifelse(runif(n()) < 0.02, NA, alcohol)
 )
 
 # Check the missing data patterns
missing_summary <- cachar_with_missing %>%
 summarise(
 total_observations = n(),
 age_missing = sum(is.na(age_with_missing)),
 alcohol_missing = sum(is.na(alcohol_with_missing)),
 total_missing_any = sum(!complete.cases(select(., age_with_missing, alcohol_with_missing, abnormal_screen, areca_nut))),
 complete_cases = sum(complete.cases(select(., age_with_missing, alcohol_with_missing, abnormal_screen, areca_nut)))
 )
 
 print(missing_summary)
 #> total_observations age_missing alcohol_missing total_missing_any
 #> 1 2500 83 57 140
 #> complete_cases
 #> 1 2360
 
 # Analysis with variables that have missing data
rd_missing <- calc_risk_diff(
 cachar_with_missing,
 "abnormal_screen",
 "areca_nut",
 adjust_vars = c("age_with_missing", "alcohol_with_missing")
)
 #> Warning: Unknown or uninitialised column: `on_boundary`.
 
 # Compare with complete case analysis
rd_complete <- calc_risk_diff(
 cachar_sample,
 "abnormal_screen", 
 "areca_nut",
 adjust_vars = c("age", "alcohol")
)
 #> Warning: Unknown or uninitialised column: `on_boundary`.
 
 cat("\n=== IMPACT OF MISSING DATA ===\n")
 #> 
 #> === IMPACT OF MISSING DATA ===
 cat("Complete data analysis (n=", rd_complete$n_obs, "): ", sprintf("%.2f%%", rd_complete$rd * 100), 
 sprintf(" (%.2f%%, %.2f%%)", rd_complete$ci_lower * 100, rd_complete$ci_upper * 100), "\n")
 #> Complete data analysis (n= 2500 ): 17.16% (2.75%, 31.57%)
 
 # Check if missing data analysis succeeded
 if (!is.na(rd_missing$rd)) {
 cat("Missing data analysis (n=", rd_missing$n_obs, "): ", sprintf("%.2f%%", rd_missing$rd * 100), 
 sprintf(" (%.2f%%, %.2f%%)", rd_missing$ci_lower * 100, rd_missing$ci_upper * 100), "\n")
 cat("Cases lost to missing data: ", rd_complete$n_obs - rd_missing$n_obs, "\n")
} else {
 cat("Missing data analysis: FAILED (insufficient data or convergence issues)\n")
 cat("Attempted to use n =", rd_missing$n_obs, "complete cases\n")
 cat("Cases lost to missing data: ", nrow(cachar_with_missing) - rd_missing$n_obs, "\n\n")
 
 cat("LESSON: This demonstrates why missing data can be problematic:\n")
 cat(paste0(riskdiff:::.safe_bullet()), "Listwise deletion can dramatically reduce sample size\n")
 cat(paste0(riskdiff:::.safe_bullet()), "Small samples may cause model convergence failures\n") 
 cat(paste0(riskdiff:::.safe_bullet()), "Consider multiple imputation for better missing data handling\n")
 cat(paste0(riskdiff:::.safe_bullet()), "The riskdiff package gracefully handles these failures\n")
}
 #> Missing data analysis: FAILED (insufficient data or convergence issues)
 #> Attempted to use n = 2500 complete cases
 #> Cases lost to missing data: 0 
 #> 
 #> LESSON: This demonstrates why missing data can be problematic:
 #> • Listwise deletion can dramatically reduce sample size
 #> • Small samples may cause model convergence failures
 #> • Consider multiple imputation for better missing data handling
 #> • The riskdiff package gracefully handles these failures

Quick Reference

Basic Syntax

 # Example usage:
result <- calc_risk_diff(
 data = cachar_sample, # Your dataset
 outcome = "abnormal_screen", # Binary outcome variable (0/1)
 exposure = "areca_nut", # Exposure of interest
 adjust_vars = c("age", "sex"), # Variables to adjust for
 strata = "residence", # Stratification variables
 link = "auto", # Link function: "auto", "identity", "log", "logit"
 alpha = 0.05, # Significance level (0.05 = 95% CI)
 verbose = FALSE # Print diagnostic messages if TRUE
)

Interpretation Guide

Risk Difference	Interpretation	Public Health Meaning
0.05 (5%)	5 percentage point increase	50 extra cases per 1,000 screened
0.01 (1%)	1 percentage point increase	10 extra cases per 1,000 screened
-0.03 (-3%)	3 percentage point decrease	30 fewer cases per 1,000 screened

When to Use Risk Differences

✅ Use risk differences when:

Communicating to non-statistical audiences
Making policy decisions about interventions
The outcome is relatively common (>5%)
You need to calculate number needed to treat/screen
Comparing absolute impact across different populations

❌ Consider alternatives when:

Outcome is very rare (<1%)
You need to compare across populations with very different baseline risks
Your research question is about biological/etiological mechanisms

Next Steps

Now that you understand the basics:

See the "Complete Example" vignette for a full analysis workflow
Check the "Technical Details" vignette for statistical methodology
Use ?calc_risk_diff for detailed function documentation

Getting Help

Issues or bugs: https://github.com/jackmurphy2351/riskdiff/issues
Function help: ?calc_risk_diff, ?format_risk_diff
All vignettes: browseVignettes("riskdiff")

This vignette is part of the riskdiff package (v0.2.0), developed to make risk difference calculations accessible to public health researchers.