9 changes: 7 additions & 2 deletions README.md
@@ -1,5 +1,10 @@
# Assignment #2 Repository

This repository includes the simulated data for Assignment #2. Fork this repository and add your analysis as described in the canvas assignment.

The csv file for `cohort` in the `raw-data` folder includes 5,000 observations with variables `smoke`, `female`, `age`, `cardiac`, and `cost`.
Based on our analysis of the cohort data in the "Assignment 2" R Markdown file in the `analyses and outputs` folder, we run a linear regression of cost on age, smoke, female, and cardiac. The coefficients on all four covariates are statistically significant. The main results are below:

Controlling for the other covariates, each additional year of age is associated with a significant increase in cost of about $16. Holding all other variables fixed (e.g., non-smokers, no cardiac event, average age), females have costs that are $253 per visit lower than males. Adjusting for age, smoking status, and gender, those who had a cardiac event cost $408 more on average than those who did not (the graph in our analysis PDF shows this for the baseline of non-smoking men at average age). Finally, smoking has the largest effect of the three binary variables on cost per visit: smokers cost $542 more on average than non-smokers, controlling for all other variables.

Further checks would be required to confirm homoskedasticity, the absence of perfect multicollinearity, linearity in parameters, and no correlation between the errors and the regressors. Under these assumptions, the OLS estimator is the best linear unbiased estimator (BLUE) of the beta coefficients on our covariates.
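
As a rough sketch of how two of these checks could be run in base R, the block below fits the same model formula to simulated stand-in data. The data frame `sim`, its coefficients, and the sample size are all made up for illustration and are not the cohort results; the heteroskedasticity statistic is Koenker's studentized form of the Breusch-Pagan test, computed by hand rather than with an add-on package.

```r
# Simulated stand-in for the cohort data (all values are arbitrary assumptions)
set.seed(1)
n <- 500
sim <- data.frame(
  age     = round(runif(n, 20, 80)),
  smoke   = rbinom(n, 1, 0.3),
  female  = rbinom(n, 1, 0.5),
  cardiac = rbinom(n, 1, 0.1)
)
sim$cost <- 9000 + 20 * sim$age + 500 * sim$smoke - 300 * sim$female +
  400 * sim$cardiac + rnorm(n, sd = 300)

fit <- lm(cost ~ cardiac + age + smoke + female, data = sim)

# Homoskedasticity: regress the squared residuals on the covariates;
# n * R^2 is approximately chi-square with 4 degrees of freedom under the null
aux <- lm(resid(fit)^2 ~ cardiac + age + smoke + female, data = sim)
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 4, lower.tail = FALSE)

# Multicollinearity: variance inflation factor for each covariate,
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing covariate j on the others
covars <- c("cardiac", "age", "smoke", "female")
vif <- sapply(covars, function(v) {
  others <- setdiff(covars, v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})

print(c(bp_stat = bp_stat, bp_pval = bp_pval))
print(vif)
```

A small `bp_pval` would point to heteroskedasticity, and VIF values far above 1 would point to collinear covariates. For linearity, `plot(fit, which = 1)` draws residuals against fitted values, which should show no systematic pattern.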

I did not use generative AI technology (e.g., ChatGPT) to complete any portion of the analysis for this assignment.
72 changes: 72 additions & 0 deletions analyses and outputs/Assignment2.Rmd
@@ -0,0 +1,72 @@
---
title: "Assignment 2"
author: "Natalia Khoudian"
date: "2025-04-23"
output: pdf_document
---

```{r}
library(tidyverse)
```


```{r}
# Path is relative to this file's folder so the document knits on any machine
cohort_data <- read.csv("../raw-data/cohort.csv")
# Convert the categorical variables to factors; 0 is the reference level for each
cohort_data$female <- factor(cohort_data$female, levels = c(0, 1))
cohort_data$cardiac <- factor(cohort_data$cardiac, levels = c(0, 1))
cohort_data$smoke <- factor(cohort_data$smoke, levels = c(0, 1))
cohort_data$age <- as.numeric(as.integer(cohort_data$age)) # store age as numeric (double) rather than integer
summary(cohort_data) #summary statistics for cohort data
```

```{r}
model_cohort <- lm(cost ~ cardiac + age + smoke + female, data = cohort_data)
summary(model_cohort)
```

```{r}
#plotting relationship between cardiac event and costs, holding fixed gender, age, and smoking
cohort_subset <- data.frame(
  cardiac = factor(c(0, 1), levels = levels(cohort_data$cardiac)),
  age = mean(cohort_data$age),
  smoke = factor(0, levels = levels(cohort_data$smoke)),
  female = factor(0, levels = levels(cohort_data$female))
)
# predicted costs of a cardiac patient vs. non cardiac patient, with average age, male, non-smoker
cohort_subset$pred_cost <- predict(model_cohort, cohort_subset)
```

```{r}
#plot
ggplot(cohort_subset, aes(x = factor(cardiac, labels = c("No", "Yes")), y = pred_cost)) +
  geom_col(fill = "gray70") +
  geom_text(aes(label = round(pred_cost, 1)), vjust = -0.5) +
  labs(
    x = "Cardiac Event",
    y = "Predicted Cost",
    title = "Predicted Cost by Cardiac Status from linear model",
    subtitle = paste0("(at age=", round(mean(cohort_data$age), 1),
                      ", nonsmoker, male)")
  ) +
  theme_minimal()
```

```{r}
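# Plotting residuals
# Keep the factor column intact and store a numeric 0/1 copy for the auxiliary regression
cohort_data$cardiac_num <- as.numeric(as.character(cohort_data$cardiac))
# Residuals of cost from the full model (cardiac included)
cost_resid <- resid(model_cohort)
# Residuals of cardiac after regressing it on the other covariates
cardiac_resid <- resid(lm(cardiac_num ~ age + smoke + female, data = cohort_data))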

df_resid <- data.frame(cost_resid, cardiac_resid)
str(df_resid)
ggplot(df_resid, aes(x = cardiac_resid, y = cost_resid)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    x = "Residual cardiac",
    y = "Residual cost",
    title = "Residuals of cost vs. cardiac variable (controlling for age, smoking status, gender)"
  ) +
  theme_minimal()
```
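
If the residual plot is meant as an added-variable (partial regression) plot, the following sketch on simulated data may help. It is an illustration under assumptions, not the cohort analysis: the data frame `sim` and its coefficients are made up. It shows that when both `cost` and `cardiac` are residualized on the remaining covariates, the slope between the two residual series equals the `cardiac` coefficient from the full model (the Frisch-Waugh-Lovell result). Residuals from the full model, by contrast, are orthogonal to the residualized `cardiac` by construction, so a fitted line through them is flat.

```r
# Simulated stand-in data (all values are arbitrary assumptions)
set.seed(2)
n <- 400
sim <- data.frame(
  age    = round(runif(n, 20, 80)),
  smoke  = rbinom(n, 1, 0.3),
  female = rbinom(n, 1, 0.5)
)
# Make cardiac events more likely with age so the covariates are correlated
sim$cardiac <- rbinom(n, 1, plogis(-3 + 0.03 * sim$age))
sim$cost <- 9000 + 20 * sim$age + 500 * sim$smoke - 300 * sim$female +
  400 * sim$cardiac + rnorm(n, sd = 300)

full <- lm(cost ~ cardiac + age + smoke + female, data = sim)

# Residualize both cost and cardiac on the other covariates (cardiac excluded)
cost_partial    <- resid(lm(cost ~ age + smoke + female, data = sim))
cardiac_partial <- resid(lm(cardiac ~ age + smoke + female, data = sim))

# Slope of residual-on-residual regression matches the full-model coefficient
fwl_slope <- coef(lm(cost_partial ~ cardiac_partial))[["cardiac_partial"]]

# Full-model residuals against residualized cardiac: slope is zero up to rounding
flat_slope <- coef(lm(resid(full) ~ cardiac_partial))[["cardiac_partial"]]

print(c(full_coef = coef(full)[["cardiac"]], fwl_slope = fwl_slope,
        flat_slope = flat_slope))
```

Plotting `cost_partial` against `cardiac_partial` would therefore show the adjusted cardiac effect as the slope of the fitted line.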
Binary file added analyses and outputs/Assignment2.pdf
Binary file not shown.