diff --git a/README.md b/README.md
index bba956e..e0eb3df 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,10 @@
 # Assignment #2 Repository
 
-This repository includes the simulated data for Assignment #2. Fork this repository and add your analysis as described in the canvas assignment.
-
 The csv file for `cohort` in the `raw-data` folder includes 5,000 observations with variables `smoke`, `female`, `age`, `cardiac`, and `cost`.
+Based on our analysis of the cohort data in the "Assignment 2" R Markdown file in the `analyses and outputs` folder, we run a linear regression of cost on age, smoke, female, and cardiac. All covariate coefficients are statistically significant. The main results are below:
+
+Controlling for the other covariates, age has a significant incremental effect on cost of ~$16 per year. Holding all other variables fixed (non-smoker, no cardiac event, average age), females have lower costs than males by $253 per visit. Adjusting for age, smoking status, and gender (baseline is non-smoking men), those who had a cardiac event cost $408 more on average than those who did not (as the graph in our analysis / pdf shows). Finally, smoking has the largest individual effect on cost of visit ($542 more on average than non-smokers), controlling for all other variables.
+
+Further checks would be required to ensure homoskedasticity, no multicollinearity, linearity in parameters, and no correlation between the errors and the regressors. These conditions would ensure that our OLS / linear model is the best linear unbiased estimator of the beta coefficients on our covariates.
+
+I did not use generative AI technology (e.g., ChatGPT) to complete any portion of the analysis for this assignment.
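Review note: the README lists homoskedasticity and multicollinearity as checks still to be done. A minimal sketch of two such checks in base R follows. It runs on simulated stand-in data, not on `cohort.csv`; the simulation coefficients, the helper name `vif_manual`, and the variable names `sim`, `fit`, `vifs`, `bp_stat`, and `bp_pval` are all assumptions made for illustration. The heteroskedasticity check uses the studentized (n times R-squared) form of the Breusch-Pagan auxiliary regression.

```r
# Illustrative only: simulated stand-in data, NOT the assignment's cohort.csv.
set.seed(1)
n <- 500
sim <- data.frame(
  age     = round(runif(n, 20, 80)),
  smoke   = rbinom(n, 1, 0.3),
  female  = rbinom(n, 1, 0.5),
  cardiac = rbinom(n, 1, 0.1)
)
# Hypothetical coefficients, chosen only to generate example data.
sim$cost <- 1000 + 15 * sim$age + 500 * sim$smoke - 250 * sim$female +
  400 * sim$cardiac + rnorm(n, sd = 300)

fit <- lm(cost ~ cardiac + age + smoke + female, data = sim)

# Multicollinearity: variance inflation factor for each regressor,
# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing x_j on the others.
vif_manual <- function(data, vars) {
  sapply(vars, function(v) {
    others <- setdiff(vars, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}
vifs <- vif_manual(sim, c("cardiac", "age", "smoke", "female"))

# Heteroskedasticity: regress squared residuals on the regressors.
# The statistic n * R2 is compared with a chi-square on 4 degrees of freedom.
sim$res2 <- resid(fit)^2
aux <- lm(res2 ~ cardiac + age + smoke + female, data = sim)
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 4, lower.tail = FALSE)

print(round(vifs, 2))
print(c(statistic = bp_stat, p.value = bp_pval))
```

VIF values close to 1 suggest little collinearity among the regressors, and a small p-value from the auxiliary regression would point toward heteroskedastic errors. On the real data the same steps would apply to `model_cohort`, with the 0/1 variables kept numeric for the auxiliary regressions.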
diff --git a/analyses and outputs/Assignment2.Rmd b/analyses and outputs/Assignment2.Rmd
new file mode 100644
index 0000000..06c8b6c
--- /dev/null
+++ b/analyses and outputs/Assignment2.Rmd
@@ -0,0 +1,72 @@
+---
+title: "Assignment 2"
+author: "Natalia Khoudian"
+date: "2025-04-23"
+output: pdf_document
+---
+
+```{r}
+library(tidyverse)
+```
+
+
+```{r}
+cohort_data <- read.csv("/Users/claudiagonzalez/Desktop/Data/raw-data/cohort.csv")
+#factor all the categorical variables, #0 reference level for all
+cohort_data$female <- factor(cohort_data$female, levels = c(0, 1))
+cohort_data$cardiac <- factor(cohort_data$cardiac, levels = c(0, 1))
+cohort_data$smoke <- factor(cohort_data$smoke, levels = c(0, 1))
+cohort_data$age <- as.numeric(as.integer(cohort_data$age)) #changing to numeric vs integer
+summary(cohort_data) #summary statistics for cohort data
+```
+
+```{r}
+model_cohort <- lm(cost ~ cardiac + age + smoke + female, data = cohort_data)
+summary(model_cohort)
+```
+
+```{r}
+#plotting relationship between cardiac event and costs, holding fixed gender, age, and smoking
+cohort_subset <- data.frame(
+  cardiac = factor(c(0, 1), levels = levels(cohort_data$cardiac)),
+  age = mean(cohort_data$age),
+  smoke = factor(0, levels = levels(cohort_data$smoke)),
+  female = factor(0, levels = levels(cohort_data$female))
+)
+# predicted costs of a cardiac patient vs. non cardiac patient, with average age, male, non-smoker
+cohort_subset$pred_cost <- predict(model_cohort, cohort_subset)
+```
+
+```{r}
+#plot
+ggplot(cohort_subset, aes(x = factor(cardiac, labels = c("No","Yes")), y = pred_cost)) +
+  geom_col(fill = "gray70") +
+  geom_text(aes(label = round(pred_cost,1)), vjust = -0.5) +
+  labs(
+    x = "Cardiac Event",
+    y = "Predicted Cost",
+    title = "Predicted Cost by Cardiac Status from linear model",
+    subtitle = paste0("(at age=", round(mean(cohort_data$age),1),
+                      ", nonsmoker, male)")
+  ) +
+  theme_minimal()
+```
+
+```{r}
+# Plotting residuals
+cohort_data$cardiac <- as.numeric(as.character(cohort_data$cardiac))
+cost_resid <- resid(model_cohort)
+cardiac_resid <- resid(lm(cardiac ~ age + smoke + female, data = cohort_data))
+
+df_resid <- data.frame(cost_resid, cardiac_resid)
+str(df_resid)
+ggplot(df_resid, aes(x = cardiac_resid, y = cost_resid)) +
+  geom_point(alpha = 0.5) +
+  geom_smooth(method = "lm", se = TRUE) +
+  labs(
+    x = "Residual cardiac",
+    y = "Residual cost",
+    title = "Residuals of cost vs. cardiac variable (controlling for age, smoking status, gender)"
+  ) +
+  theme_minimal()
+```
diff --git a/analyses and outputs/Assignment2.pdf b/analyses and outputs/Assignment2.pdf
new file mode 100644
index 0000000..a88b53f
Binary files /dev/null and b/analyses and outputs/Assignment2.pdf differ
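Review note on the final chunk: it plots the residuals of the full model (which already includes `cardiac`) against the residualized `cardiac` variable. Those two series are orthogonal by construction, so the fitted line will be flat. If the intent was an added-variable (partial regression) plot, which is an assumption about the author's goal, then cost would also need to be residualized on the controls without `cardiac`. The sketch below illustrates the difference on simulated stand-in data; the simulation coefficients and the names `sim`, `full`, `fwl_slope`, `full_slope`, and `flat_slope` are hypothetical.

```r
# Illustrative only: simulated stand-in data, NOT the assignment's cohort.csv.
set.seed(2)
n <- 500
sim <- data.frame(
  age     = round(runif(n, 20, 80)),
  smoke   = rbinom(n, 1, 0.3),
  female  = rbinom(n, 1, 0.5),
  cardiac = rbinom(n, 1, 0.1)
)
# Hypothetical coefficients, chosen only to generate example data.
sim$cost <- 1000 + 15 * sim$age + 500 * sim$smoke - 250 * sim$female +
  400 * sim$cardiac + rnorm(n, sd = 300)

full <- lm(cost ~ cardiac + age + smoke + female, data = sim)

# Added-variable construction: residualize BOTH cost and cardiac
# on the remaining controls (age, smoke, female).
cost_resid    <- resid(lm(cost ~ age + smoke + female, data = sim))
cardiac_resid <- resid(lm(cardiac ~ age + smoke + female, data = sim))

# The slope of cost_resid on cardiac_resid reproduces the cardiac
# coefficient of the full model (Frisch-Waugh-Lovell theorem).
fwl_slope  <- unname(coef(lm(cost_resid ~ cardiac_resid))["cardiac_resid"])
full_slope <- unname(coef(full)["cardiac"])

# For contrast: full-model residuals against cardiac_resid give a zero slope.
flat_slope <- unname(coef(lm(resid(full) ~ cardiac_resid))["cardiac_resid"])

print(c(fwl = fwl_slope, full = full_slope, flat = flat_slope))
```

Plotting `cost_resid` against `cardiac_resid` from this construction would give a scatter whose fitted slope matches the `cardiac` coefficient reported by `summary()`, which is what the plot title in the chunk describes.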