Note: This is not an
rpackage but just a set ofrfunctions aiming at one purpose.
R codes to select the best models fitting three-way tables.
dplyrbroomplyr
packages would be used. Install these before use this set.
- define
find_xname(): used infitted_val.R,goodness_fit.R - modify
broom::tidy.anova()function
function model_loglin()
- fit
glmfor given data - for every pair of independence model
function good_loglin()
- for fitted
model_loglin, - compute goodness-of-fit
- implement with
purrr::map()anddplyr::bind_rows()
function fit_loglin()
- for fitted
model_loglin, - compute fitted values for every pair of independence model
- implement with
purrr::map()andplyr::join_all()
Frow now on, this document will show how these functions can be applied.
- call
tidyverseandbroomlibraries, and the other functions withr/_common.R - fit with
model_loglin() - goodness-of-fit with
good_loglin() - choosing the best model based on the goodness-of-fit statistic
- compute fitted values with
fitted_val
# install.packages("httr")
library(httr)
git_api <- GET("https://api.github.com/repos/ygeunkim/loglinear3/git/trees/master?recursive=1")
stop_for_status(git_api)
repo_list <- unlist(lapply(content(git_api)$tree, "[", "path"), use.names = FALSE)
# R files in r folder -----------------------------
repo_list <- stringr::str_subset(repo_list, pattern = "^r/")
repo_list <- stringr::str_c("https://raw.githubusercontent.com/ygeunkim/loglinear3/master/", repo_list)
# source every file
sapply(repo_list, source)
#> https://raw.githubusercontent.com/ygeunkim/loglinear3/master/r/_common.R
#> value ?
#> visible FALSE
#> https://raw.githubusercontent.com/ygeunkim/loglinear3/master/r/fitted_val.R
#> value ?
#> visible FALSE
#> https://raw.githubusercontent.com/ygeunkim/loglinear3/master/r/goodness_fit.R
#> value ?
#> visible FALSE
#> https://raw.githubusercontent.com/ygeunkim/loglinear3/master/r/model_threeway.R
#> value ?
#> visible FALSEtidyverse and broom are loaded. Here we should modify tidy
function of tidymodels (2018). tidy.anova() is not defined for
anova.glm, so we should add more elements in renamers in the
function. Using the original tidy gives warning message.
renamers <- c(
...
"Ref.df" = "ref.df",
"Deviance" = "deviance", # glm object
"Resid. Df" = "resid.df", # glm object
"Resid. Dev" = "resid.dev" # glm object
)
In addition, find_xname() function would be used later to construct
the various independence models.
Data from Agresti (2012) would be used.
refers to a survey by the Wright State University School of Medicine and the United Health Services in Dayton, Ohio. The survey asked 2276 students in their final year of high school in a nonurban area near Dayton, Ohio, whether they had ever used alcohol, cigarettes, or maijuana.
(substance <-
read_delim("data/Substance_use.dat", delim = " ") %>%
mutate(alcohol = str_trim(alcohol)) %>%
mutate_if(is.character, factor) %>%
mutate_if(is.factor, fct_rev))
#> # A tibble: 8 x 4
#> alcohol cigarettes marijuana count
#> <fct> <fct> <fct> <dbl>
#> 1 yes yes yes 911
#> 2 yes yes no 538
#> 3 yes no yes 44
#> 4 yes no no 456
#> 5 no yes yes 3
#> 6 no yes no 43
#> 7 no no yes 2
#> 8 no no no 279Long data format as above is easy to fit loglinear model. Against wide
format, i.e. contingency table, try tidyr::gather().
For example, look at the below table.
(aids <- read_table("data/AIDS.dat"))
#> # A tibble: 4 x 4
#> race azt yes no
#> <chr> <chr> <dbl> <dbl>
#> 1 white yes 14 93
#> 2 white no 32 81
#> 3 black yes 11 52
#> 4 black no 12 43aids %>%
gather(yes, no, key = symptom, value = count)
#> # A tibble: 8 x 4
#> race azt symptom count
#> <chr> <chr> <chr> <dbl>
#> 1 white yes yes 14
#> 2 white no yes 32
#> 3 black yes yes 11
#> 4 black no yes 12
#> 5 white yes no 93
#> 6 white no no 81
#> 7 black yes no 52
#> 8 black no no 43Finally, we get the long data.
We can write every formula in hand. However, it is annoying.
(subs_hierarchy <-
substance %>%
do(
indep = glm(count ~ alcohol + cigarettes + marijuana, data = ., family = poisson()),
ac_m = glm(count ~ alcohol + cigarettes + marijuana + alcohol:cigarettes,
data = ., family = poisson()),
amcm = glm(count ~ alcohol + cigarettes + marijuana + alcohol:marijuana + cigarettes:marijuana,
data = ., family = poisson()),
acamcm = glm(count ~ alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana,
data = ., family = poisson()),
acm = glm(count ~ alcohol * cigarettes * marijuana,
data = ., family = poisson())
))
#> # A tibble: 1 x 5
#> indep ac_m amcm acamcm acm
#> <list> <list> <list> <list> <list>
#> 1 <glm> <glm> <glm> <glm> <glm>Function can be defined for more general usage. See r/model_threeway.R
model_loglin() fits loglinear model for every pair of independence
model.
.data: datayname: character, response variableglm_fam: random component of glm, option forfamilyofglm(). Since we are fitting loglinear model,poisson()is set to be default.
It returns tibble with list element.
(subs_hierarchy <- model_loglin(substance, yname = "count", glm_fam = poisson()))
#> # A tibble: 1 x 9
#> indep joint1 joint2 joint3 conditional1 conditional2 conditional3 homogen
#> <lis> <list> <list> <list> <list> <list> <list> <list>
#> 1 <glm> <glm> <glm> <glm> <glm> <glm> <glm> <glm>
#> # … with 1 more variable: threefac <list>Each list has glm object with [[1]].
subs_hierarchy$indep[[1]]
#>
#> Call: glm(formula = mutual, family = glm_fam, data = .)
#>
#> Coefficients:
#> (Intercept) alcoholno cigarettesno marijuanano
#> 6.292 -1.785 -0.649 0.315
#>
#> Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
#> Null Deviance: 2850
#> Residual Deviance: 1290 AIC: 1340with residual df = the number of cell count - the number of non-redundant parameters
good_loglin() calculates the above statistic for given option test.
test = "LRT" option of anova.glm() produces
.
test = Chisq
gives . As
noticed, this function is designed to be applied with
purrr::map()
and dplyr::bind_rows().
(subs_good <-
subs_hierarchy %>%
map(good_loglin, test = "LRT") %>%
bind_rows()) %>%
pander::pander()| model | df | resid.df | resid.dev |
|---|---|---|---|
| alcohol + cigarettes + marijuana | 1 | 4 | 1286 |
| alcohol + cigarettes + marijuana + cigarettes:marijuana | 1 | 3 | 534.2 |
| alcohol + cigarettes + marijuana + alcohol:marijuana | 1 | 3 | 939.6 |
| alcohol + cigarettes + marijuana + alcohol:cigarettes | 1 | 3 | 843.8 |
| alcohol + cigarettes + marijuana + alcohol:marijuana + cigarettes:marijuana | 1 | 2 | 187.8 |
| alcohol + cigarettes + marijuana + alcohol:cigarettes + cigarettes:marijuana | 1 | 2 | 92.02 |
| alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana | 1 | 2 | 497.4 |
| alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana | 1 | 1 | 0.374 |
| alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana + alcohol:cigarettes:marijuana | 1 | 0 | 0 |
From , we
compare reduced model to complex model
subs_good %>%
rename(alternative = model) %>%
group_by(resid.df) %>%
slice(which.min(resid.dev)) %>%
arrange(resid.df) %>%
mutate(
goodness = c(resid.dev[1], -diff(resid.dev)),
df_good = c(resid.df[1], -diff(resid.df))
) %>%
mutate(p_value = pchisq(goodness, df = df_good, lower.tail = FALSE)) %>%
pander::pander()| alternative | df | resid.df | resid.dev | goodness |
|---|---|---|---|---|
| alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana + alcohol:cigarettes:marijuana | 1 | 0 | 0 | 0 |
| alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana | 1 | 1 | 0.374 | 0.374 |
| alcohol + cigarettes + marijuana + alcohol:cigarettes + cigarettes:marijuana | 1 | 2 | 92.02 | 92.02 |
| alcohol + cigarettes + marijuana + cigarettes:marijuana | 1 | 3 | 534.2 | 534.2 |
| alcohol + cigarettes + marijuana | 1 | 4 | 1286 | 1286 |
Table continues below
| df_good | p_value |
|---|---|
| 0 | 1 |
| 1 | 0.541 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
(AC, AM, CM)vs saturated model(ACM): cannot rejectwith p-value
0.5408, so we choose(AC, AM, CM)(A, CM)vs(AC, AM, CM): reject
Thus, we use the model (AC, AM, CM)
fit_loglin() computes fitted values for each independent model. Use
this function with purrr::map() and plyr::join_all(). In
plyr::join_all(), every variable name including response is written.
subs_hierarchy %>%
map(fit_loglin) %>%
plyr::join_all(by = c("count", "alcohol", "cigarettes", "marijuana")) %>%
pander::pander()| count | alcohol | cigarettes | marijuana | alcohol + cigarettes + marijuana |
|---|---|---|---|---|
| 911 | yes | yes | yes | 540 |
| 538 | yes | yes | no | 740.2 |
| 44 | yes | no | yes | 282.1 |
| 456 | yes | no | no | 386.7 |
| 3 | no | yes | yes | 90.6 |
| 43 | no | yes | no | 124.2 |
| 2 | no | no | yes | 47.33 |
| 279 | no | no | no | 64.88 |
Table continues below
| alcohol + cigarettes + marijuana + cigarettes:marijuana | alcohol + cigarettes + marijuana + alcohol:marijuana |
|---|---|
| 782.7 | 627.3 |
| 497.5 | 652.9 |
| 39.39 | 327.7 |
| 629.4 | 341.1 |
| 131.3 | 3.284 |
| 83.47 | 211.5 |
| 6.609 | 1.716 |
| 105.6 | 110.5 |
Table continues below
| alcohol + cigarettes + marijuana + alcohol:cigarettes | alcohol + cigarettes + marijuana + alcohol:marijuana + cigarettes:marijuana |
|---|---|
| 611.2 | 909.2 |
| 837.8 | 438.8 |
| 210.9 | 45.76 |
| 289.1 | 555.2 |
| 19.4 | 4.76 |
| 26.6 | 142.2 |
| 118.5 | 0.24 |
| 162.5 | 179.8 |
Table continues below
| alcohol + cigarettes + marijuana + alcohol:cigarettes + cigarettes:marijuana | alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana |
|---|---|
| 885.9 | 710 |
| 563.1 | 739 |
| 29.45 | 245 |
| 470.6 | 255 |
| 28.12 | 0.703 |
| 17.88 | 45.3 |
| 16.55 | 4.297 |
| 264.4 | 276.7 |
Table continues below
| alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana | alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana + alcohol:cigarettes:marijuana |
|---|---|
| 910.4 | 911 |
| 538.6 | 538 |
| 44.62 | 44 |
| 455.4 | 456 |
| 3.617 | 3 |
| 42.38 | 43 |
| 1.383 | 2 |
| 279.6 | 279 |
Agresti, Alan. 2012. Categorical Data Analysis. 3rd ed. Wiley.
tidymodels. 2018. “Stats-Anova-tidiers.R.” https://github.com/tidymodels/broom/blob/master/R/stats-anova-tidiers.R.