book/04-normalgamma-06-MCMC.Rmd at master · StatsWithR/book · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
## Markov Chain Monte Carlo (MCMC)  {#sec:NG-MCMC}

```{r packages, echo=FALSE, warning=FALSE, message=FALSE,  eval=TRUE}
library(statsr)
library(ggplot2)
```


The Cauchy prior described in Section \@ref(sec:NG-Cauchy) is not a conjugate prior, and therefore, the posterior distribution from $(\mu \mid \sigma^2)$, is not a Cauchy or any well-known distribution. Fortunately, the conditional distribution of $(\mu, \sigma^2 \mid n_0, \data)$, is normal-gamma and easy to simulate from, as we learned in the previous sections. The conditional distribution of $(n_0 \mid \mu, \sigma^2, \data$) is a gamma distribution, also easy to simulate from the given $\mu, \sigma^2$.

It turns out that if we alternate generating Monte Carlo samples from these conditional distributions, the sequence of samples converges to samples from the joint distribution of $(\mu, \sigma^2, n_0)$, as the number of simulated values increases. The Monte Carlo algorithm we have just described is a special case of Markov chain Monte Carlo (MCMC), known as the Gibbs sampler.

Let's look at the pseudo code for the algorithm.

```{r initalize, include=FALSE, echo=FALSE, eval=FALSE}
S = 10000
sigma2 = rep(NA, S)
mu=rep(NA, S)
n_0 = rep(NA, S)
m_0 = 35
r = 1
p_mu = function(...) {}
p_sigma2 = function(...) {}
p_n_0 = function(...) {}
```

```{r MCMC-pseudo, eval=FALSE}
# initialize MCMC
sigma2[1] = 1; n_0[1]=1; mu[1]=m_0

#draw from full conditional distributions
for (i in 2:S) {
  mu[i]     = p_mu(sigma2[i-1], n_0[i-1],  m_0, r, data)
  sigma2[i] = p_sigma2(mu[i], n_0[i-1],    m_0, r, data)
  n_0[i]    = p_n_0(mu[i], sigma2[i],      m_0, r, data)
}
```

We start with the initial values of each of the parameters for $i=1$. In theory, these can be completely arbitrary, as long as they are allowed values for the parameters.

For each iteration $i$, the algorithm will cycle through generating each parameter, given the **current** value of the other parameters. The functions \texttt{p$\_$mu}, \texttt{p$\_$sigma2}, and \texttt{p$\_$n$\_$0} return a simulated value from the respective distribution conditional on the inputs.

Whenever we update a parameter, we use the **new value** in the subsequent steps as the $n$ draws for $\sigma, n_0$. We will repeat this until we reach iteration $S$, leading to a dependent sequence of s draws from the joint posterior distribution.

Incorporating the tap water example in Section \@ref(sec:normal-gamma), we will use MCMC to generate samples under the Cauchy prior. We set 35 as the location parameter and $r=1$. To complete our prior specification, we use the Jeffrey's reference prior on $\sigma^2$. This combination is referred to as the Jeffrey's Zellner-Siow Cauchy prior or "JZS" in the BayesFactor branch of the R \texttt{statsr} package.


```{r demo-tapwater, include=FALSE, echo=FALSE, message=FALSE, warning=FALSE}
library(statsr)
data(tapwater)

m_0 = 35; n_0 = 25; s2_0 = ((60-10)/4)^2; v_0 = n_0 - 1

Y = tapwater$tthm
ybar = round(mean(Y), 1); s2 = round(var(Y), 1); n = length(Y)
n_n = n_0 + n
m_n = round((n*ybar + n_0*m_0)/n_n, 1)
#v_n = v_0 + n
v_n = n
#s2_n = round(((n-1)*s2 + v_0*s2_0 + n_0*n*(m_0 - ybar)^2/n_n)/v_n, 1)
s2_n = round(((n-1)*s2  + n_0*n*(m_0 - ybar)^2/n_n)/v_n, 1)
set.seed(8675309)
nsim = 10000
phi = rgamma(nsim, v_n/2, s2_n*v_n/2)
sigma = 1/sqrt(phi)
mu = rnorm(nsim, mean=m_n, sd=sigma/(sqrt(n_n)))
y =  rnorm(nsim, mu, sigma)

# reference

phi = rgamma(nsim, (n-1)/2, s2*(n-1)/2)
sigma = 1/sqrt(phi)
post_mu = rnorm(nsim, mean=ybar, sd=sigma/(sqrt(n)))
pred_y =  rnorm(nsim,post_mu, sigma)

library(BayesFactor)
tap.JZS = bayes_inference(y=tthm, data=tapwater, statistic="mean",
                          mu_0 = 35, rscale=1, prior="JZS", nsim=nsim,
                          type="ci", method="sim")
 mu.cauchy = tap.JZS$mu
 pred_y_cauchy =  tap.JZS$samples[,1] + rnorm(nsim, 0, sqrt(tap.JZS$samples[,2]))

df = data.frame(
  parameter = c(rep("NG posterior mu", nsim),
                rep("NG posterior predictive", nsim),
                rep("JZS posterior mu", nsim), rep("JZS posterior predictive", nsim)),
  x = c(mu, y, mu.cauchy, pred_y_cauchy)
)
```

```{r tapwater-inference, fig.align="center", fig.height=4, fig.width=4}
bayes_inference(y=tthm, data=tapwater, statistic="mean",
                mu_0 = 35, rscale=1, prior="JZS",
                type="ci", method="sim")
```

Using the \texttt{bayes$\_$inference} function from the \texttt{statsr} package, we can obtain summary statistics and a plot from the MCMC output -- not only $\mu$, but also inference about $\sigma^2$ and the prior sample size.

The posterior mean under the JZS model is much closer to the sample mean than what the normal gamma prior used previously. Under the informative normal gamma prior, the sample made a 55.5, about eight standard deviations above the mean -- a surprising value under the normal prior. Under the Cauchy prior, the informative prior location has much less influence.

This is **the robustness property of the Cauchy prior**, leading the posterior to put more weight on the sample mean than the prior mean, especially when the prior location is not close to the sample mean. We can see that the central 50% interval for $n_0$ is well below the value 25 used in the normal prior, which placed almost equal weight on the prior in sample mean.

Using the MCMC draws of $\mu, \sigma$, we can obtain Monte Carlo samples from the predictive distribution of $y$, by plugging $\mu$ and $\sigma$ into the corresponding functions. Figure \@ref(fig:hist-ref-pred) compares the posterior densities estimated from the simulative values of $\mu$ and the predicted draws of TTHM under the Jeffrey Zellner-Siow prior, and the informative normal prior from $\mu$ with $n_0 = 25$ and the reference prior on $\sigma^2$.

```{r hist-ref-pred, echo=FALSE, fig.align="center", fig.align="center", fig.height=3, fig.width=6, fig.cap="Comparison of posterior densities", warning=FALSE}

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#000000", "#009E73", "#0072B2", "#D55E00", "#CC79A7")


ggplot(data=df, aes(x=pred_y)) +
  geom_density(aes(x=x, colour=parameter, linetype=parameter), show.legend =FALSE, size=1.2)+
  stat_density(aes(x=x, colour=parameter, linetype=parameter),
               geom="line",position="identity") +
  xlab("TTHM (ppb)") +
  scale_colour_manual(values=cbPalette) +
  theme(panel.background = element_rect(fill = "transparent", colour = NA),
        legend.key = element_rect(colour = "transparent", fill = NA),
        plot.background = element_rect(fill = "transparent", colour = NA),
        legend.position=c(.8, .8),
        text = element_text(size=15))
```

To recap, we have shown how to create more flexible prior distributions, such as the Cauchy distribution using mixtures of conjugate priors. As the posterior distributions are not available in closed form, we demonstrated how MCMC can be used for inference using the hierarchical prior distribution. Starting in the late 1980's, MCMC algorithms have led to an exponential rise in the use of Bayes in methods, because complex models built through hierarchical distributions suddenly were tractable. The Cauchy prior is well-known for being robust prior mis-specifications. For example, having a prior mean that is far from the observed mean. This provides an alternative to the reference prior as a default or objective distribution that is proper.

In the next sections, we will return to Bayes factors and hypothesis testing where the Cauchy prior plays an important role.