<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<title>chapter4</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
</style>
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" type="text/javascript"></script>
<!--[if lt IE 9]>
<script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
<![endif]-->
</head>
<body>
<h1 id="elementary-hypothesis-testing">Elementary hypothesis testing</h1>
<p><span class="math inline">\(\leftarrow\)</span> <a href="./index.html">Back to Chapters</a></p>
<h2 id="exercise-4.1">Exercise 4.1</h2>
<p>Given <span class="math display">\[P(D_1 ... D_m|H_i X)=\prod_j P(D_j|H_iX)\]</span></p>
<p>and</p>
<p><span class="math display">\[P(D_1 ... D_m|\overline{H}_i X)=\prod_j P(D_j|\overline{H}_iX)\]</span></p>
<p>for <span class="math inline">\(1\leq i\leq n\)</span> and <span class="math inline">\(n>2\)</span> show that for any fixed <span class="math inline">\(i\)</span> at most one of</p>
<p><span class="math display">\[\frac{P(D_j|H_iX)}{P(D_j|\overline{H}_iX)}\]</span></p>
<p>is not equal to <span class="math inline">\(1.\)</span></p>
<p>Proof:</p>
<p>Firstly, we claim that the case of <span class="math inline">\(2\)</span> pieces of data implies the general result.</p>
<p>Indeed, independence assumptions of all <span class="math inline">\(D_j\)</span> together imply analogous pairwise independence for any pair <span class="math inline">\(D_k\)</span> and <span class="math inline">\(D_l\)</span>. So, assuming the case with two data pieces is solved, we have, for a fixed <span class="math inline">\(i\)</span> and for any pair of <span class="math inline">\(k, l\)</span>, either <span class="math inline">\(\frac{P(D_k|H_iX)}{P(D_k|\overline{H}_iX)}=1\)</span>, or <span class="math inline">\(\frac{P(D_l|H_iX)}{P(D_l|\overline{H}_iX)}=1\)</span> (or both), so of the whole set of <span class="math inline">\(\frac{P(D_j|H_iX)}{P(D_j|\overline{H}_iX)}\)</span> at most one is not equal to <span class="math inline">\(1\)</span>, as wanted.</p>
<p>So it is enough to solve the case of only two pieces of data, <span class="math inline">\(D_1\)</span> and <span class="math inline">\(D_2\)</span>.</p>
<p>We will denote probability density/mass function of <span class="math inline">\(D_1\)</span> under hypothesis <span class="math inline">\(H_iX\)</span> by <span class="math inline">\(V_i\)</span> and that of <span class="math inline">\(D_2\)</span> by <span class="math inline">\(U_i\)</span>. We will also denote probability density/mass function of <span class="math inline">\(D_1\)</span> under hypothesis <span class="math inline">\(\overline{H}_iX\)</span> by <span class="math inline">\(V^c_i\)</span> and that of <span class="math inline">\(D_2\)</span> by <span class="math inline">\(U^c_i\)</span>, though we will not use these until the very end.</p>
<p>Remark 1: The proof actually works for arbitrary (not necessarily discrete or continuous) real-valued random variables; one just has to say that the <span class="math inline">\(V\)</span>s and <span class="math inline">\(U\)</span>s are CDFs instead of PMFs/PDFs. The reason is, firstly, that independence of two random variables can be equivalently written as the joint CDF being a product or as the joint PMF/PDF being a product, and, secondly, that equality of CDFs is equivalent to equality of PMFs/PDFs. We work with PDFs/PMFs out of an aesthetic choice, rather than for any other reason.</p>
<p>We will denote the value of <span class="math inline">\(V_i\)</span> at some <span class="math inline">\(D_1=x_1\)</span> by <span class="math inline">\(v_{i1}\)</span> (and at <span class="math inline">\(D_1=x_2\)</span> by <span class="math inline">\(v_{i2}\)</span>). Similarly, the value of <span class="math inline">\(U_i\)</span> at <span class="math inline">\(D_2=y_1\)</span> will be denoted by <span class="math inline">\(u_{i1}\)</span> (and at <span class="math inline">\(D_2=y_2\)</span> by <span class="math inline">\(u_{i2}\)</span>).</p>
<p>We will use independence of <span class="math inline">\(D_1\)</span> and <span class="math inline">\(D_2\)</span> conditional on various hypotheses to prove independence under other hypotheses. Note that independence always means <span class="math inline">\(P(D_1=x, D_2=y|H)=P(D_1=x|H) P(D_2=y|H)\)</span> and can be checked by checking this for (arbitrary) specific values.</p>
<p>With this in mind, we pick any pair of possible value pairs <span class="math inline">\(D_1=x_1, D_2=y_1\)</span> and <span class="math inline">\(D_1=x_2, D_2=y_2\)</span>, fixed from now on, and form vectors <span class="math inline">\(v_i=\begin{bmatrix}v_{i1} \\ v_{i2}\end{bmatrix}\)</span> and <span class="math inline">\(u_i=\begin{bmatrix}u_{i1} \\u_{i2}\end{bmatrix}\)</span> <!---$P(D_1=x_1|H_iX)=v^i_1$ and $P(D_1=x_2|H_iX)=v^i_2$, $P(D_2=y_1|H_iX)=u^i_1$ and $P(D_2=y_2|H_iX)=u^i_2$
---></p>
<p>Then independence of <span class="math inline">\(D_1\)</span> and <span class="math inline">\(D_2\)</span> (conditional on <span class="math inline">\(H_iX\)</span>) says that the joint distribution of <span class="math inline">\(P(D_1=x, D_2=y|H_i X)\)</span> is a product of distributions of <span class="math inline">\(D_1\)</span> and <span class="math inline">\(D_2\)</span> and is given (at <span class="math inline">\(x= x_1, x_2\)</span> and <span class="math inline">\(y=y_1, y_2\)</span>) by the matrix</p>
<p><span class="math display">\[M_i=v_i u_i^T=\begin{bmatrix}v_{i1} \\ v_{i2}\end{bmatrix} \begin{bmatrix}u_{i1} & u_{i2}\end{bmatrix}=\]</span></p>
<p><span class="math display">\[=\begin{bmatrix}v_{i1}u_{i1}& v_{i1}u_{i2} \\v_{i2}u_{i1} & v_{i2}u_{i2}\end{bmatrix}\]</span></p>
<p>It follows from this that the (joint) probability matrix of <span class="math inline">\(D_1D_2\)</span> (again, at <span class="math inline">\(x= x_1, x_2\)</span> and <span class="math inline">\(y=y_1, y_2\)</span>) conditional on <span class="math inline">\(\overline{H_i}X\)</span> is obtained by taking all the matrices <span class="math inline">\(M_j\)</span> of <span class="math inline">\(H_j\)</span> with <span class="math inline">\(j\neq i\)</span>, weighting them by the (prior) probabilities <span class="math inline">\(h_j\)</span> of <span class="math inline">\(H_j\)</span>, and adding them (and then dividing by the sum of the weights, but this is an overall normalizing factor which will not be important for us). That is, the matrix is proportional to</p>
<p><span class="math display">\[\sum_{j\neq i} h_j M_j=\sum_{j\neq i} h_j v_j u_j^T=\]</span></p>
<p><span class="math display">\[\begin{bmatrix}\sum_{j\neq i} h_j v_{j1}u_{j1}& \sum_{j\neq i} h_j v_{j1}u_{j2} \\\sum_{j\neq i} h_j v_{j2} u_{j1} & \sum_{j\neq i} h_j v_{j2} u_{j2}\end{bmatrix}\]</span></p>
<p>Now the assumption that <span class="math inline">\(D_1\)</span> and <span class="math inline">\(D_2\)</span> are independent conditional on <span class="math inline">\(\overline{H}_i\)</span> means this matrix is also a product of “marginal” matrices of <span class="math inline">\(D_1| \overline{H}_i X\)</span> and <span class="math inline">\(D_2|\overline{H}_i X\)</span>, i.e. is of rank at most 1. This means that it has determinant 0.</p>
<h3 id="three-hypothesis.">Three hypotheses</h3>
<p>Let’s start with the case of only 3 hypotheses <span class="math inline">\(H_1, H_2,H_3\)</span>. Start with <span class="math inline">\(i=3\)</span>.</p>
<h5 id="lemma-a-sum-m-of-two-rank-1-matrices-mh_1-v_1-ut_1-h_2-v_2-ut_2-can-only-be-rank-1-if-either-v_1-and-v_2-are-linearly-dependent-or-u_1-and-u_2-are-linearly-dependent.">Lemma: A sum <span class="math inline">\(M\)</span> of two rank 1 matrices <span class="math inline">\(M=h_1 v_1 u^T_1 +h_2 v_2 u^T_2\)</span> can only be rank 1 if either <span class="math inline">\(v_1\)</span> and <span class="math inline">\(v_2\)</span> are linearly dependent or <span class="math inline">\(u_1\)</span> and <span class="math inline">\(u_2\)</span> are linearly dependent.</h5>
<p>A “conceptual” proof is as follows: Suppose <span class="math inline">\(u_1\)</span> and <span class="math inline">\(u_2\)</span> are linearly independent. Then the image of <span class="math inline">\(M\)</span> is spanned by the vectors <span class="math inline">\(M(u_1)\)</span> and <span class="math inline">\(M (u_2)\)</span>; but these are both linear combinations of <span class="math inline">\(h_1 v_1\)</span> and <span class="math inline">\(h_2 v_2\)</span>. So, if the matrix of coefficients <span class="math inline">\(G=\begin{pmatrix} u_1^T u_1& u_1^T u_2\\u_2^T u_1& u_2^T u_2\end{pmatrix}\)</span> is invertible, then the image of <span class="math inline">\(M\)</span> is the span of <span class="math inline">\(v_1\)</span> and <span class="math inline">\(v_2\)</span>. But <span class="math inline">\(G\)</span> is the Gramian of <span class="math inline">\(u_1, u_2\)</span> and is invertible precisely when <span class="math inline">\(u_1\)</span> and <span class="math inline">\(u_2\)</span> are linearly independent (its determinant is the square of the area of the parallelogram spanned by <span class="math inline">\(u_1\)</span> and <span class="math inline">\(u_2,\)</span> as you can easily verify). So, for independent <span class="math inline">\(u\)</span>s, the rank of <span class="math inline">\(M\)</span> is the dimension of the span of <span class="math inline">\(v_1, v_2\)</span>, and if <span class="math inline">\(v_1, v_2\)</span> were independent, it would be 2. So if the rank of <span class="math inline">\(M\)</span> is below 2, then either the <span class="math inline">\(u\)</span>s or the <span class="math inline">\(v\)</span>s are dependent, as wanted.</p>
<p>Remark 2: Those familiar with tensors may realize that we use metric in which <span class="math inline">\(u_1\)</span> and <span class="math inline">\(u_2\)</span> are orthonormal (that’s the inverse of <span class="math inline">\(G\)</span>) to “raise and index” and go from a bilinear form encoded by <span class="math inline">\(M\)</span> to a linear map, whose range is then the span of <span class="math inline">\(v\)</span>s.</p>
<p>Remark 3: Alternatively, for those who don’t like linear algebra, a computational proof of the Lemma is as follows: A (non-zero) 2 by 2 matrix has rank one exactly when its determinant is zero. Writing this out in our case we get:</p>
<p><span class="math display">\[[h_1 v_{11} u_{11}+h_2 v_{21} u_{21}][h_1 v_{12} u_{12}+h_2 v_{22} u_{22}]=\]</span></p>
<p><span class="math display">\[[h_1 v_{11} u_{12}+h_2 v_{21} u_{22}][h_1 v_{12} u_{11}+h_2 v_{22} u_{21}]\]</span></p>
<p>Additively canceling <span class="math inline">\(h_1^2 v_{11}v_{12}u_{11}u_{12}\)</span> and <span class="math inline">\(h_2^2 v_{21}v_{22}u_{21}u_{22}\)</span> and then dividing by <span class="math inline">\(h_1h_2\)</span> we have</p>
<p><span class="math display">\[ v_{11} u_{11} v_{22} u_{22}+ v_{21} u_{21}v_{12} u_{12}= \]</span></p>
<p><span class="math display">\[ v_{11} u_{12} v_{22} u_{21}+ v_{21} u_{22}v_{12} u_{11} \]</span></p>
<p>or <span class="math display">\[(v_{11}v_{22}-v_{12}v_{21})(u_{11}u_{22}-u_{12}u_{21})=0,\]</span></p>
<p>so either <span class="math inline">\(v_1\)</span> and <span class="math inline">\(v_2\)</span> are linearly dependent, or <span class="math inline">\(u_1\)</span> and <span class="math inline">\(u_2\)</span> are.</p>
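<p>As a quick numerical sanity check of this factorization (a sketch with arbitrary numbers, not part of the proof), one can verify in Python that the determinant of a weighted sum of two rank-1 matrices factors exactly as above:</p>
<pre><code>import math

# determinant of a 2x2 matrix given as nested lists
def det2(m):
    return m[0][0]*m[1][1] - m[0][1]*m[1][0]

# rank-1 outer product v u^T
def outer(v, u):
    return [[v[0]*u[0], v[0]*u[1]], [v[1]*u[0], v[1]*u[1]]]

# arbitrary test vectors and positive weights
v1, u1 = (1.0, 2.0), (3.0, 4.0)
v2, u2 = (5.0, 7.0), (2.0, 9.0)
h1, h2 = 0.3, 0.7

A, B = outer(v1, u1), outer(v2, u2)
M = [[h1*A[i][j] + h2*B[i][j] for j in range(2)] for i in range(2)]

# det(h1 v1 u1^T + h2 v2 u2^T) = h1 h2 (v11 v22 - v12 v21)(u11 u22 - u12 u21)
lhs = det2(M)
rhs = h1*h2*(v1[0]*v2[1] - v1[1]*v2[0])*(u1[0]*u2[1] - u1[1]*u2[0])
assert math.isclose(lhs, rhs)
</code></pre>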
<p>Continuing with the case of three hypotheses, recall <span class="math inline">\(v_i\)</span> and <span class="math inline">\(u_i\)</span> were likelihood/probability vectors of <span class="math inline">\(D_1\)</span> taking values <span class="math inline">\(x_1, x_2\)</span> and <span class="math inline">\(D_2\)</span> taking values <span class="math inline">\(y_1, y_2\)</span> (under hypothesis <span class="math inline">\(H_iX\)</span>).</p>
<p>Observe that if a pair of non-zero functions <span class="math inline">\(V_1\)</span> and <span class="math inline">\(V_2\)</span> is such that <span class="math inline">\((V_1(x_1), V_1(x_2))\)</span> is always proportional to <span class="math inline">\((V_2(x_1), V_2(x_2))\)</span>, then they are “globally” proportional, meaning <span class="math inline">\(V_1=kV_2\)</span> (take any <span class="math inline">\(x\)</span> with <span class="math inline">\(V_2(x)\neq 0\)</span> and set <span class="math inline">\(k= V_1(x)/V_2(x)\)</span>).</p>
<p>If the distributions of <span class="math inline">\(D_1\)</span> under <span class="math inline">\(H_1X\)</span> and <span class="math inline">\(H_2X\)</span> are different, then they are also not proportional. By the previous paragraph, this implies that there will be two values <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span> giving non-proportional <span class="math inline">\(v_1\)</span> and <span class="math inline">\(v_2\)</span>. Then, by the Lemma, for an arbitrary pair of values <span class="math inline">\(y_1, y_2\)</span> of <span class="math inline">\(D_2\)</span> the corresponding vectors <span class="math inline">\(u_1\)</span> and <span class="math inline">\(u_2\)</span> are proportional, so, again by the previous paragraph, the whole probability mass/density functions <span class="math inline">\(U_1\)</span> and <span class="math inline">\(U_2\)</span> of <span class="math inline">\(D_2\)</span> under <span class="math inline">\(H_1X\)</span> and <span class="math inline">\(H_2X\)</span> are proportional, ergo equal.</p>
<p>So either <span class="math inline">\(V_1=V_2\)</span> or <span class="math inline">\(U_1=U_2\)</span>.</p>
<p>Now, in the same way as we just did for <span class="math inline">\(i=3\)</span>, from <span class="math inline">\(i=1\)</span> and <span class="math inline">\(i=2\)</span> we get that (either <span class="math inline">\(V_2=V_3\)</span> or <span class="math inline">\(U_2=U_3\)</span>) and (either <span class="math inline">\(V_1=V_3\)</span> or <span class="math inline">\(U_1=U_3\)</span>). Since we have 3 equalities and only two types of distributions (<span class="math inline">\(U\)</span> and <span class="math inline">\(V\)</span>), either the <span class="math inline">\(V\)</span>s are equal twice, and <span class="math inline">\(V_1=V_2=V_3\)</span>, or the <span class="math inline">\(U\)</span>s are (and <span class="math inline">\(U_1=U_2=U_3\)</span>). Correspondingly, either <span class="math inline">\(D_1\)</span> has the same distribution under all 3 hypotheses, and then the <span class="math inline">\(\frac{P(D_1|H_iX)}{P(D_1|\overline{H}_iX)}\)</span> are all equal to <span class="math inline">\(1\)</span>, or <span class="math inline">\(D_2\)</span> does (and then all <span class="math inline">\(\frac{P(D_2|H_iX)}{P(D_2|\overline{H}_iX)}\)</span> are equal to <span class="math inline">\(1\)</span>). In either case, we get what we want.</p>
<p>This completes the case of 3 hypotheses.</p>
<h3 id="the-general-case.">The general case.</h3>
<p>To get the extension to more than <span class="math inline">\(3\)</span> hypotheses we use the following approach. As we mentioned before, a 2 by 2 matrix is of rank at most 1 if its determinant is zero. So we need an efficient way of telling when the determinant of a 2 by 2 matrix is zero.</p>
<p>Remark 4: More generally, a matrix is of rank at most 1 if all 2 by 2 minors have determinant zero, i.e. all <span class="math inline">\(M^{\wedge 2}_{(i,j), (k,l)}=M_{ik}M_{jl}-M_{il}M_{jk}\)</span> are zero. In tensor analysis, these are the entries of the second exterior power <span class="math inline">\(M^{\wedge 2}\)</span> of <span class="math inline">\(M\)</span>. When the dimension is <span class="math inline">\(2\)</span> there is only one minor, and <span class="math inline">\(M^{\wedge 2}\)</span> is a scalar, equal to <span class="math inline">\(\det M\)</span>. So in dimension 2 we can formulate everything that follows in terms of determinants.</p>
<p>We will use the following property of 2D determinants. If <span class="math inline">\(M\)</span> and <span class="math inline">\(N\)</span> are 2 by 2 matrices then</p>
<p><span class="math display">\[D(M, N) :=\frac{1}{2}(\det (M+N)-\det M -\det N)\]</span></p>
<p>is symmetric and bilinear in <span class="math inline">\(M, N\)</span>. This means</p>
<p><span class="math display">\[D(M, N)=D(N, M)\]</span></p>
<p>and</p>
<p><span class="math display">\[D(M_1+M_2, N)=D(M_1, N)+D(M_2, N)\]</span></p>
<p>(and hence the same for second variable). Indeed, one computes</p>
<p><span class="math inline">\(D(M, N)=\frac{1}{2}(M_{11}N_{22}+M_{22}N_{11}-M_{12}N_{21}-M_{21}N_{12})\)</span></p>
<p>and the resulting formula is linear in <span class="math inline">\(M\)</span> and in <span class="math inline">\(N\)</span>, i.e. bilinear.</p>
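<p>The following sketch (arbitrary matrices, nothing specific to the problem) checks numerically that this <span class="math inline">\(D\)</span> agrees with its defining formula and is symmetric and bilinear:</p>
<pre><code>import math

def det2(m):
    return m[0][0]*m[1][1] - m[0][1]*m[1][0]

# D(M, N), written out via the explicit formula above
def D(M, N):
    return 0.5*(M[0][0]*N[1][1] + M[1][1]*N[0][0]
                - M[0][1]*N[1][0] - M[1][0]*N[0][1])

def add(M, N):
    return [[M[i][j] + N[i][j] for j in range(2)] for i in range(2)]

M1 = [[1.0, 2.0], [3.0, 4.0]]
M2 = [[5.0, -1.0], [2.0, 6.0]]
N  = [[0.5, 7.0], [8.0, -3.0]]

# D agrees with (det(M+N) - det M - det N)/2
assert math.isclose(D(M1, N), 0.5*(det2(add(M1, N)) - det2(M1) - det2(N)))
# symmetry, additivity in the first slot, and D(M, M) = det M
assert math.isclose(D(M1, N), D(N, M1))
assert math.isclose(D(add(M1, M2), N), D(M1, N) + D(M2, N))
assert math.isclose(D(M1, M1), det2(M1))
</code></pre>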
<p>Observe that <span class="math inline">\(D(M, M)=\det M\)</span>. We then have, by induction on the number of summands,</p>
<p><span class="math inline">\(\det (\sum_i M_i)=D(\sum_j M_j, \sum_k M_k)=\sum_{j, k} D(M_j, M_k)\)</span></p>
<p>Remark 5: We are basically observing that the determinant is a quadratic polynomial as a function of the 2 by 2 matrix, so it has a symmetric bilinear extension.</p>
<p>Remark 6: In (possibly) higher dimensions, and using tensor language, we are saying that taking the second exterior power, which is quadratic in the matrix input, is a restriction of a symmetric bilinear operation (on two inputs), <span class="math inline">\((M\wedge N)(\vec{a}\wedge\vec{b})=\frac{1}{2}[(M\vec{a})\wedge (N\vec{b})+(N\vec{a})\wedge (M\vec{b})]\)</span></p>
<p><!--- compare https://math.stackexchange.com/questions/1604461/action-of-a-matrix-on-the-exterior-algebra-----></p>
<!---Remrk 5: This is a little strange, but perhaps the following will fact can make slightly more palatable: if matrix $M$ a matrix encoding a PMF of a pair of binary random variables $X_1$ and $X_2$, and we assign value 0 to first outcome and 1 to the second one (or, equivalently, -1/2 to first and 1/2 to the second), then $\det M$ is the covariance $Cov(X_1, X_2)$
Now consider the sum $X_1,X_2$ and $Y_1, Y_2$ independent. Form $X_1+Y_1, X_2Y_2$
----->
<p>Remark 7: We also have <span class="math inline">\(D(\lambda M, N)=\lambda D(M, N)\)</span>, as usual in bilinearity, but we don’t need this.</p>
<p>Now we can apply this to our problem. Let <span class="math inline">\(M_i=h_i v_i u_i^T\)</span> and <span class="math inline">\(N_i=\sum_{j\neq i} M_j\)</span>, and <span class="math inline">\(M=M_i+N_i=\sum_j M_j\)</span>.</p>
<p>Our assumptions are that all <span class="math inline">\(M_i\)</span> and <span class="math inline">\(N_j\)</span> are of rank at most 1 (i.e. have zero determinant). We now show that <span class="math inline">\(M\)</span> has rank at most <span class="math inline">\(1\)</span> (i.e. has zero determinant).</p>
<p>To that end we write</p>
<p><span class="math display">\[\det M=\sum_{j, k} D(M_j, M_k)\]</span></p>
<p>We want to see that this is zero. We know</p>
<p><span class="math display">\[0=\det (N_i)=\sum_{j\neq i, k\neq i} D(M_j, M_k)\]</span></p>
<p>Summing over <span class="math inline">\(i\)</span> we get (taking note that each <span class="math inline">\(D(M_l, M_l)\)</span> will appear <span class="math inline">\(n-1\)</span> times, while those <span class="math inline">\(D(M_j, M_k)\)</span> with <span class="math inline">\(j\neq k\)</span> will appear only <span class="math inline">\(n-2\)</span> times):</p>
<p><span class="math display">\[\sum_l D(M_l, M_l) + (n-2)\sum_{j,k} D(M_j, M_k)=0\]</span></p>
<p>So, since <span class="math inline">\(D(M_l, M_l)=0\)</span>, as long as <span class="math inline">\(n\neq 2\)</span> we have what we want.</p>
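<p>The counting in this argument can be spot-checked numerically: for arbitrary rank-1 matrices <span class="math inline">\(M_i\)</span> (so that every <span class="math inline">\(\det M_i=0\)</span>, with no assumption on the <span class="math inline">\(N_i\)</span>), the same expansion gives <span class="math inline">\(\sum_i \det N_i=(n-2)\det M\)</span>. A sketch with random rank-1 matrices:</p>
<pre><code>import math
import random

def det2(m):
    return m[0][0]*m[1][1] - m[0][1]*m[1][0]

def outer(v, u):
    return [[v[0]*u[0], v[0]*u[1]], [v[1]*u[0], v[1]*u[1]]]

def madd(ms):
    return [[sum(m[i][j] for m in ms) for j in range(2)] for i in range(2)]

random.seed(0)
n = 5
# random rank-1 matrices, so det(M_i) = 0 for each i
Ms = [outer((random.random(), random.random()),
            (random.random(), random.random())) for _ in range(n)]

M = madd(Ms)                                 # M = sum of all M_j
# N_i = sum of all M_j with j != i
Ns = [madd([m for j, m in enumerate(Ms) if j != i]) for i in range(n)]

# sum_i det(N_i) = (n - 2) * det(M), since every det(M_i) vanishes
assert math.isclose(sum(det2(N) for N in Ns), (n - 2)*det2(M), abs_tol=1e-9)
</code></pre>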
<p>This gives <span class="math inline">\(M=vu^T\)</span>. Going back to <span class="math inline">\(M=M_i+N_i\)</span> we again see two rank one matrices add up to a rank one matrix. We conclude, just as in the case of 3 hypotheses, that for each specific <span class="math inline">\(i\)</span>, either <span class="math inline">\(V^c_i=V_i\)</span> and hence <span class="math inline">\(\frac{P(D_1|H_iX)}{P(D_1|\overline{H}_iX)}=1\)</span></p>
<p>OR</p>
<p><span class="math inline">\(U^c_i=U_i\)</span> and hence <span class="math inline">\(\frac{P(D_2|H_iX)}{P(D_2|\overline{H}_iX)}=1\)</span>. This is exactly what we wanted to prove.</p>
<h2 id="some-remarks-on-sequential-vs.-batch-updates.">Some remarks on sequential vs. batch updates.</h2>
<p>(See formula 4.11; the discussion is in the context of section 4.4 and formula 4.44.)</p>
<p>We are considering a batch of widgets from a single machine. We have three hypotheses: <span class="math inline">\(A\)</span> – the machine has failure rate <span class="math inline">\(\frac{1}{3}\)</span> (prior probability <span class="math inline">\(\frac{1}{11}\)</span>), <span class="math inline">\(B\)</span> – the machine has failure rate <span class="math inline">\(\frac{1}{6}\)</span> (prior probability <span class="math inline">\(\frac{10}{11}\)</span>), and <span class="math inline">\(C\)</span> – the machine has failure rate <span class="math inline">\(\frac{99}{100}\)</span> (prior probability <span class="math inline">\(\frac{1}{10^6}\)</span>).</p>
<p>Prior odds ratio is then (to a very good approximation) <span class="math inline">\(\frac{10^6}{11} : \frac{10^7}{11} : 1\)</span>.</p>
<p>The update rule for arbitrary data tells us how to compute the posterior odds <span class="math inline">\(A\)</span> vs <span class="math inline">\(B\)</span> vs <span class="math inline">\(C\)</span>: one takes the prior odds and multiplies each number by the likelihood of the data under the corresponding hypothesis. For data of <span class="math inline">\(m\)</span> bad widgets the likelihoods are <span class="math inline">\(\frac{1}{3^m},\frac{1}{6^m},\frac{99^m}{100^m}\)</span>. This gives posterior odds:</p>
<p><span class="math display">\[\frac{10^6}{11} \frac{1}{3^m} : \frac{10^7}{11} \frac{1}{6^m}:\frac{1}{10^6}\frac{99^m}{100^m}.\]</span></p>
<p>Note that since the individual draws are independent <strong>conditional on one specific hypothesis</strong>, the likelihoods for <span class="math inline">\(m\)</span> widgets are each a product of individual likelihoods of draws, and we see very clearly that the same posterior odds will be obtained by updating on all the data of <span class="math inline">\(m\)</span> bad widgets, or by splitting <span class="math inline">\(m=m_1+m_2\)</span> and updating first to odds ratio</p>
<p><span class="math display">\[\frac{10^6}{11} \frac{1}{3^{m_1}} : \frac{10^7}{11} \frac{1}{6^{m_1}}:\frac{1}{10^6}\frac{99^{m_1}}{100^{m_1}},\]</span></p>
<p>using that as new prior odds and then updating to the final posterior odds</p>
<p><span class="math display">\[\frac{10^6}{11} \frac{1}{3^{m_1}} \frac{1}{3^{m_2}} : \frac{10^7}{11} \frac{1}{6^{m_1}} \frac{1}{6^{m_2}}:\frac{1}{10^6}\frac{99^{m_1}}{100^{m_1}} \frac{99^{m_2}}{100^{m_2}}\]</span></p>
<p>getting the same result.</p>
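<p>The agreement of batch and sequential updates is exact, and easy to confirm with rational arithmetic; the split <span class="math inline">\(m=m_1+m_2\)</span> below is arbitrary:</p>
<pre><code>from fractions import Fraction as F

# prior odds A : B : C (the approximate odds from the text)
priors = [F(10**6, 11), F(10**7, 11), F(1)]
# likelihood of one bad widget under A, B, C
liks = [F(1, 3), F(1, 6), F(99, 100)]

m1, m2 = 2, 3   # arbitrary split of m = 5 bad widgets

# batch: update on all m bad widgets at once
batch = [p * l**(m1 + m2) for p, l in zip(priors, liks)]

# sequential: update on m1 bad widgets, then on m2 more
seq = [p * l**m1 for p, l in zip(priors, liks)]
seq = [s * l**m2 for s, l in zip(seq, liks)]

assert batch == seq
</code></pre>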
<p>It is only when merging <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> into a single hypothesis <span class="math inline">\(\overline{C}=A+B\)</span> that we have trouble: individual draws are not independent under <span class="math inline">\(\overline{C}\)</span>. To illustrate, consider <span class="math inline">\(m=2\)</span>. Firstly, the prior odds of <span class="math inline">\(C: \overline{C}\)</span> are</p>
<p><span class="math display">\[C_0:\overline{C}_0=C_0:(A_0+B_0)= 1 : \left(\frac{10^6}{11} + \frac{10^7}{11}\right)=1:10^6\]</span></p>
<p>The likelihood of the first widget being bad under <span class="math inline">\(C\)</span> is as before <span class="math inline">\(\frac{99}{100}\)</span>. The likelihood of it being bad under <span class="math inline">\(\overline{C}=A+B\)</span> is trickier: <span class="math inline">\(B\)</span> is 10 times more likely than <span class="math inline">\(A\)</span>, so the probability of a bad widget under <span class="math inline">\(A+B\)</span> is the weighted average of the probabilities under <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>, with the one under <span class="math inline">\(B\)</span> getting ten times more weight. This is simply the “total probability formula”</p>
<p><span class="math display">\[P(D_1|A+B)=P(D_1|A)P(A|A+B)+P(D_1|B)P(B|A+B)=\]</span></p>
<p><span class="math display">\[=\frac{1}{3}\frac{1}{11} +\frac{1}{6}\frac{10}{11}=\frac{2}{11}\]</span></p>
<p>After the first widget is drawn, the posterior odds are the prior odds times the likelihoods:</p>
<p><span class="math display">\[C_1:\overline{C}_1= 1 \cdot \frac{99}{100}: 10^6 \cdot \frac{2}{11}\]</span></p>
<p>We can check it’s the same result as one obtained by first updating the odds of <span class="math inline">\(A:B:C\)</span> and then summing the ones for <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>:</p>
<p><span class="math display">\[C_1:A_1+B_1=\frac{99^{1}}{100^{1}}: \frac{10^6}{11} \frac{1}{3^{1}} + \frac{10^7}{11} \frac{1}{6^{1}}=\frac{99}{100} :\frac{2\cdot10^6}{11} \]</span></p>
<p>Now <strong>if</strong> the likelihood of the second bad widget under <span class="math inline">\(\overline{C}\)</span> were again <span class="math inline">\(P(D_2|A+B)=\frac{2}{11}\)</span>, we would get the <strong>WRONG</strong> <span class="math inline">\(C_2:\overline{C}_2= \frac{99}{100}\frac{99}{100}: \frac{2\cdot10^6}{11} \frac{2}{11}\)</span>.</p>
<p>However, it is not: <span class="math inline">\(D_2\)</span> is not independent of <span class="math inline">\(D_1\)</span> under <span class="math inline">\(\overline{C}\)</span>. Having learned that the first draw was defective, we now think it relatively more likely that the batch came from <span class="math inline">\(A\)</span> rather than <span class="math inline">\(B\)</span>. Indeed, the odds after <span class="math inline">\(D_1\)</span> are <span class="math inline">\(A_1:B_1=1\cdot \frac{1}{3}:10\cdot \frac{1}{6}=1:5\)</span>, so <span class="math inline">\(P(A|(A+B)D_1)=\frac{1}{6}\)</span> and <span class="math inline">\(P(B|(A+B)D_1)=\frac{5}{6}\)</span>. Since <span class="math inline">\(A\)</span> now carries relatively more weight than before, the second widget is more likely to be bad:</p>
<p><span class="math display">\[P(D_2|(A+B)D_1)=\]</span></p>
<p><span class="math display">\[P(D_2|AD_1)P(A|(A+B)D_1)+P(D_2|BD_1)P(B|(A+B)D_1)=\]</span></p>
<p><span class="math display">\[=\frac{1}{3}\frac{1}{6} +\frac{1}{6}\frac{5}{6}=\frac{7}{36}> \frac{2}{11}\]</span></p>
<p>This means that seeing the second bad widget is giving somewhat less evidence for <span class="math inline">\(C\)</span> over <span class="math inline">\(\overline{C}\)</span> than seeing the first one did (some evidence in <span class="math inline">\(D_2\)</span> is “already” in <span class="math inline">\(D_1\)</span>).</p>
<p>We can now finish the computation for the second update:</p>
<p><span class="math display">\[ C_2 : \overline{C}_2 = \frac{99}{100} \frac{99}{100} : \frac{2\cdot10^6}{11} \frac{7}{36}=\frac{99^2}{100^2}: \frac{14 \cdot 10^6}{11\cdot 6^2}\]</span></p>
<p>and compare it to the “batch” calculation:</p>
<p><span class="math display">\[C_2:A_2+B_2= \frac{99^2}{100^2}: \frac{10^6}{11}\frac{1}{3^2}+ \frac{10^7}{11} \frac{1}{6^2}=\frac{99^2}{100^2}: \frac{14 \cdot 10^6}{11\cdot 6^2}.\]</span></p>
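<p>The whole two-step computation, including the changed likelihood <span class="math inline">\(7/36\)</span>, can be reproduced in exact arithmetic and compared with the batch result:</p>
<pre><code>from fractions import Fraction as F

pA, pB = F(1, 11), F(10, 11)   # priors of A and B (within C-bar = A + B)
lA, lB = F(1, 3), F(1, 6)      # P(bad | A), P(bad | B)
lC = F(99, 100)                # P(bad | C)
odds0 = F(1, 10**6)            # prior odds C : C-bar

# first bad widget: mixture likelihood under C-bar
l1 = pA*lA + pB*lB
odds1 = odds0 * lC / l1

# posterior weights of A and B within C-bar after the first bad widget
wA = pA*lA / (pA*lA + pB*lB)
wB = 1 - wA

# second bad widget: the mixture likelihood has changed (D2 depends on D1)
l2 = wA*lA + wB*lB
odds2_seq = odds1 * lC / l2

# batch: update A, B, C on both widgets at once, then merge A and B
odds2_batch = lC**2 / (10**6 * (pA*lA**2 + pB*lB**2))

assert l1 == F(2, 11) and l2 == F(7, 36)
assert odds2_seq == odds2_batch
</code></pre>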
<h2 id="most-of-exercise-4.2">Most of Exercise 4.2</h2>
<p>Suppose the prior odds of hypothesis <span class="math inline">\(H_i\)</span> are <span class="math inline">\(s_i\)</span>, the probability of a good widget under <span class="math inline">\(H_i\)</span> is <span class="math inline">\(p_i\)</span>, and that of a bad one is <span class="math inline">\(q_i=1-p_i\)</span>.</p>
<p>For <span class="math inline">\(n_0\)</span> bad widgets and <span class="math inline">\(n_1\)</span> good ones the posterior odds are</p>
<p><span class="math display">\[ \ldots : L_i : \ldots = \ldots : s_i p_i^{n_1} q_i^{n_0}: \ldots\]</span></p>
<p>If the fraction of bad widgets is <span class="math inline">\(f\)</span> then we have <span class="math inline">\(n_0=f n\)</span>, <span class="math inline">\(n_1=(1-f)n\)</span> and posterior odds</p>
<p><span class="math display">\[ \ldots :L_i: \ldots = \ldots : s_i (p_i^{1-f})^n (q_i^{f})^{n}: \ldots\]</span></p>
<p>For two different hypotheses the ratio <span class="math inline">\(L_i: L_j\)</span> does not go to zero or infinity precisely when <span class="math inline">\(p_i^{1-f} q_i^{f}=p_j^{1-f} q_j^{f}\)</span>. Solving for <span class="math inline">\(f\)</span> we get:</p>
<p><span class="math display">\[\frac{p_i}{p_j}=(\frac{q_j p_i}{q_i p_j})^f\]</span></p>
<p><span class="math display">\[\ln \frac{p_i}{p_j}=f\ln \frac{q_j p_i}{q_i p_j}\]</span></p>
<p><span class="math display">\[f= \frac{\ln \frac{p_i}{p_j}}{\ln \frac{q_j}{q_i}+\ln \frac{p_i}{p_j}}\]</span></p>
<p><span class="math display">\[f=\frac{1}{\frac{\ln \frac{q_j}{q_i}}{\ln \frac{p_i}{p_j}}+1}\]</span></p>
<p>For the case in section 4.4 we have <span class="math inline">\(p_C=1/100\)</span>, <span class="math inline">\(p_A=2/3\)</span>, <span class="math inline">\(q_C=99/100\)</span>, <span class="math inline">\(q_A=1/3\)</span>, and, taking <span class="math inline">\(i=A\)</span> and <span class="math inline">\(j=C\)</span>,</p>
<p><span class="math inline">\(f=\frac{1}{\frac{\ln \frac{q_j}{q_i}}{\ln \frac{p_i}{p_j}}+1}=\frac{1}{\frac{\ln \frac{297}{100}}{\ln \frac{200}{3} }+1}\approx 0.7941552598\)</span></p>
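<p>A one-line numerical check of this threshold (<code>math.log</code> is the natural logarithm):</p>
<pre><code>import math

pA, qA = 2/3, 1/3
pC, qC = 1/100, 99/100

# f = 1 / (ln(q_C/q_A)/ln(p_A/p_C) + 1), taking i = A and j = C
f = 1 / (math.log(qC/qA) / math.log(pA/pC) + 1)
assert math.isclose(f, 0.7941552598, abs_tol=1e-8)
</code></pre>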
<h3 id="additional-notes-on-4.2-proof">Additional notes on 4.2 proof</h3>
<p>The likelihood of hypothesis B is irrelevant to calculating the threshold <span class="math inline">\(f_t\)</span> because the point where the evidence is balanced (in the limit) between hypotheses A and C only depends on the relationship between those two likelihoods. The prior probability of C is also irrelevant to the calculation.</p>
<p><strong>Note</strong>: the value of <span class="math inline">\(f_t\)</span> given in the book is incorrect, the correct answer is above. This is noted in the <a href="http://ksvanhorn.com/bayes/jaynes/node7.html">errata</a>.</p>
<h2 id="some-of-exercise-4.3">Some of Exercise 4.3</h2>
<p>We now consider the case of 4 hypotheses: <span class="math inline">\(A, B, C\)</span> as before and <span class="math inline">\(F\)</span>. We remark that, since we will study the evolution of our beliefs in the scenario where about <span class="math inline">\(1/4\)</span> of widgets are defective, neither <span class="math inline">\(C\)</span> nor the hypothetical <span class="math inline">\(E\)</span> from the first part of this exercise (under which most widgets are good) is of any importance, so we can in fact just study how <span class="math inline">\(F\)</span> fares against <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>.</p>
<p>The likelihoods of the data of <span class="math inline">\(n/4\)</span> bad widgets and <span class="math inline">\(3n/4\)</span> good ones under <span class="math inline">\(A:B:F\)</span> are</p>
<p><span class="math display">\[ \left[\left(\frac{1}{3}\right)^{\frac{1}{4}} \left(\frac{2}{3}\right)^{\frac{3}{4}} \right]^n: \left[\left(\frac{1}{6}\right)^{\frac{1}{4}} \left(\frac{5}{6}\right)^{\frac{3}{4}} \right]^n: \left[\left(\frac{1}{4}\right)^{\frac{1}{4}} \left(\frac{3}{4}\right)^{\frac{3}{4}} \right]^n \]</span></p>
<p><span class="math display">\[0.5606^n: 0.5573^n:0.5699^n\]</span></p>
<p>So the evidence for <span class="math inline">\(F\)</span> is (in Jaynesian db) <span class="math inline">\(10(\log_{10} 0.5699^n- \log_{10} (0.5606^n+ 0.5573^n))\)</span>, which for large <span class="math inline">\(n\)</span> is <span class="math inline">\(n\cdot 10 \log_{10} \frac{0.56989}{0.56061} \approx 0.0713\, n \approx \frac{n}{14}\)</span>. So we would need about 14 samples per decibel of evidence in favor of <span class="math inline">\(F\)</span>, i.e. several hundred samples before something like 20 db of evidence for <span class="math inline">\(F\)</span> over <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> can be collected.</p>
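<p>A quick computation (a sketch, not part of the original text) reproduces these per-sample likelihoods and the evidence rate:</p>

```python
import math

# Per-sample (geometric-mean) likelihoods under A, B, F for a long run
# in which exactly 1/4 of the widgets are bad.
lA = (1 / 3) ** 0.25 * (2 / 3) ** 0.75
lB = (1 / 6) ** 0.25 * (5 / 6) ** 0.75
lF = (1 / 4) ** 0.25 * (3 / 4) ** 0.75

# For large n the A term dominates B, so the evidence rate for F is
# essentially 10 log10(lF / lA) db per sample.
db_per_sample = 10 * math.log10(lF / lA)
print(round(lA, 4), round(lB, 4), round(lF, 4))  # 0.5606 0.5573 0.5699
print(db_per_sample)  # ~0.071 db per sample, i.e. about 14 samples per db
```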
<h2 id="exercise-4.4">Exercise 4.4</h2>
<p>This problem is simplified if we assume that <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> are the only hypotheses under consideration. Then evidence for <span class="math inline">\(A\)</span> is evidence against <span class="math inline">\(B\)</span> and vice versa. In addition, the individual widget tests are independent both under <span class="math inline">\(A=\overline{B}\)</span> and under <span class="math inline">\(B=\overline{A}\)</span>.</p>
<p>So each negative test gives <span class="math inline">\(10 \log_{10} \frac{1/3}{1/6}\approx 3\)</span> db of evidence for <span class="math inline">\(A\)</span>. If <span class="math inline">\(B\)</span> is true, the probability of this is <span class="math inline">\(\frac{1}{6}\)</span>.</p>
<p>And a positive test gives <span class="math inline">\(10 \log_{10} \frac{2/3}{5/6}\approx-0.97\)</span> db, i.e. evidence against <span class="math inline">\(A\)</span>. If <span class="math inline">\(B\)</span> is true, the probability of this is <span class="math inline">\(\frac{5}{6}\)</span>.</p>
<p>We now indeed have a random walk. We start at <span class="math inline">\(0\)</span> db of evidence, and then at each step we either, with probability <span class="math inline">\(\frac{1}{6}\)</span> take a step of size <span class="math inline">\(3\)</span>, or, with probability <span class="math inline">\(\frac{5}{6}\)</span> take a step of size <span class="math inline">\(-0.97\)</span>.</p>
<p>We now ask: 1) what is the probability of ever getting to <span class="math inline">\(20\)</span> db? 2) how long will it take to get <span class="math inline">\(-20\)</span> db?</p>
<p>This is an exercise in “Wald theory” a la A. Wald, “Sequential analysis” or W. Feller, “An introduction to probability theory and its applications”, Vol 1 Ch. XIV., Vol 2 Ch.XII (Abraham Wald is the statistician of the “holes you see in the returning bombers are in the areas where a bomber can take a hit and still come back, it’s all the other areas you should worry about, you dolts!” fame; someone really should make a movie about his life.)</p>
<p>One gets a simplification by rounding <span class="math inline">\(-0.97\)</span> to <span class="math inline">\(-1\)</span>, for then the possible amounts of evidence are all integers, and the random walk is on <span class="math inline">\(\mathbb{Z}\)</span>. Moreover, the resulting random walk is “skip-free to the left” - one cannot move left by more than one unit at a time. There are more or less standard approaches for analyzing such a walk.</p>
<p>(Without this simplification an exact solution becomes intractable, but one can get various approximations. We derive only a very basic such approximation at the end of this write-up; the problem is discussed further by Wald and in Vol. II of Feller.)</p>
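<p>Before doing any analysis, a brief Monte Carlo simulation of the rounded walk gives a feel for the answers to both questions. This is only an illustrative sketch; the two-sided stopping rule makes the numbers differ slightly from the one-sided calculations below.</p>

```python
import random

# The rounded random walk: +3 db w.p. 1/6, -1 db w.p. 5/6, starting at 0,
# stopped once it reaches >= +20 db (decide for A) or hits -20 db (for B).
random.seed(0)

def run_walk():
    pos, steps = 0, 0
    while -20 < pos < 20:
        pos += 3 if random.random() < 1 / 6 else -1
        steps += 1
    return pos, steps

trials = [run_walk() for _ in range(100_000)]
p_reach_20 = sum(pos >= 20 for pos, _ in trials) / len(trials)
mean_steps = sum(steps for _, steps in trials) / len(trials)
print(p_reach_20)  # ~0.006: reaching 20 db for A is very unlikely
print(mean_steps)  # ~59: the walk is typically absorbed in about 60 steps
```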
<h5 id="probability-of-getting-20-db-for-a.">Probability of getting 20 db for <span class="math inline">\(A\)</span>.</h5>
<p>Let <span class="math inline">\(p_i\)</span> be the probability that the walk will ever reach the state <span class="math inline">\(i\)</span> decibel.</p>
<p>Firstly, the expected step is <span class="math inline">\(3\cdot\frac{1}{6}+(-1)\cdot\frac{5}{6}=-\frac{1}{3}\)</span>, which is negative, so by the strong law of large numbers the walk goes off to <span class="math inline">\(-\infty\)</span> with probability 1. Since the walk is skip-free to the left, the only way to reach <span class="math inline">\(-\infty\)</span> starting from 0 is to pass through every negative value, so <span class="math inline">\(p_i=1\)</span> for all <span class="math inline">\(i\leq 0\)</span>.</p>
<p>In addition, this means that if we denote by <span class="math inline">\(X_j\)</span> the set of all walks whose highest state reached is <span class="math inline">\(j\)</span>, then the <span class="math inline">\(X_j\)</span> are disjoint and <span class="math inline">\(X=\cup_{j\geq 0} X_j\)</span> has probability <span class="math inline">\(1\)</span> (each walk that converges to <span class="math inline">\(-\infty\)</span> has a well-defined maximum and so belongs to exactly one <span class="math inline">\(X_j\)</span>). This means that <span class="math inline">\(p_i\leq P(\cup_{j\geq i} X_j)=\sum_{j\geq i}P(X_j)\)</span>, which converges to zero as <span class="math inline">\(i \to \infty\)</span>.</p>
<p>The above two paragraphs give boundary conditions for <span class="math inline">\(p_i\)</span>. Now let’s find a relation between the different <span class="math inline">\(p_i\)</span>s (for positive <span class="math inline">\(i\)</span>). Condition on the first step of the walk. Either, with probability <span class="math inline">\(\frac{1}{6}\)</span>, it is 3 to the right, and what remains is to move <span class="math inline">\(i-3\)</span> further to the right, the probability of which is <span class="math inline">\(p_{i-3}\)</span>. Or, with probability <span class="math inline">\(\frac{5}{6}\)</span>, it is 1 to the left, and the walk now needs to move <span class="math inline">\(i+1\)</span> to the right, the probability of which is <span class="math inline">\(p_{i+1}\)</span>. So</p>
<p><span class="math display">\[p_{i}=\frac{1}{6} p_{i-3}+ \frac{5}{6} p_{i+1}\]</span></p>
<p>The first few are <span class="math inline">\(p_1=\frac{1}{6}+\frac{5}{6} p_{2}\)</span>, <span class="math inline">\(p_2=\frac{1}{6}+\frac{5}{6} p_{3}\)</span>, <span class="math inline">\(p_3=\frac{1}{6}+\frac{5}{6} p_4\)</span>, <span class="math inline">\(p_4=\frac{1}{6}p_1+\frac{5}{6} p_{5}\)</span> etc.</p>
<p>Remark: <span class="math inline">\(\frac{1}{6} p_{-3}+ \frac{5}{6} p_{1}=\frac{1}{6}+ \frac{5}{6} p_{1}\)</span> is the probability of returning to <span class="math inline">\(0\)</span> <strong>after</strong> the first step, and so is <strong>not</strong> equal to <span class="math inline">\(p_0=1\)</span>.</p>
<p>There is a standard way to solve such recurrence relations, and even several variations on it.</p>
<p><strong>In a more direct way</strong>, one starts with the ansatz <span class="math inline">\(p_i=r^i\)</span> to get the characteristic equation</p>
<p><span class="math display">\[ r^3= \frac{1}{6}+ \frac{5}{6} r^4 .\]</span></p>
<p>This has roots <span class="math inline">\(r_1=1\)</span>, <span class="math inline">\(r_2\approx 0.78237\)</span>, <span class="math inline">\(r_{3,4}\approx -0.29 \pm 0.41 i\)</span> (so <span class="math inline">\(|r_{3,4}|\approx 0.51\)</span>). A general solution is a linear combination of these, <span class="math inline">\(p_i=c_1 r_1^i+c_2 r_2^i+c_3 r_3^i+ c_4 r_4^i\)</span>. The boundary condition <span class="math inline">\(\lim_{i \to \infty} p_i=0\)</span> forces <span class="math inline">\(c_1=0\)</span>, and the other conditions <span class="math inline">\(p_0=p_{-1}=p_{-2}=1\)</span> give</p>
<p><span class="math display">\[\begin{pmatrix} 1 &1 &1\\r_2^{-1}&r_3^{-1}&r_4^{-1}\\r_2^{-2}&r_3^{-2}&r_4^{-2} \end{pmatrix}\begin{pmatrix}c_2\\c_3\\c_4\end{pmatrix}=\begin{pmatrix}1\\1\\1\end{pmatrix}\]</span></p>
<p><span class="math display">\[\begin{pmatrix}c_2\\c_3\\c_4\end{pmatrix}=\begin{pmatrix} 1 &1 &1\\r_2^{-1}&r_3^{-1}&r_4^{-1}\\r_2^{-2}&r_3^{-2}&r_4^{-2} \end{pmatrix}^{-1}\begin{pmatrix}1\\1\\1\end{pmatrix}\]</span></p>
<p>I seem to get <span class="math inline">\(c_4=0.07492831-0.0259995i\)</span>, <span class="math inline">\(c_3=0.07492831+0.0259995i\)</span>, <span class="math inline">\(c_2=0.85014338\)</span>. Since the complex roots have much smaller modulus, their contribution decays quickly and we can write <span class="math inline">\(p_i\approx 0.85 \cdot (0.7823728)^i\)</span>. For <span class="math inline">\(i=20\)</span> we have</p>
<p><span class="math display">\[p_{20}\approx 0.006276157.\]</span></p>
<p>This can probably be described as “negligibly small”.</p>
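<p>The root-finding and the linear solve above are easy to reproduce; a sketch using numpy (an illustration, not part of the original solution):</p>

```python
import numpy as np

# Characteristic equation (5/6) r^4 - r^3 + 1/6 = 0, from p_i = r^i.
roots = np.roots([5 / 6, -1, 0, 0, 1 / 6])
rs = roots[~np.isclose(roots, 1)]  # drop r_1 = 1, whose coefficient must vanish

# Boundary conditions p_0 = p_{-1} = p_{-2} = 1 determine c_2, c_3, c_4.
M = np.array([rs ** 0, rs ** -1.0, rs ** -2.0])
c = np.linalg.solve(M, np.ones(3, dtype=complex))

p20 = (c * rs ** 20).sum().real
print(p20)  # ~0.00628
```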
<p><strong>Alternatively</strong> (Feller, Chapter XI.1 and XI.4, particularly Example b), one can define <span class="math inline">\(Q(s)=\sum_{i=-2}^{\infty} p_i s^i\)</span> and observe that starting from</p>
<p><span class="math display">\[p_{i}=\frac{1}{6} p_{i-3}+ \frac{5}{6} p_{i+1}\]</span></p>
<p>multiplying by <span class="math inline">\(s^i\)</span>, and summing over <span class="math inline">\(i\geq 1\)</span>, using <span class="math inline">\(p_{-2}=p_{-1}=p_0=1\)</span> and <span class="math inline">\(p_1=p\)</span> gives:</p>
<p><span class="math display">\[Q(s)-1-s^{-1}-s^{-2}= \frac{1}{6}s^3Q(s)+\frac{5}{6}(\frac{1}{s}(Q(s)-sp-1-s^{-1}-s^{-2}))\]</span></p>
<p><span class="math display">\[\frac{5}{6}(ps^3+s^2+s+1)-(s^3+s^2+s)=Q(s)s^2(\frac{1}{6}s^4-s+\frac{5}{6})\]</span></p>
<p><span class="math display">\[Q(s)s^2=\frac{\frac{5}{6}(ps^3+s^2+s+1)-(s^3+s^2+s)}{\frac{1}{6}s^4-s+\frac{5}{6}}\]</span></p>
<p>This means that <span class="math inline">\(Q(s)s^2=\frac{U(s)}{V(s)}\)</span> is a rational function and has the partial fraction expansion <span class="math inline">\(\sum_{k=1}^{4}\frac{a_k}{s_k-s}\)</span> with <span class="math inline">\(s_k\)</span> the roots of <span class="math inline">\(V(s)\)</span> (luckily, in our case the roots of <span class="math inline">\(V\)</span> are distinct) and</p>
<p><span class="math display">\[a_k=\frac{-U(s_k)}{V'(s_k)}\]</span></p>
<p>and</p>
<p><span class="math display">\[p_{i-2}=\frac{a_1}{s_1^{i+1}}+\frac{a_2}{s_2^{i+1}}+\frac{a_3}{s_3^{i+1}}+\frac{a_4}{s_4^{i+1}}.\]</span></p>
<p>We note that the roots <span class="math inline">\(s_k\)</span> of <span class="math inline">\(V(s)\)</span> are the inverses of the roots <span class="math inline">\(r_k\)</span> in our previous “direct approach”, and the formula above is thus identical to the one we got before.</p>
<p>One of the roots of <span class="math inline">\(V(s)\)</span> is <span class="math inline">\(1\)</span>, but the corresponding coefficient <span class="math inline">\(a_1\)</span> must be <span class="math inline">\(0\)</span>, since otherwise <span class="math inline">\(p_i\)</span> would not tend to zero. This gives <span class="math inline">\(U(1)=0\)</span>, meaning <span class="math inline">\(\frac{5}{6}(p+3)-3=0\)</span>, or <span class="math inline">\(p=\frac{3}{5}.\)</span></p>
<p>The next-smallest root of <span class="math inline">\(V\)</span> in absolute value is <span class="math inline">\(s_2\approx 1.278\)</span>. The corresponding coefficient is</p>
<p><span class="math display">\[a_2=\frac{-U(s_2)}{V'(s_2)}\approx\frac{0.696046}{0.392091}\approx 1.775215\]</span></p>
<p>giving</p>
<p><span class="math display">\[p_{20}\approx \frac{1.775215}{(1.2781)^{23}}\approx 0.0062772128\]</span></p>
<p>as before (within the precision of the calculations). We could compute <span class="math inline">\(a_3\)</span> and <span class="math inline">\(a_4\)</span> to make sure they are not huge and that ignoring the other terms is justified, but we have already seen this in the previous method, so we omit the calculation.</p>
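<p>The coefficient <span class="math inline">\(a_2\)</span> can likewise be checked numerically (a sketch; the value of <span class="math inline">\(s_2\)</span> is the root quoted above, and <span class="math inline">\(p=3/5\)</span> has been substituted into <span class="math inline">\(U\)</span>):</p>

```python
import numpy as np

# U(s) = (5/6)(p s^3 + s^2 + s + 1) - (s^3 + s^2 + s) with p = 3/5,
# V(s) = (1/6) s^4 - s + 5/6.
U = np.poly1d([-1 / 2, -1 / 6, -1 / 6, 5 / 6])
V = np.poly1d([1 / 6, 0, 0, -1, 5 / 6])

s2 = 1.278163072798  # second-smallest root of V in absolute value
a2 = -U(s2) / V.deriv()(s2)
print(a2)             # ~1.7752
print(a2 / s2 ** 23)  # ~0.00628, the value of p_20 found above
```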
<p><strong>Remark</strong>: The fact that one can use these two methods to compute the coefficients <span class="math inline">\(c_k\)</span> in the expansion of <span class="math inline">\(p_i\)</span> means that the inverse of the Vandermonde matrix is computable via partial fractions expansion. This has been noted in the literature.</p>
<h5 id="time-to-get-20-db-for-b.">Time to get 20 db for <span class="math inline">\(B\)</span>.</h5>
<p><strong>For a shortcut, skip to “Wald identities” at the bottom.</strong></p>
<p>Since the walk is skip-free to the left, to get <span class="math inline">\(20\)</span> db for <span class="math inline">\(B\)</span> one has to first get <span class="math inline">\(19\)</span> db, and for that one needs to first get <span class="math inline">\(18\)</span> db, etc. So we start by finding the distribution of the time to get <span class="math inline">\(1\)</span> db for <span class="math inline">\(B\)</span>. Let <span class="math inline">\(T\)</span> be the (random) time until the walk hits <span class="math inline">\(-1\)</span>. We define the formal power series</p>
<p><span class="math display">\[f(x)=E(x^T)=xP(T=1)+x^2P(T=2)+...\]</span></p>
<p>encoding the distribution of <span class="math inline">\(T\)</span>. We have <span class="math inline">\(f(1)=1\)</span> and <span class="math inline">\(f(0)=0\)</span>. We also have <span class="math inline">\(f'(1)=E(T)\)</span> and <span class="math inline">\(f''(1)=E(T(T-1))\)</span> and so on.</p>
<p>Remark: This annoying <span class="math inline">\(T^2-T\)</span> as opposed to just <span class="math inline">\(T\)</span> can be avoided by changing variables <span class="math inline">\(x=e^{u}\)</span> (giving moment generating function of <span class="math inline">\(T\)</span>), or, even better, <span class="math inline">\(x=e^{iu}\)</span> (giving characteristic function of <span class="math inline">\(T\)</span>). This is in some ways nicer, but we stick to generating function version and its polynomial look and feel.</p>
<p>Then the distribution of (random) time to hit <span class="math inline">\(-k\)</span> is given by writing <span class="math inline">\(T_k=Z_1+Z_2+\ldots+Z_k\)</span> with <span class="math inline">\(Z_i\)</span> i.i.d. with same distribution as <span class="math inline">\(T\)</span>, so that the formal power series encoding the distribution of <span class="math inline">\(T_k\)</span> is</p>
<p><span class="math display">\[f_k(x)= E(x^{T_k})=E(x^{\sum Z_i})=E(\prod x^{Z_i})=\prod E( x^{Z_i})=f(x)^k\]</span></p>
<p>Now consider <span class="math inline">\(T\)</span> again, and condition on the first step of the walk. With probability <span class="math inline">\(\frac{5}{6}\)</span> it is to the left and <span class="math inline">\(T=1\)</span>. With probability <span class="math inline">\(\frac{1}{6}\)</span> it is to the right, and then <span class="math inline">\(T\)</span> is distributed as encoded by <span class="math inline">\(xf(x)^4\)</span> (the extra <span class="math inline">\(x\)</span> is for the one extra time beat that the first step used). So overall <span class="math inline">\(T\)</span>, being a mixture of these two possibilities, is encoded by <span class="math inline">\(\frac{5}{6}x+\frac{1}{6}xf(x)^4\)</span>. Hence</p>
<p><span class="math display">\[f(x)=\frac{5}{6}x+\frac{1}{6}xf(x)^4\]</span></p>
<p>One could write <span class="math inline">\(f(x)=c_1x+c_2x^2+...\)</span> and plug in to get</p>
<p><span class="math display">\[c_1x+c_2x^2+...=\frac{5}{6}x +\frac{1}{6}x(c_1x+c_2x^2+...)^4\]</span></p>
<p>which can be rewritten as an “upper triangular” system of equations to be solved one after another, getting <span class="math inline">\(c_1=\frac{5}{6}, c_2=c_3=c_4=0, c_5=\frac{1}{6}c_1^4=\frac{5^4}{6^5}, c_6= \frac{1}{6}4c_1^3c_2=0\)</span> etc. Then, of course, it is <span class="math inline">\(f(x)^{20}\)</span> that gives the distribution of the time to hit <span class="math inline">\(20\)</span> db that we are after.</p>
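<p>These coefficients can be generated mechanically; a small sketch using exact rational arithmetic (not part of the original solution):</p>

```python
from fractions import Fraction

# Solve f(x) = (5/6)x + (1/6)x f(x)^4 for the power-series coefficients
# c_0..c_N by fixed-point iteration on truncated series.
N = 12

def mul(a, b):
    """Product of two truncated power series (lists of coefficients)."""
    out = [Fraction(0)] * (N + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if ai and bj and i + j <= N:
                out[i + j] += ai * bj
    return out

f = [Fraction(0)] * (N + 1)
for _ in range(4):  # each iteration fixes at least 4 more coefficients
    f2 = mul(f, f)
    f4 = mul(f2, f2)
    new = [Fraction(0)] * (N + 1)
    new[1] = Fraction(5, 6)
    for k in range(N):  # multiply (1/6) f^4 by x and add
        new[k + 1] += Fraction(1, 6) * f4[k]
    f = new

print(f[1], f[5])  # 5/6 625/7776, i.e. c_1 = 5/6 and c_5 = 5^4/6^5
```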
<p>Abandoning the pursuit of the full distribution of <span class="math inline">\(T_{20}\)</span>, we instead compute its first and second moments.</p>
<p>First we deal with the moments of <span class="math inline">\(T\)</span>. Differentiating the equation for <span class="math inline">\(f\)</span> we have</p>
<p><span class="math display">\[f'=\frac{5}{6}+\frac{1}{6}f^4+\frac{1}{6}x 4 f' f^3\]</span></p>
<p>Since <span class="math inline">\(f(1)=1\)</span> we get <span class="math inline">\(f'(1)=1+\frac{4}{6}f'(1)\)</span>, <span class="math inline">\(f'(1)=3\)</span>.</p>
<p>We have <span class="math inline">\(E(T_{20})=(f^{20})'(1)=20 f'(1)f(1)^{19}=20\times 3=60\)</span>. This tells us the expected time to get 20 db for <span class="math inline">\(B\)</span> is 60 samples. (For general <span class="math inline">\(k\)</span> we get <span class="math inline">\(E(T_k)=3k\)</span>.)</p>
<p>We will now find the variance (first of <span class="math inline">\(T\)</span>, and then of <span class="math inline">\(T_{20}\)</span>).</p>
<p>Differentiating again</p>
<p><span class="math display">\[f''= \frac{1}{6}4f' f^3+ \frac{1}{6} 4 f' f^3+\frac{1}{6}x\, 4 f'' f^3+ \frac{1}{6}x\, 12 (f')^2 f^2\]</span></p>
<p>from which <span class="math inline">\(f''(1)=2+2+\frac{2}{3}f''(1)+18\)</span>, <span class="math inline">\(f''(1)=66\)</span>. Thus <span class="math inline">\(E(T^2)=69\)</span> and <span class="math inline">\(Var(T)=E(T^2)-E(T)^2=60\)</span>, and standard deviation is <span class="math inline">\(\sqrt{60}\approx 7.75\)</span></p>
<p>Remark 1: This low expectation and high variance are due to the high probability of <span class="math inline">\(T=1\)</span>; conditional on <span class="math inline">\(T\neq 1\)</span> the expectation of <span class="math inline">\(T\)</span> is <span class="math inline">\((3-\frac{5}{6})/\frac{1}{6}=13\)</span> (which coincides with the (unconditional) expectation of <span class="math inline">\(T_4\)</span> plus 1, i.e. <span class="math inline">\(3\cdot 4+1=13\)</span>). The conditional variance is the variance of <span class="math inline">\(T_4\)</span>, which is <span class="math inline">\(240\)</span>, so the standard deviation is <span class="math inline">\(\approx 15.5\)</span> (see below). This still indicates a high probability of low values and a “fat tail”, but a bit less so than in the unconditional case.</p>
<p>On the other hand</p>
<p><span class="math display">\[ (f^{20})''=(20 f^{19}f')'=20( 19 (f')^2f^{18}+f^{19}f'')=20(19\cdot 9+66 )=4740\]</span></p>
<p>so <span class="math inline">\(E(T_{20}^2)=4740+60=4800\)</span> and <span class="math inline">\(Var(T)=4800-60^2=1200\)</span>, and standard deviation <span class="math inline">\(\sqrt{1200}\approx34.64\)</span>, which is more reasonable. It indicates that we may get lucky and get 20 db evidence much sooner than 60 tests, but may get unlucky and have to wait substantially longer, maybe as long as 130 trials or even more. One could compute further moments using the same method (with a longer calculation) to get more of an idea.</p>
<p>For general <span class="math inline">\(k\)</span> we get <span class="math inline">\(E(T_k^2)=k((k-1)9+66) +3k\)</span>, <span class="math inline">\(Var(T_k)=60k\)</span></p>
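<p>A quick Monte Carlo simulation (an illustrative sketch) agrees with these moments for <span class="math inline">\(k=20\)</span>:</p>

```python
import random

# Hitting time of -k for the rounded walk: +3 w.p. 1/6, -1 w.p. 5/6.
random.seed(1)

def hitting_time(k):
    pos, t = 0, 0
    while pos > -k:
        pos += 3 if random.random() < 1 / 6 else -1
        t += 1
    return t

ts = [hitting_time(20) for _ in range(100_000)]
mean = sum(ts) / len(ts)
var = sum((t - mean) ** 2 for t in ts) / len(ts)
print(mean)  # ~60, matching E(T_20) = 3k
print(var)   # ~1200, matching Var(T_20) = 60k
```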
<p><strong>Alternative solution using Wald identities.</strong></p>
<p>We found, in general, <span class="math inline">\(E(T_k)=3k\)</span>. This can be deduced much more quickly from the general first Wald identity: if <span class="math inline">\(T\)</span> is a stopping time with respect to i.i.d. <span class="math inline">\(Z_i\)</span>s with <span class="math inline">\(E(T)\)</span> finite, and <span class="math inline">\(X=\sum_{i=1}^T Z_i\)</span>, then</p>
<p><span class="math display">\[E(X)=E(T)E(Z_1).\]</span></p>
<p>In our case, assuming <span class="math inline">\(E(T_k)\)</span> is finite, we have <span class="math inline">\(E(Z_1)=3\cdot\frac{1}{6}-\frac{5}{6}=-\frac{1}{3}\)</span>, <span class="math inline">\(X=-k\)</span> (always), so <span class="math inline">\(E(T_k)=3k\)</span>.</p>
<p>We still need to check <span class="math inline">\(E(T_k)<\infty\)</span>. This follows from (a generalization of) our analysis of the random walk on <span class="math inline">\(\mathbb{Z}\)</span>, but in fact Hoeffding’s inequality implies an exponentially small bound on the probability that <span class="math inline">\(T_k>n\)</span>, so <span class="math inline">\(T_k\)</span> actually has all moments finite (for a general study of moments of <span class="math inline">\(T_k\)</span> see <a href="https://projecteuclid.org/euclid.aop/1176996709">this paper of A. Gut</a>).</p>
<p>Wald’s second identity (under the same assumption that <span class="math inline">\(E(T)\)</span> is finite) is</p>
<p><span class="math display">\[E[(X-E(Z_i)T)^2]=Var(Z_i)E(T).\]</span></p>
<p>For us <span class="math inline">\(X=-k\)</span>, <span class="math inline">\(E(Z_i)=-\frac{1}{3}\)</span>, <span class="math inline">\(E(T_k)=3k\)</span> and</p>
<p><span class="math display">\[Var(Z_i)=\frac{1}{6}\left(3+\frac{1}{3}\right)^2+\frac{5}{6}\left(-1+\frac{1}{3}\right)^2=\frac{100}{54}+\frac{20}{54}=\frac{120}{54}\]</span></p>
<p>Plugging in, this gives:</p>
<p><span class="math display">\[k^2 -2 (-k) (-1/3) 3k +1/9 E(T_k^2)=120/54 \cdot 3k \]</span></p>
<p><span class="math display">\[ -9k^2 +E(T_k^2)=60k \]</span></p>
<p><span class="math display">\[E(T_k^2)=9k^2+60k\]</span></p>
<p><span class="math display">\[ Var (T_k) = 9k^2+60k - 9k^2=60k\]</span></p>
<p>as before.</p>
<p>One advantage of using this method is that it can handle the unapproximated problem as well.</p>
<p>Indeed, if we keep the leftward step at <span class="math inline">\(0.97\)</span> instead of rounding it to <span class="math inline">\(1\)</span>, then <span class="math inline">\(E(Z_1)\)</span> changes slightly, to <span class="math inline">\(-\frac{1}{3}+\frac{5}{6}\cdot 0.03=\frac{-200+15}{600}=-\frac{37}{120}\)</span>, and <span class="math inline">\(E(X)\)</span> is no longer exactly <span class="math inline">\(-k\)</span> but lies between <span class="math inline">\(-k\)</span> and <span class="math inline">\(-k-0.97\)</span>, so we get the estimate <span class="math inline">\(\frac{120}{37}k\leq E(T_k)< \frac{120}{37}(k+0.97).\)</span> A similar estimate can be found for the variance of <span class="math inline">\(T_k\)</span> as well.</p>
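<p>By Wald’s first identity the unrounded walk must then satisfy <span class="math inline">\(k/|E(Z_1)|\leq E(T_k)<(k+0.97)/|E(Z_1)|\)</span>; a quick simulation (an illustrative sketch) confirms this for <span class="math inline">\(k=20\)</span>:</p>

```python
import random

# Walk with the exact steps: +3 db w.p. 1/6, -0.97 db w.p. 5/6.
random.seed(2)
EZ = 3 * (1 / 6) - 0.97 * (5 / 6)  # = -37/120 db per sample

def hitting_time(k):
    pos, t = 0.0, 0
    while pos > -k:
        pos += 3 if random.random() < 1 / 6 else -0.97
        t += 1
    return t

k = 20
ts = [hitting_time(k) for _ in range(50_000)]
mean = sum(ts) / len(ts)
lo, hi = k / abs(EZ), (k + 0.97) / abs(EZ)
print(lo, mean, hi)  # the simulated mean lies inside the Wald bracket
```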
<h2 id="exercise-4.5">Exercise 4.5</h2>
<p>It seems that from a Bayesian perspective the conjecture is tautological. If “reliability of conclusion” means some precise minimum posterior odds ratio in favor of the <span class="math inline">\(i\)</span>th hypothesis, then one has to keep collecting data until that odds ratio is achieved. So, tautologically, the minimum average sample number (ASN) is obtained by the procedure in which one stops collecting further data as soon as the requisite odds ratio (for any of the hypotheses under consideration) is attained for the first time.</p>
<p>From the frequentist point of view the situation is more complicated. Instead of a posterior probability or odds for a specific hypothesis, the frequentist guarantee on a procedure takes the form of bounds on the probabilities of Type I errors (falsely rejecting a true hypothesis; low p value) and Type II errors (accepting a false hypothesis; high power), and a procedure that conforms to these bounds is evaluated based on its ASN. It is relatively easy to see that for testing binary hypotheses, if the Bayesian/sequential procedure always hits the required posterior odds exactly, without overshooting (for example, when each datum contributes plus or minus 1 db of evidence and the bounds are integers), then the sequential test has the smallest ASN, and thus, in general, it has close to the smallest ASN (this is Section A.7 of Wald’s “Sequential Analysis” book). Without such an assumption, however, the optimality of sequential tests is harder to establish. Nonetheless it was done by Wald and Wolfowitz in 1948 (“Optimum character of the sequential probability ratio test”).</p>
<h2 id="formula-4.67">Formula 4.67</h2>
<p>“The complete beta function” <span class="math inline">\(\int_0^1 f^n (1-f)^{N-n}\,df\)</span> can be evaluated using Bayes’ billiards; see the notes on Chapter 6 for details.</p>
</body>
</html>