<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<title>chapter7</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
</style>
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" type="text/javascript"></script>
<!--[if lt IE 9]>
<script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
<![endif]-->
</head>
<body>
<h1 id="the-central-gaussian-or-normal-distribution">The central, Gaussian or normal distribution</h1>
<p><span class="math inline">\(\leftarrow\)</span> <a href="./index.html">Back to Chapters</a></p>
<h2 id="the-herschelmaxwell-derivation-and-kinetic-energy.">The Herschel–Maxwell derivation and kinetic energy.</h2>
<p>Observe that if we postulate that the kinetic energy <span class="math inline">\(E\)</span> of a point particle is a function of its velocity <span class="math inline">\(\vec{v}=(v_x, v_y)\)</span>, then we can write postulates analogous to the ones in Section 7.2:</p>
<ul>
<li><p>(P1) The total energy is the sum of the energies due to the <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> motion: <span class="math inline">\(E(x,y)=E(x,0)+E(0,y)\)</span></p></li>
<li><p>(P2) This energy should be independent of the angle: <span class="math inline">\(E(r, \theta) = E(r)\)</span></p></li>
</ul>
<p>These are log versions of the conditions in Section 7.2, so the function that satisfies them, <span class="math inline">\(E(x,y)=c (x^2+y^2)\)</span>, is (minus) the log of a Gaussian density. This fact is of use in <a href="https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo">Hamiltonian Monte Carlo</a>.</p>
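<p>As a quick numerical sanity check (a minimal sketch; the constant <span class="math inline">\(c=1/2\)</span> and the sample velocity are arbitrary choices of mine), the energy <span class="math inline">\(E(x,y)=c(x^2+y^2)\)</span> indeed satisfies both postulates:</p>

```python
import numpy as np

# Kinetic-energy ansatz E(x, y) = c*(x^2 + y^2); c = 1/2 is an arbitrary choice
def E(vx, vy, c=0.5):
    return c * (vx**2 + vy**2)

vx, vy = 1.7, -0.4  # an arbitrary velocity

# (P1) additivity: E(x, y) = E(x, 0) + E(0, y)
assert np.isclose(E(vx, vy), E(vx, 0.0) + E(0.0, vy))

# (P2) rotational invariance: E depends only on the speed, not the angle
for theta in np.linspace(0.0, 2 * np.pi, 7):
    rx = np.cos(theta) * vx - np.sin(theta) * vy
    ry = np.sin(theta) * vx + np.cos(theta) * vy
    assert np.isclose(E(rx, ry), E(vx, vy))
```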
<h2 id="exercise-7.1">Exercise 7.1</h2>
<p>TO DO</p>
<!---
Possible interpretations:
1) We literally take $\epsilon_n=\sum_1^n \varepsilon_i$ with $\varepsilon_i\sim \epsilon$ and i.i.d. Then $\langle \epsilon_n^2 \rangle=n \langle \epsilon^2\rangle$.
2) We suppose $\epsilon=\sum_{i=1}^{n}\epsilon_i$ of i.i.d. random noises $\epsilon_i$, with $\epsilon_i$ therefore satisfying $\langle \epsilon_i \rangle=0$ and $n\langle \epsilon_i^2 \rangle=\langle \epsilon^2 \rangle$.
--->
<h2 id="exercise-7.2">Exercise 7.2</h2>
<p>Consider the family of distributions <span class="math inline">\(p_{\mu, \sigma}(v)\)</span>.</p>
<p>We want to express the fact that the convolution of <span class="math inline">\(p_{\mu, \sigma}(v)\)</span> with <span class="math inline">\(q(v)\)</span> still belongs to the same family. I will prefer to work with the parameter <span class="math inline">\(\nu=\sigma^2\)</span> instead. To avoid confusion, the variable <span class="math inline">\(v\)</span> will be replaced by <span class="math inline">\(x\)</span>. So we work with the family <span class="math inline">\(p_{\mu, \nu}(x)\)</span>.</p>
<p>Let <span class="math inline">\(\mu_q=\langle \epsilon\rangle_q\)</span> be the mean and <span class="math inline">\(\nu_q=\langle \epsilon^2\rangle_q-\langle \epsilon\rangle_q^2\)</span> the variance of <span class="math inline">\(q\)</span>. Then the new distribution must be <span class="math inline">\(p_{\mu+\mu_q, \nu+\nu_q}(x)\)</span>. At the same time the expansion 7.20 becomes</p>
<p><span class="math display">\[p_{\mu+\mu_q, \nu+\nu_q}(x)=p_{\mu, \nu}(x)-\mu_q \frac{\partial}{\partial x} p_{\mu, \nu}(x)+\frac{1}{2}(\nu_q+\mu_q^2)\frac{\partial^2}{\partial^2 x} p_{\mu, \nu}(x)+\ldots\]</span></p>
<p>Taylor expanding around <span class="math inline">\(\mu, \nu\)</span></p>
<p><span class="math display">\[p_{\mu+\mu_q, \nu+\nu_q}(x)= p_{\mu, \nu}(x)+\mu_q\frac{\partial}{\partial \mu}p_{\mu, \nu}(x)+ \nu_q\frac{\partial}{\partial \nu}p_{\mu, \nu}(x)+ \]</span></p>
<p><span class="math display">\[\frac{1}{2} \mu_q^2 \frac{\partial^2}{\partial^2 \mu}p_{\mu, \nu}(x) +\frac{1}{2}\nu_q^2 \frac{\partial^2}{\partial^2 \nu}p_{\mu, \nu}(x)+ \mu_q \nu_q \frac{\partial^2}{\partial \mu \partial \nu} p_{\mu, \nu}(x)+\ldots\]</span></p>
<p>Now <strong>if</strong> we wanted this to be true for arbitrary (small) <span class="math inline">\(\mu_q, \nu_q\)</span> we would have equality of Taylor coefficients:</p>
<p><span class="math display">\[-\frac{\partial}{\partial x}p_{\mu, \nu}(x)=\frac{\partial}{\partial \mu}p_{\mu, \nu}(x)\]</span></p>
<p><span class="math display">\[\frac{1}{2}\frac{\partial^2}{\partial^2 x} p_{\mu, \nu}(x)=\frac{\partial}{\partial \nu}p_{\mu, \nu}(x)= \frac{1}{2} \frac{\partial^2}{\partial^2 \mu}p_{\mu, \nu}(x)\]</span></p>
<p>where the first equation says that <span class="math inline">\(p_{\mu, \nu}(x)\)</span> is a function of <span class="math inline">\(x-\mu\)</span> and not of <span class="math inline">\(\mu\)</span> and <span class="math inline">\(x\)</span> separately, <span class="math inline">\(p_{\mu, \nu}(x)=f_\nu(x-\mu)\)</span>. From this <span class="math inline">\(\frac{\partial^2}{\partial^2 x} p_{\mu, \nu}(x)=\frac{\partial^2}{\partial^2 \mu} p_{\mu, \nu}(x)\)</span> follows, and we simply recover the more general Gaussian family <span class="math inline">\(p_{\mu,\nu}(x)=\frac{1}{\sqrt{2 \pi \nu}} \exp\{-\frac{(x-\mu)^2}{2 \nu}\}\)</span> as in 7.23.</p>
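<p>The two Taylor-coefficient equations, and the fact that the Gaussian family satisfies them, can be verified symbolically; a sketch using sympy (my choice of tool here):</p>

```python
import sympy as sp

x, mu = sp.symbols('x mu', real=True)
nu = sp.symbols('nu', positive=True)

# the Gaussian family p_{mu, nu}(x) of 7.23
p = sp.exp(-(x - mu)**2 / (2 * nu)) / sp.sqrt(2 * sp.pi * nu)

# -dp/dx = dp/dmu (p is a function of x - mu only)
assert sp.simplify(sp.diff(p, mu) + sp.diff(p, x)) == 0

# (1/2) d^2p/dx^2 = dp/dnu = (1/2) d^2p/dmu^2
assert sp.simplify(sp.diff(p, nu) - sp.diff(p, x, 2) / 2) == 0
assert sp.simplify(sp.diff(p, nu) - sp.diff(p, mu, 2) / 2) == 0
```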
<p>However, <strong>if</strong> we instead think of <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\nu\)</span> as <span class="math inline">\(\mu(t)\)</span> and <span class="math inline">\(\nu(t)\)</span>, so that the family <span class="math inline">\(p_t(x)=p_{\mu(t), \nu(t)}(x)\)</span> is a single-parameter family, then the expansions become expansions in terms of <span class="math inline">\(t\)</span>: with <span class="math inline">\(\mu_q(t)=\mu_q'(0)t+O(t^2)\)</span>, <span class="math inline">\(\nu_q(t)=\nu_q'(0)t+O(t^2)\)</span></p>
<p><span class="math display">\[p_{\mu+\mu_q(t), \nu+\nu_q(t)}(x)=p_{\mu, \nu}(x)+[-\mu_q'(0) \frac{\partial}{\partial x} p_{\mu, \nu}(x)+\frac{1}{2}\nu_q'(0)\frac{\partial^2}{\partial^2 x} p_{\mu, \nu}(x)]t+O(t^2)\]</span></p>
<p><span class="math display">\[p_{\mu+\mu_q(t), \nu+\nu_q(t)}(x)=p_t(x)=p_{\mu, \nu}(x)+\frac{\partial}{\partial t}p_{t}(x)\Big|_{t=0}\, t+O(t^2)\]</span></p>
<!--
$$p_{\mu+\mu_q(t), \nu+\nu_q(t)}(x)= p_{\mu, \nu}(x)+[\mu_q'\frac{\partial}{\partial \mu}p_{\mu, \nu}(x)+ \nu_q'\frac{\partial}{\partial \nu}p_{\mu, \nu}(x)]t+ o(t^2)$$
--->
<p>Equating Taylor coefficients:</p>
<p><span class="math display">\[ \frac{\partial}{\partial t}p_{t}(x)= -\mu_q' \frac{\partial}{\partial x} p_{t}(x)+\frac{1}{2}\nu_q'\frac{\partial^2}{\partial^2 x} p_{t}(x)\]</span></p>
<p>This is a <a href="https://en.wikipedia.org/wiki/Fokker%E2%80%93Planck_equation">Fokker–Planck equation</a>, albeit a very special one, with <span class="math inline">\(\mu(x,t)=\mu_q'(0)\)</span>, <span class="math inline">\(\sigma^2(x,t)=\nu_q'(0)\)</span>, corresponding to the stochastic process where the drift <span class="math inline">\(\mu\)</span> and the diffusion coefficient <span class="math inline">\(\nu/2\)</span> are both constant. Denote <span class="math inline">\(\mu_q'(0)=m\)</span> and <span class="math inline">\(\nu_q'(0)=v\)</span>.</p>
<p>Changing coordinates to <span class="math inline">\(y(x,t)=x-mt\)</span> aka <span class="math inline">\(x(y,t)=y+mt\)</span>, we have <span class="math display">\[p_t(x(y,t))=p_t(y+mt)=:q_t(y)\]</span></p>
<p>and compute by the chain rule</p>
<p><span class="math display">\[\frac{\partial}{\partial t}q_{t}(y)=\frac{\partial}{\partial t}p_{t}(x(y,t))=\frac{\partial}{\partial t}p_{t}(y+mt)+ m \frac{\partial}{\partial x}p_{t}(y+mt)\]</span></p>
<p>while <span class="math display">\[\frac{1}{2}v\frac{\partial^2}{\partial^2 y} q_{t}(y)=\frac{1}{2}v\frac{\partial^2}{\partial^2 y} p_{t}(x(y,t))=\frac{1}{2}v\frac{\partial^2}{\partial^2 x} p_{t}(y+mt) \]</span></p>
<p>So the substitution we made reduces the Fokker–Planck equation we have (with drift) to the diffusion equation (without drift), i.e. 7.22 (with <span class="math inline">\(\sigma^2=t\)</span>), which by 7.23 has the solution <span class="math inline">\(q_t(y)=\frac{1}{\sqrt{2\pi t }}\exp\{-\frac{y^2}{2t}\}\)</span>, or, after substituting back,</p>
<p><span class="math display">\[p_t(x)=\frac{1}{\sqrt{2\pi t }}\exp\{-\frac{(x-mt)^2}{2t}\}\]</span></p>
<p>This has variance <span class="math inline">\(\sigma^2=t\)</span> so we can rewrite it as <span class="math inline">\(p(x)=\frac{1}{\sqrt{2\pi \sigma^2 }}\exp\{-\frac{(x-m\sigma^2)^2}{2\sigma^2}\}\)</span>, as in the formulation of the exercise.</p>
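<p>That this <span class="math inline">\(p_t(x)\)</span> solves the constant-coefficient Fokker–Planck equation (here with <span class="math inline">\(v=1\)</span>, since <span class="math inline">\(\sigma^2=t\)</span>) can also be checked symbolically; a sympy sketch:</p>

```python
import sympy as sp

x, m = sp.symbols('x m', real=True)
t = sp.symbols('t', positive=True)

# drifting Gaussian: mean m*t, variance t
p = sp.exp(-(x - m * t)**2 / (2 * t)) / sp.sqrt(2 * sp.pi * t)

# Fokker-Planck with constant drift m and diffusion coefficient 1/2 (v = 1)
residual = sp.diff(p, t) + m * sp.diff(p, x) - sp.diff(p, x, 2) / 2
assert sp.simplify(residual) == 0
```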
<p><!---
$$=\mu_q'\frac{\partial}{\partial \mu}p_{\mu, \nu}(x)+ \nu_q'\frac{\partial}{\partial \nu}p_{\mu, \nu}(x)$$
---></p>
<h2 id="exercise-7.3">Exercise 7.3, preliminary remarks</h2>
<p>I interpret this as follows. We are now considering the evolution of beliefs about the noise-generating process, and want to update these beliefs based on data and see that, as the amount of data grows, our posterior beliefs about the noise-generating process imply beliefs about frequencies of noise that converge to those produced by the true frequencies <span class="math inline">\(f(e)\)</span>.</p>
<p>In order to talk about the above meaningfully in a mathematical sense, one needs to define some space of noise-generating processes in such a way that formulating a prior probability over it is possible, and also to define in what sense the posterior beliefs about frequencies “converge”.</p>
<p>Since we only observe real-valued noise data (and no other type of information about the noise-generating processes), the space of noise-generating processes can be taken to be the set of real-valued stochastic processes. This is still very large. One could either 1) take only noise-generating processes that generate noise in an i.i.d. way, or 2) focus only on the beliefs about “frequencies”, i.e. how often values close to various <span class="math inline">\(e\)</span>s are observed, or both.</p>
<p>The intuition is that those processes that produce an incorrect frequency of <span class="math inline">\(e\)</span>s will be suppressed in the update (due to a mechanism like that in Borel’s law of large numbers), leaving only the processes that do have the correct frequencies in the “posterior distribution over processes”, whatever that means. To illustrate some of the difficulties, consider the simple case of discrete time and processes that just i.i.d. sample from some probability distribution. Consider a distribution which moreover has a density <span class="math inline">\(p\)</span>. Then a particular dataset <span class="math inline">\(e_1, \ldots, e_N\)</span> has likelihood <span class="math inline">\(\prod_{i=1}^{N} p(e_i)\)</span>. At each finite <span class="math inline">\(N\)</span>, the closer our distribution <span class="math inline">\(p\)</span> is to the empirical distribution <span class="math inline">\(\frac{1}{N}\sum_i \delta_{e_i}\)</span>, the higher the likelihood; this seems problematic. Perhaps one has to discretize the set of possible <span class="math inline">\(e\)</span>s and then take a limit, or simply use some more sophisticated analysis.</p>
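<p>The discretization idea can be made concrete in a toy version (entirely my own construction, not from the text): bin the noise values, put a flat Dirichlet prior on the bin frequencies, and watch the posterior mean frequencies approach the true ones as the amount of i.i.d. data grows.</p>

```python
import numpy as np
from math import erf

rng = np.random.default_rng(1)
edges = np.linspace(-4.0, 4.0, 17)               # 16 bins for the noise values e
Phi = lambda z: 0.5 * (1 + erf(z / np.sqrt(2)))  # standard normal CDF
true_f = np.diff([Phi(z) for z in edges])        # "true frequencies" f per bin

alpha = np.ones(len(true_f))                     # flat Dirichlet prior over bin frequencies
errors = []
for N in (100, 100_000):
    counts, _ = np.histogram(rng.standard_normal(N), bins=edges)
    post_mean = (alpha + counts) / (alpha + counts).sum()  # Dirichlet posterior mean
    errors.append(np.abs(post_mean - true_f).max())

# more data: posterior beliefs about frequencies get closer to the true ones
assert errors[1] < errors[0]
```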
<h2 id="footnote-12">Footnote 12</h2>
<h3 id="general-distributions">General distributions</h3>
<p><strong>Sum of weights.</strong></p>
<p>We are doing MLE for the location parameter <span class="math inline">\(\mu\)</span>, meaning <span class="math inline">\(p(\mu, y)=f(\mu-y)\)</span>. The likelihood <span class="math inline">\(L_{\vec{y}}(\mu)=\prod_i f(\mu-y_i)\)</span>.</p>
<p>Suppose we have, for all <span class="math inline">\(\vec{y}\)</span> near <span class="math inline">\(\vec{y}_0\)</span>, an isolated local minimum of <span class="math inline">\(L_{\vec{y}}\)</span> at <span class="math inline">\(\hat{\mu}(\vec{y})\)</span>. Suppose further we can write <span class="math inline">\(\hat{\mu}(\vec{y}) = \sum y_i w_i(\vec{y})\)</span> with <span class="math inline">\(w_i\)</span> (continuous) functions of the differences <span class="math inline">\(y_k-y_l\)</span>. Shift each <span class="math inline">\(y_i\)</span> by a small <span class="math inline">\(\delta\)</span>, to <span class="math inline">\(\tilde{y}_i=y_i+\delta\)</span>. Then, since <span class="math inline">\(\mu\)</span> is a location parameter, likelihood is also shifted: <span class="math inline">\(L_{\tilde{\vec{y}}}(\mu+\delta)=L_{\vec{y}}(\mu)\)</span> so</p>
<p><span class="math display">\[\hat{\mu}(\tilde{\vec{y}})=\hat{\mu}(\vec{y})+\delta.\]</span></p>
<p>On the other hand we have <span class="math inline">\(\hat{\mu}(\vec{y})=\sum y_i w_i(\vec{y})\)</span>, and since all <span class="math inline">\(w_i(\vec{y})\)</span> are unchanged by the shift, plugging in <span class="math inline">\(\tilde{\vec{y}}\)</span> we get</p>
<p><span class="math display">\[\hat{\mu}(\tilde{\vec{y}})=\sum (y_i+\delta)w_i=\hat{\mu}(y)+\delta(\sum w_i(\vec{y})).\]</span></p>
<p>We conclude <span class="math inline">\(\sum_i w_i(\vec{y})=1\)</span>.</p>
<p><strong>Is MLE for location parameter a weighted average?</strong></p>
<p><strong>Cases <span class="math inline">\(n=1\)</span> and <span class="math inline">\(n=2\)</span>.</strong></p>
<p>Now we look at whether <span class="math inline">\(\hat{\mu}(\vec{y}) = \sum y_i w_i(\vec{y})\)</span> is indeed true.</p>
<p>It is instructive to examine the case <span class="math inline">\(n=1\)</span>. Then we are just looking at local optima of <span class="math inline">\(f(\mu-y_1)\)</span>. If <span class="math inline">\(f\)</span> is unimodal with unique maximum at <span class="math inline">\(0\)</span>, then <span class="math inline">\(\hat{\mu}=y_1\)</span> and of course <span class="math inline">\(w\)</span> is, as prescribed, a function of no inputs that “sums” to 1, i.e. is the constant 1. For general <span class="math inline">\(f\)</span> we have various local maxima <span class="math inline">\(m_j\)</span> and MLEs <span class="math inline">\(\hat{\mu}_j=y_1+m_j\)</span>, which are not in the required form. It may be prudent to assume unimodality and a peak at <span class="math inline">\(0\)</span> for this problem.</p>
<p>For <span class="math inline">\(n=2\)</span> such a form <span class="math inline">\(\hat{\mu}(y_1, y_2)= y_1 w_1(y_1-y_2)+y_2 w_2(y_1-y_2)\)</span> is achievable, as follows. Denote <span class="math inline">\(\frac{y_1-y_2}{2}\)</span> by <span class="math inline">\(y\)</span>. If <span class="math inline">\(y=0\)</span> the two data points coincide and <span class="math inline">\(L_{\vec{y}}(\mu)=f(\mu-y_1)^2\)</span>, which has the same extrema as <span class="math inline">\(f(\mu-y_1)\)</span>: we effectively have only one data point, a case which we have examined above. So below we assume <span class="math inline">\(y\neq 0\)</span>.</p>
<p>Then by translating by <span class="math inline">\(\delta=\frac{y_1+y_2}{2}\)</span> we have <span class="math inline">\(\hat{\mu}(y_1, y_2)=\frac{y_1+y_2}{2}+\hat{\mu}(-y, y)\)</span>. Write <span class="math inline">\(\hat{\mu}(-y, y)=y\, w(y_1-y_2)\)</span>, with <span class="math inline">\(w\)</span> regarded as a function of the difference <span class="math inline">\(y_1-y_2=2y\)</span>. Then</p>
<p><span class="math display">\[\hat{\mu}(y_1, y_2)=\frac{y_1+y_2}{2}+\hat{\mu}(-y, y)\]</span></p>
<p><span class="math display">\[=\frac{y_1+y_2}{2}+\frac{y_1-y_2}{2} w(y_1-y_2)=\]</span></p>
<p><span class="math display">\[y_1(w_1(y_1-y_2)) +y_2 w_2(y_1-y_2)\]</span></p>
<p>where</p>
<p><span class="math display">\[w_1(y_1-y_2)=\frac{1}{2}+\frac{1}{2}w(y_1-y_2)\]</span> <span class="math display">\[w_2(y_1-y_2)=\frac{1}{2}-\frac{1}{2}w(y_1-y_2).\]</span></p>
<p>We note that if <span class="math inline">\(f\)</span> is unimodal with maximum at <span class="math inline">\(0\)</span> then <span class="math inline">\(|\hat{\mu}(-y,y)|\leq |y|\)</span>, so <span class="math inline">\(|w|\leq 1\)</span> and so <span class="math inline">\(w_1, w_2\geq 0.\)</span></p>
<p><strong>MLE as weighted average, <span class="math inline">\(n>2\)</span>.</strong></p>
<p>For larger <span class="math inline">\(n\)</span> we can still look at <span class="math inline">\(\delta=\frac{1}{n} \sum y_i\)</span>, take <span class="math inline">\(\tilde{\vec{y}}=(y_1-\delta, \ldots, y_n-\delta)\)</span>. Observe that <span class="math inline">\(\tilde{\vec{y}}\)</span> is a vector that depends only on <span class="math inline">\(\vec{d}\)</span> where <span class="math inline">\(\vec{d}\)</span> is the vector of all pairwise differences <span class="math inline">\(y_k-y_l\)</span>.</p>
<p>Now</p>
<p><span class="math display">\[\hat{\mu}(\vec{y})=\delta+\hat{\mu}(\tilde{\vec{y}})=\delta+\mu(\vec{d})\]</span></p>
<p>Pick any <span class="math inline">\(\vec{\alpha}=(\alpha_1, \ldots, \alpha_n)\)</span> with <span class="math inline">\(\sum \alpha_i=0\)</span> and <span class="math inline">\(\vec{\alpha}\cdot \vec{y}\neq 0\)</span>. Then <span class="math inline">\(\vec{\alpha}\cdot \vec{y}=\sum \alpha_i y_i\)</span> is a function of <span class="math inline">\(\vec{d}\)</span>.</p>
<p>We can then define <span class="math inline">\(w(\vec{d})\)</span> via</p>
<p><span class="math display">\[\mu(\vec{d})= (\vec{\alpha}\cdot \vec{y}) w(\vec{d})\]</span></p>
<p>and recover</p>
<p><span class="math display">\[\hat{\mu}(\vec{y})=\sum y_i w_i(\vec{d})\]</span></p>
<p>with</p>
<p><span class="math display">\[w_i=\frac{1}{n}+\alpha_i w(\vec{d})\]</span></p>
<p>for all <span class="math inline">\(i\)</span>.</p>
<p>This is non-unique if <span class="math inline">\(n>2\)</span> (there are many <span class="math inline">\(\vec{\alpha}\)</span> leading to different <span class="math inline">\(w_i(\vec{d})\)</span>).</p>
<p>Locally near a specific <span class="math inline">\(\vec{y}\)</span> we can pick <span class="math inline">\(\alpha=\tilde{\vec{y}}\)</span>.</p>
<p>Then <span class="math inline">\(w(\vec{d})=\frac{\hat{\mu}(\tilde{\vec{y}})}{|\tilde{\vec{y}}|^2 }\)</span>. Note that <span class="math inline">\(|\tilde{\vec{y}}|^2 \geq \tilde{y}^2_{min}+\tilde{y}^2_{max}\geq 2 |\tilde{y}_{min} \tilde{y}_{max}|\)</span>, and for unimodal <span class="math inline">\(f\)</span> peaked at zero we still have <span class="math inline">\(\tilde{y}_{min}\leq\hat{\mu}(\tilde{\vec{y}})\leq \tilde{y}_{max}\)</span>. Putting these together we have</p>
<p><span class="math display">\[\alpha_i w(\vec{d})=\tilde{y}_i w(\vec{d}) \geq \frac{\tilde{y}_{min}\tilde{y}_{max}}{|\tilde{\vec{y}}|^2}\geq -\frac{1}{2}\]</span></p>
<p>implying <span class="math inline">\(w_i(\vec{d})\geq \frac{1}{n}-\frac{1}{2}\)</span>. Similarly,</p>
<p><span class="math display">\[\alpha_i w(\vec{d})\leq \max (\frac{\tilde{y}^2_{min}}{|\tilde{\vec{y}}|^2}, \frac{\tilde{y}^2_{max}}{|\tilde{\vec{y}}|^2})\leq \frac{n-1}{n}\]</span></p>
<p>implying <span class="math inline">\(w_i(\vec{d})\leq \frac{n-1}{n}+\frac{1}{n}=1\)</span>.</p>
<p>Perhaps a more clever choice can guarantee positivity of the <span class="math inline">\(w_i\)</span>s as well.</p>
<!--Moreover, it is not clear if one could pick $w_i(\vec{d})$ in such a way that $w_i(\vec{d})\geq 0$ for all $\vec{d}$.
--->
<p><strong>Cauchy distribution, mostly <span class="math inline">\(n=2\)</span>.</strong></p>
<p>We illustrate this in the case when the sampling distribution is Cauchy: <span class="math inline">\(p(y|\mu)=\frac{1}{\pi}\frac{1}{1+(y-\mu)^2}\)</span>.</p>
<p>For general <span class="math inline">\(n\)</span> and the sample <span class="math inline">\(y_1, \ldots, y_n\)</span> the likelihood is <span class="math inline">\(\prod_{i=1}^n \frac{1}{\pi}\frac{1}{1+(y_i-\mu)^2}\)</span>, and the log-likelihood is, up to a constant, <span class="math inline">\(L_{\vec{y}}(\mu)=\sum_i -\ln(1+(y_i-\mu)^2)\)</span>. The extremality condition is <span class="math inline">\(\frac{d}{d\mu}L_{\vec{y}}(\mu)=0\)</span>, i.e. <span class="math inline">\(\sum_i^n \frac{y_i-\mu}{1+(y_i-\mu)^2}=0.\)</span> This is in general equivalent to a degree <span class="math inline">\(2n-1\)</span> polynomial equation in <span class="math inline">\(\mu\)</span>: there are many local optima for the likelihood.</p>
<p>Consider now the case <span class="math inline">\(n=2\)</span>.</p>
<p>As before, we write <span class="math inline">\(y=\frac{y_1-y_2}{2}\)</span> and consider the problem shifted by <span class="math inline">\(\delta=\frac{y_1+y_2}{2}\)</span>.</p>
<p>The optimality equation becomes <span class="math inline">\(\frac{y-\hat{\mu}}{1+(y-\hat{\mu})^2}-\frac{y+\hat{\mu}}{1+(y+\hat{\mu})^2}=0\)</span>, so <span class="math inline">\(f(x)=\frac{x}{1+x^2}=\frac{1}{x+\frac{1}{x}}\)</span> satisfies <span class="math inline">\(f(y-\hat{\mu})=f(y+\hat{\mu})\)</span>. Either <span class="math inline">\(y-\hat{\mu}=y+\hat{\mu}\)</span>, i.e. <span class="math inline">\(\hat{\mu}=0\)</span>, or <span class="math inline">\((y-\hat{\mu})(y+\hat{\mu})=1\)</span>, i.e. <span class="math inline">\(\hat{\mu}=\pm\sqrt{y^2-1}\)</span>. This last pair of solutions is real only if <span class="math inline">\(|y|>1\)</span>.</p>
<p>Suppose that in fact <span class="math inline">\(|y|>1\)</span>. Then the reciprocal of the likelihood is a positive fourth-degree polynomial which we now know has three local extrema, <span class="math inline">\(-\sqrt{y^2-1}, 0, \sqrt{y^2-1}\)</span>. Therefore these extrema must be non-degenerate and be min, max, min. Correspondingly, the (log-)likelihood extrema must be max, min, max.</p>
<p>The MLE estimate is thus indifferent between <span class="math display">\[\hat{\mu}_1=\frac{y_1+y_2}{2}+\sqrt{(\frac{y_1-y_2}{2})^2-1}\]</span> and</p>
<p><span class="math display">\[\hat{\mu}_2=\frac{y_1+y_2}{2}-\sqrt{(\frac{y_1-y_2}{2})^2-1}.\]</span></p>
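<p>A quick numerical check of this picture (a sketch; the data pair <span class="math inline">\((y_1,y_2)=(0,6)\)</span> is an arbitrary choice of mine with <span class="math inline">\(|y|>1\)</span>):</p>

```python
import numpy as np

def nll(mu, ys):
    # Cauchy negative log-likelihood for location mu (additive constant dropped)
    return sum(np.log1p((yi - mu)**2) for yi in ys)

def score(mu, ys):
    # derivative condition: sum (y_i - mu) / (1 + (y_i - mu)^2)
    return sum((yi - mu) / (1 + (yi - mu)**2) for yi in ys)

ys = (0.0, 6.0)                      # arbitrary sample with |y| = |y1 - y2|/2 = 3 > 1
delta, y = 3.0, -3.0                 # (y1 + y2)/2 and (y1 - y2)/2
mu1 = delta + np.sqrt(y**2 - 1)
mu2 = delta - np.sqrt(y**2 - 1)

# all three stationary points satisfy the optimality equation
for mu in (mu1, mu2, delta):
    assert abs(score(mu, ys)) < 1e-12

# the two outer roots are tied global maxima of the likelihood...
assert np.isclose(nll(mu1, ys), nll(mu2, ys))
# ...and both beat the central stationary point mu = delta (a likelihood minimum)
assert nll(mu1, ys) < nll(delta, ys)
```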
<p>Let’s see how this can be written in the form <span class="math inline">\(\hat{\mu}=y_1w_1(y_1-y_2)+y_2w_2(y_1-y_2)\)</span>. We treat <span class="math inline">\(\hat{\mu}_1\)</span>.</p>
<p>As prescribed (here the argument of <span class="math inline">\(w\)</span> stands for the difference <span class="math inline">\(y_1-y_2\)</span>, i.e. <span class="math inline">\(2y\)</span>),</p>
<p><span class="math display">\[\frac{1}{2}w(2y)=\frac{\hat{\mu}(-y,y)}{2y}=\frac{\sqrt{y^2-1}}{2y}= sgn(2y)\sqrt{\frac{1}{4}-\frac{1}{(2y)^2}}\]</span></p>
<p>so</p>
<p><span class="math display">\[w_1(y)=\frac{1}{2}+ sgn(y) \sqrt{\frac{1}{4}-\frac{1}{y^2}}\]</span></p>
<p>and</p>
<p><span class="math display">\[w_2(y)=\frac{1}{2}- sgn(y) \sqrt{\frac{1}{4}-\frac{1}{y^2}}\]</span></p>
<p>and indeed,</p>
<p><span class="math display">\[y_1w_1(y_1-y_2)+y_2w_2(y_1-y_2)=\]</span></p>
<p><span class="math display">\[\frac{y_1+y_2}{2}+(y_1-y_2)\cdot sgn(y_1-y_2) \sqrt{\frac{1}{4}-\frac{1}{(y_1-y_2)^2}}=\]</span></p>
<p><span class="math display">\[\frac{y_1+y_2}{2}+\sqrt{(\frac{y_1-y_2}{2})^2-1}=\]</span></p>
<p><span class="math display">\[\hat{\mu}_1\]</span></p>
<p>Observe that <span class="math inline">\(w_1(y)+w_2(y)=1\)</span> as expected. Observe also that for <span class="math inline">\(\hat{\mu}_2\)</span> we would obtain <span class="math inline">\(\tilde{w}_1=w_2\)</span> and <span class="math inline">\(\tilde{w}_2=w_1\)</span>: the individual <span class="math inline">\(\hat{\mu}_i\)</span> and <span class="math inline">\(w_i\)</span> are not symmetric under exchange of the <span class="math inline">\(y_i\)</span>s, but the sets of <span class="math inline">\(\hat{\mu}\)</span>s and <span class="math inline">\(w\)</span>s are.</p>
<p><!---
**Some more on $n>2$.**
Still considering Cauchy distribution, but for general $n$, the inverse of likelihood is a polynomial in $\mu$ of degree $2n$ with coefficients some symmetric polynomials in $y_i$, and its derivative is a polynomial in $\mu$ of degree $2n-1$ with coefficients some symmetric polynomials in $y_i$, which we can write as $Q(\mu, y_1, \ldots, y_n)$. The hypersurface $Q(\mu, y_1, \ldots, y_n)=Q(\mu, \vec{y})=0$ in $\mathbb{R}^{n+1}$ projects to $\vec{y}=y_1, \ldots, y_n$. In the $\vec{y}$ space and if we exclude those $\vec{y}$ tuples for which the discriminant of $Q$ is zero we have over $\mathbb{R}^n\setminus D$ a covering map, with varying number of sheets $k$, being the number of soultions of $Q(\mu, \vec{y})=0$ in $\mu$ for given $\vec{k}$. Locally in $\mathbb{R}\setminus D$, there are $k$ "inverse of projection" maps $M:C\to \mathbb{R}$ that pick out a particular solution out of the $k$ in a continuous (and in fact smooth) way (a solution of $Q(\mu, \vec{y})=0$ is a local extremum of the likelihood function). Our task is to show that at least those $M$ that correspond to maxima of likeliehood can be written as $M(\vec{y})=\sum_i y_i w_i(\vec{d})$ where $\vec{d}$ is the vector of pairwise differences $y_k-y_l$ and $w_i(\vec{d})>0$. (Similar description applies to other distributions, except everything is no longer necessarily polynomial.)
I suspect considering the action of the permutation group on $y_i$ would be helpful.
----></p>
<h2 id="exercise-7.4">Exercise 7.4</h2>
<p>We are minimizing <span class="math inline">\(\vec{w}^T C \vec{w}\)</span> over the hyperplane <span class="math inline">\(\sum w_i=1\)</span>. Since <span class="math inline">\(C\)</span> is positive definite on <span class="math inline">\(\mathbb{R}^n\)</span>, the values at infinity are large and positive, so the minimum is achieved at a finite point and must satisfy the Lagrange multiplier equation <span class="math inline">\(C\vec{w}=\lambda \vec{1}\)</span>. Thus <span class="math inline">\(\vec{w}=\lambda C^{-1}\vec{1}\)</span>, and <span class="math inline">\(\sum w_i=1\)</span> gives, with <span class="math inline">\(K=C^{-1}\)</span>, the answer <span class="math inline">\(w_i=\sum_j K_{ij}/\sum_{i,j}K_{ij}\)</span>, as wanted (note that the denominator is <span class="math inline">\(\vec{1}^T K \vec{1}>0\)</span>).</p>
<p>The corresponding minimum value is</p>
<p><span class="math display">\[\vec{w}^T C \vec{w}=\lambda^2 (C^{-1}\vec{1})^T C (C^{-1}\vec{1})=\lambda^2\, \vec{1}^T K \vec{1}=\lambda=(\sum_{ij} K_{ij})^{-1}\]</span></p>
<p>However, this answer does not always satisfy the constraints <span class="math inline">\(w_i\geq 0\)</span>: consider <span class="math inline">\(C=\begin{pmatrix}1 & 4\\ 4& 17\end{pmatrix}\)</span> so that <span class="math inline">\(K=\begin{pmatrix}17 & -4\\ -4& 1\end{pmatrix}\)</span>; then <span class="math inline">\(\vec{w}=(1.3, -0.3)^T\)</span>.</p>
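<p>A minimal numerical sketch of this computation, including the counterexample (the helper name is mine):</p>

```python
import numpy as np

def min_variance_weights(C):
    # minimizer of w^T C w subject to sum(w) = 1 (sign constraints ignored):
    # w_i = sum_j K_ij / sum_ij K_ij with K = C^{-1}
    K = np.linalg.inv(C)
    return K.sum(axis=1) / K.sum()

C = np.array([[1.0, 4.0], [4.0, 17.0]])
w = min_variance_weights(C)

assert np.allclose(w, [1.3, -0.3])    # a negative weight: w_i >= 0 can fail
assert np.isclose(w.sum(), 1.0)
# the attained minimum equals (sum_ij K_ij)^(-1)
assert np.isclose(w @ C @ w, 1.0 / np.linalg.inv(C).sum())
```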
<h2 id="exercise-7.5">Exercise 7.5</h2>
<p>See <a href="https://www.cs.toronto.edu/~yuvalf/CLT.pdf">https://www.cs.toronto.edu/~yuvalf/CLT.pdf</a></p>
<h2 id="section">7.84</h2>
<p><span class="math display">\[\exp\{xa-\frac{a^2}{2}\}=\exp\{\frac{x^2}{2}\}\exp\{-\frac{(x-a)^2}{2}\}\]</span></p>
<p>so</p>
<p><span class="math display">\[\frac{d^n}{da^n} \exp\{xa-\frac{a^2}{2}\}=\exp\{\frac{x^2}{2}\} \frac{d^n}{da^n}\left(\exp\{-\frac{(x-a)^2}{2}\}\right)\]</span></p>
<h2 id="section-1">7.85</h2>
<p><span class="math display">\[\frac{\phi(x-a)\phi(x-b)}{\phi(x)}= \phi(x)\left(\sum_n R_n(x)\frac{a^n}{n!}\right)\left(\sum_m R_m(x)\frac{b^m}{m!}\right)\]</span></p>
<p>LHS: Observe <span class="math display">\[\phi(x-a)\phi(x-b)= \phi(x)\phi(x-(a+b)) \exp\{ab\}\]</span></p>
<p>Then</p>
<p><span class="math display">\[ \int \frac{\phi(x-a)\phi(x-b)}{\phi(x)} dx=\int \phi(x-(a+b)) \exp\{ab\} dx=\exp\{ab\}\]</span></p>
<p>As a power series in <span class="math inline">\(ab\)</span> it is <span class="math inline">\(\sum_i \frac{a^ib^i}{i!}\)</span>.</p>
<p>RHS:</p>
<p><span class="math inline">\(\sum_{n,m} (\int \phi(x) R_n(x)R_m(x)dx) \frac{a^n b^m}{n! m!}\)</span></p>
<p>Equating coefficients we get 7.85.</p>
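<p>Both sides of the computation above can be checked symbolically; a sympy sketch:</p>

```python
import sympy as sp

x, a, b = sp.symbols('x a b', real=True)
phi = lambda z: sp.exp(-z**2 / 2) / sp.sqrt(2 * sp.pi)

# the factorization phi(x-a) phi(x-b) = phi(x) phi(x-(a+b)) exp(ab)
assert sp.simplify(phi(x - a) * phi(x - b)
                   - phi(x) * phi(x - (a + b)) * sp.exp(a * b)) == 0

# hence the integral of phi(x-a) phi(x-b) / phi(x) over the real line is exp(ab)
integrand = sp.simplify(phi(x - a) * phi(x - b) / phi(x))
result = sp.integrate(integrand, (x, -sp.oo, sp.oo))
assert sp.simplify(result - sp.exp(a * b)) == 0
```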
<h2 id="section-2">7.86</h2>
<p>From 7.83 <span class="math inline">\(\phi(x-y)=\phi(x)\sum_m R_m(x)\frac{y^m}{m!}\)</span> so</p>
<p><span class="math display">\[\phi(x-y)R_n(x)=R_n(x)\phi(x)\sum_m R_m(x)\frac{y^m}{m!}\]</span></p>
<p>and integrating and using 7.85 we get</p>
<p><span class="math display">\[\int \phi(x-y) R_n(x)dx= y^n.\]</span></p>
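<p>Since 7.83’s generating function <span class="math inline">\(\exp\{xy-y^2/2\}\)</span> identifies <span class="math inline">\(R_n\)</span> with the probabilists’ Hermite polynomials <span class="math inline">\(He_n\)</span>, this identity can be checked numerically; a sketch using numpy’s <code>hermite_e</code> module (the quadrature is exact for these polynomial integrands):</p>

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# quadrature for the weight exp(-u^2/2): sum w_i f(u_i) ~ int exp(-u^2/2) f(u) du
nodes, weights = He.hermegauss(60)

def lhs(n, y):
    # int phi(x - y) He_n(x) dx, after substituting u = x - y
    coef = np.zeros(n + 1)
    coef[n] = 1.0                              # coefficient vector selecting He_n
    vals = He.hermeval(nodes + y, coef)
    return (weights @ vals) / np.sqrt(2 * np.pi)

for n in range(6):
    assert np.isclose(lhs(n, 0.7), 0.7**n)     # matches y^n
```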
<h2 id="section-3">7.89</h2>
<p><span class="math display">\[\exp\{xa\}\exp\{-a^2/2\}=\sum_k \frac{x^k a^k}{k!}\sum_m \frac{a^{2m}}{(-2)^m m!}\]</span></p>
<p>Collecting the terms with <span class="math inline">\(k+2m=n\)</span>, i.e. the coefficient of <span class="math inline">\(a^n\)</span>, gives 7.89.</p>
</body>
</html>