<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<title>chapter12</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
</style>
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" type="text/javascript"></script>
<!--[if lt IE 9]>
<script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
<![endif]-->
</head>
<body>
<h1 id="ignorance-priors-and-transformation-groups">Ignorance priors and transformation groups</h1>
<p><span class="math inline">\(\leftarrow\)</span> <a href="./index.html">Back to Chapters</a></p>
<h3 id="comments-on-12.4.1">Comments on 12.4.1</h3>
<p>I find the discussion in 12.4 somewhat challenging. “Statistical Decision Theory and Bayesian Analysis” by James O. Berger was a better reference for me.</p>
<p>To put the main issue up first: the prior 12.27 is in fact “best” in this problem, but not because 12.18 (which has multiple issues) is better than 12.30. Rather, the deduction of 12.36 from 12.30 is flawed, and a better analysis of 12.30 (using the framework of invariant decision rules) does indeed lead to 12.27!</p>
<p>Jaynes’s paragraph (page 381) trying to find a problem with 12.30 is itself flawed. There is in fact no preferred choice of the <span class="math inline">\(x=0\)</span> point. However, this does not invalidate 12.30. The equations 12.30 take a particular form once the origin <span class="math inline">\(x=0\)</span> is chosen, but even when the origin is not selected, the group of affine transformations encoded in 12.30 acts on the affine line <span class="math inline">\(\mathbb{A}^1\)</span>. This action is abstract, and is only represented in coordinates via 12.30. This is in contrast to 12.18, which represents an action of the commutative group <span class="math inline">\(\mathbb{R}\times \mathbb{R}_+\)</span> on the 2D space of parameters, as well as on the 3D space of “parameters together with coordinates”, but not any action on the space of <span class="math inline">\(x\)</span>s from which we draw the observed data. The idea of “rescaling from the current mean” encoded in 12.18 seems interesting, but hard to justify or generalize.</p>
<p>I flesh some of this out in the writeup below, which is largely based on Berger’s book, particularly sections 3.3.2 and 6.6.</p>
<hr />
<p>The word “invariance” presupposes a group action. The most natural setting is that in which a group <span class="math inline">\(G\)</span> acts on the space <span class="math inline">\(X\)</span> in which we get data.</p>
<hr />
<p>Example 1: (location-scale in 1D) Take <span class="math inline">\(X=\mathbb{A}^1\)</span> the affine line, and <span class="math inline">\(G\)</span> the group of (orientation-preserving) affine transformations of <span class="math inline">\(X\)</span>. As is common (for example in computer graphics), we represent elements of <span class="math inline">\(X\)</span> as vectors <span class="math inline">\(\vec{x}=\begin{pmatrix}x\\1\end{pmatrix}\)</span> (the line <span class="math inline">\(y=1\)</span> inside <span class="math inline">\(\mathbb{R}^2\)</span>) and then every element of <span class="math inline">\(G\)</span> has a unique representation as <span class="math inline">\(g=\begin{pmatrix}a&b\\0&1\end{pmatrix}\)</span> with <span class="math inline">\(a>0\)</span> (i.e. the set of linear transforms of <span class="math inline">\(\mathbb{R}^2\)</span> that take the line <span class="math inline">\(y=1\)</span> to itself in an orientation-preserving way) so that <span class="math display">\[g\cdot \vec{x}=\begin{pmatrix}a&b\\0&1\end{pmatrix}\begin{pmatrix}x\\1\end{pmatrix} =\begin{pmatrix}ax+b\\1\end{pmatrix}\]</span></p>
<p>is indeed affine-linear.</p>
<p>The product in <span class="math inline">\(G\)</span> is then:</p>
<p><span class="math display">\[\begin{pmatrix}a&b\\0&1\end{pmatrix} \begin{pmatrix}a'&b'\\0&1\end{pmatrix}=\begin{pmatrix}aa'&ab'+b\\0&1\end{pmatrix}\]</span></p>
<p>This is a non-commutative group.</p>
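<p>As a sanity check (my own illustration, not from the text), the group law and its non-commutativity can be verified numerically, representing an affine map by its pair <span class="math inline">\((a,b)\)</span>:</p>
<pre><code>import math

def compose(g1, g2):
    # (a, b)·(a', b') = (a a', a b' + b), matching the matrix product above
    a1, b1 = g1
    a2, b2 = g2
    return (a1 * a2, a1 * b2 + b1)

def act(g, x):
    # the affine action x -&gt; a x + b
    a, b = g
    return a * x + b

g = (2.0, 1.0)   # x -&gt; 2x + 1
h = (3.0, -1.0)  # x -&gt; 3x - 1

# composition matches acting twice: (g·h)(x) = g(h(x))
assert math.isclose(act(compose(g, h), 0.7), act(g, act(h, 0.7)))

# the group is non-commutative: g·h and h·g differ in the translation part
assert compose(g, h) != compose(h, g)</code></pre>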
<p>We have the following pair of group homomorphisms (known as a short exact sequence):</p>
<p><span class="math display">\[\mathbb{R}\hookrightarrow G\twoheadrightarrow \mathbb{R}_+ \]</span></p>
<p>Here <span class="math inline">\(\mathbb{R}\)</span> is the real numbers under addition, and <span class="math inline">\(\hookrightarrow\)</span> sends a number to a translation by that amount; on the other hand <span class="math inline">\(\mathbb{R}_{+}\)</span> is positive reals under multiplication, and <span class="math inline">\(\twoheadrightarrow\)</span> sends an affine map to its stretching factor.</p>
<p>These maps do not depend on any particular way of representing <span class="math inline">\(X\)</span> and <span class="math inline">\(G\)</span>. On the other hand, there are “backward maps” <span class="math inline">\(\mathbb{R}_+ \to G\)</span> and <span class="math inline">\(G\to \mathbb{R}\)</span>, but those depend on additional choices, like the choice to represent <span class="math inline">\(X\)</span> and <span class="math inline">\(G\)</span> in terms of vectors and matrices as above. (In group theory one says that the sequence above is split, and thus <span class="math inline">\(G\)</span> is a “semi-direct product of <span class="math inline">\(\mathbb{R}_+\)</span> acting on <span class="math inline">\(\mathbb{R}\)</span>”, or a “split extension of <span class="math inline">\(\mathbb{R}_+\)</span> by <span class="math inline">\(\mathbb{R}\)</span>”.)</p>
<p>Example 2: (Location-scale in higher dimensions) When <span class="math inline">\(X=\mathbb{A}^n\)</span> the location-scale group <span class="math inline">\(G\)</span> is the subgroup of those affine transforms of <span class="math inline">\(X\)</span> whose linear part is a pure rescaling by a (positive) factor. We then have <span class="math inline">\(\mathbb{R}^n \hookrightarrow G \twoheadrightarrow \mathbb{R}_+\)</span>.</p>
<hr />
<p>We are interested in the distribution of the data, i.e. in probability distributions over <span class="math inline">\(X\)</span>. Thus we consider a collection <span class="math inline">\(\mathcal{P}\)</span> of distributions over <span class="math inline">\(X\)</span>.</p>
<p><strong>Definition</strong>: The collection <span class="math inline">\(\mathcal{P}\)</span> is said to be invariant under the action of <span class="math inline">\(G\)</span> if for any <span class="math inline">\(p\in \mathcal{P}\)</span> and any <span class="math inline">\(g\in G\)</span> the pushforward distribution <span class="math inline">\(g_* p\)</span> is also in <span class="math inline">\(\mathcal{P}\)</span>.</p>
<p>Since <span class="math inline">\((g\cdot h)_* p=g_*( h_* p)\)</span>, this means that <span class="math inline">\(G\)</span> acts on <span class="math inline">\(\mathcal{P}\)</span> as well. If <span class="math inline">\(\mathcal{P}\)</span> is a parametric family, parametrized by a space <span class="math inline">\(\Theta\)</span>, then we conclude that <span class="math inline">\(G\)</span> acts on <span class="math inline">\(\Theta\)</span>.</p>
<hr />
<p>Example 1 continued:</p>
<ol type="a">
<li>Let <span class="math inline">\(\mathcal{P}\)</span> be the collection of all Gaussian distributions on <span class="math inline">\(\mathbb{A}^1\)</span>. Note that specifying an origin is not necessary in order to talk about the collection <span class="math inline">\(\mathcal{P}\)</span>. This collection is invariant under the action of <span class="math inline">\(G\)</span>. If we pick an origin, then we can use the mean <span class="math inline">\(\mu\)</span> and standard deviation <span class="math inline">\(\sigma\)</span> as parameters for <span class="math inline">\(\mathcal{P}\)</span>. We can also use the representation of <span class="math inline">\(G\)</span> by matrices that we have discussed above. Then <span class="math inline">\(\Theta=\mathbb{R}\times \mathbb{R}_+\)</span> is the parameter space and <span class="math inline">\(G\)</span> acts on it by sending <span class="math inline">\((\mu, \sigma)\)</span> to <span class="math inline">\((a\mu+b,a\sigma)\)</span>.</li>
</ol>
<p>(Together with <span class="math inline">\(x\)</span> being sent to <span class="math inline">\(ax+b\)</span> this appears as 12.30 in Jaynes.)</p>
<ol start="2" type="a">
<li><p>Let <span class="math inline">\(\mathcal{P}\)</span> be the collection of all Cauchy distributions on <span class="math inline">\(\mathbb{A}^1\)</span>; after a choice of the origin this is the family with pdfs <span class="math inline">\(\frac{1}{\pi \gamma \left[ 1+ \left(\frac{x-x_0}{\gamma}\right)^2 \right]}\)</span>. We no longer have the mean or standard deviation available as parameters, but we do have the location <span class="math inline">\(x_0\)</span> and scale <span class="math inline">\(\gamma\)</span>. They transform under <span class="math inline">\(G\)</span> by the same formulas as <span class="math inline">\((\mu, \sigma)\)</span> did before.</p></li>
<li><p>Let <span class="math inline">\(\mathcal{P}\)</span> be the collection of all mixtures of normal distributions on <span class="math inline">\(\mathbb{A}^1\)</span>. After picking the origin, this is the collection of distributions which can be written as <span class="math inline">\(p=\sum_{i=1}^m w_i \mathcal{N}(\mu_i, \sigma_i^2)\)</span> for some <span class="math inline">\(m\in \mathbb{N}\)</span> and <span class="math inline">\(w_i>0\)</span> with <span class="math inline">\(\sum w_i=1\)</span>. This collection is invariant under <span class="math inline">\(G\)</span>. When <span class="math inline">\(m\)</span> is fixed the subfamily <span class="math inline">\(\mathcal{P}_m\)</span> is parametric, with parameters being <span class="math inline">\(3m\)</span>-dimensional vectors <span class="math inline">\((\mu_i, \sigma_i, w_i)\)</span>. If <span class="math inline">\(m\)</span> is not fixed, however, the collection <span class="math inline">\(\mathcal{P}\)</span> is not parametric in the usual sense of the word.</p></li>
</ol>
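<p>The fact that the action on parameters mirrors the action on data can be checked numerically. A minimal sketch (my own illustration, not from the text): push a Gaussian pdf forward under <span class="math inline">\(x\mapsto ax+b\)</span> via the change-of-variables formula and compare with the Gaussian whose parameters are <span class="math inline">\((a\mu+b, a\sigma)\)</span>:</p>
<pre><code>import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def pushforward_pdf(pdf, a, b):
    # density of y = a x + b, by the change-of-variables formula
    return lambda y: pdf((y - b) / a) / a

mu, sigma, a, b = 0.5, 2.0, 3.0, -1.0
pushed = pushforward_pdf(lambda x: normal_pdf(x, mu, sigma), a, b)

# the pushforward is again Gaussian, with parameters (a*mu + b, a*sigma)
for y in [-2.0, 0.0, 1.3, 4.0]:
    assert math.isclose(pushed(y), normal_pdf(y, a * mu + b, a * sigma))</code></pre>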
<p>Example 3: Consider the family <span class="math inline">\(\mathcal{P}\)</span> of all Gamma distributions on <span class="math inline">\(X=\mathbb{R}_+\)</span>, with pdfs <span class="math inline">\(\Gamma_{(k, \theta)}(x)=\frac{1}{\Gamma(k)\theta} (\frac{x}{\theta})^{k-1} \exp(-\frac{x}{\theta})\)</span>. Here <span class="math inline">\(G=\mathbb{R}_+\)</span> acts on <span class="math inline">\(X\)</span> by multiplication, <span class="math inline">\(\mathcal{P}\)</span> is invariant, and the action of <span class="math inline">\(a\in G\)</span> sends <span class="math inline">\((k, \theta)\)</span> to <span class="math inline">\((k, a\theta)\)</span>.</p>
<hr />
<p>Now a <strong>prior</strong> is a probability distribution <span class="math inline">\(\pi\)</span> over <span class="math inline">\(\mathcal{P}\)</span>. This is easiest to understand when the collection <span class="math inline">\(\mathcal{P}\)</span> is parametric, so that we have <span class="math inline">\(\Theta \subset \mathbb{R}^d\)</span>. When <span class="math inline">\(\Theta\)</span> is open, we may, as usual, describe the prior by its pdf, which we will, by abuse of notation, denote by <span class="math inline">\(\pi(\theta)\)</span>. Since <span class="math inline">\(G\)</span> acts on <span class="math inline">\(\Theta\)</span>, this action will transform the prior; namely, given <span class="math inline">\(g \in G\)</span>, we obtain a new distribution <span class="math inline">\(g_*\pi\)</span>. When the action is differentiable, we have</p>
<p><span class="math display">\[[g_*\pi](\theta)=\pi(g^{-1}(\theta)) |J_{g^{-1}}(\theta)| \]</span></p>
<p>(Recall: If <span class="math inline">\(g(\psi)=\theta\)</span> then <span class="math inline">\(p(\psi) d\psi= p(g^{-1}(\theta)) \frac{d\psi}{d \theta} d\theta=p(g^{-1}(\theta)) |J_{g^{-1}}(\theta)| d\theta\)</span>.)</p>
<hr />
<p>Example 1 continued: For the location-scale group, <span class="math inline">\(g=\begin{pmatrix}a&b\\0&1 \end{pmatrix}\)</span> sends <span class="math inline">\(\theta=(m, s)\)</span> to <span class="math inline">\(g(\theta)=(am+b, as)\)</span>.</p>
<p>Then <span class="math inline">\(J_g(\theta)=\begin{pmatrix}a&0\\0&a\end{pmatrix}\)</span>, and so <span class="math inline">\(|J_g|=a^2\)</span>, <span class="math inline">\(|J_{g^{-1}}|=\frac{1}{a^2}\)</span>, giving</p>
<p><span class="math display">\[[g_*\pi](\theta)=\frac{1}{a^2}\pi(g^{-1}(\theta)). \]</span></p>
<hr />
<p>Now, one may argue that a transformation induced by <span class="math inline">\(g\)</span> is simply a “change of coordinates” and the problems of forming a prior about <span class="math inline">\(\theta\)</span> and about <span class="math inline">\(g(\theta)\)</span> are equivalent, and so we should posit</p>
<p><span class="math display">\[[g_*\pi](\theta)=\pi(\theta)\]</span></p>
<p>A prior satisfying this is called <strong>left invariant</strong>. For such a prior we have in the differentiable case</p>
<p><span class="math display">\[\pi(\theta)=\pi(g^{-1}(\theta)) |J_{g^{-1}}(\theta)| \]</span></p>
<p>In the cases where <span class="math inline">\(G\)</span> acts transitively on <span class="math inline">\(\Theta\)</span>, this specifies <span class="math inline">\(\pi\)</span> uniquely up to a constant: if we set <span class="math inline">\(\pi(\theta_0)=C\)</span> then for any <span class="math inline">\(\theta\)</span> there exists <span class="math inline">\(g\)</span> such that <span class="math inline">\(\theta=g(\theta_0)\)</span> and then</p>
<p><span class="math display">\[\pi(\theta)=\pi(g^{-1}(\theta)) |J_{g^{-1}}(\theta)|=C |J_{g^{-1}}(\theta)|\]</span></p>
<hr />
<p>Example 1 continued: If <span class="math inline">\(\pi\)</span> is left-invariant then, taking <span class="math inline">\(\theta_0=(m_0=0, s_0=1)\)</span> and given <span class="math inline">\(\theta=(m,s)\)</span>, we have <span class="math inline">\(g=\begin{pmatrix} s&m\\0&1\end{pmatrix}\)</span> and</p>
<p><span class="math display">\[\pi(m, s)= \frac{C}{s^2}.\]</span></p>
<p>(This agrees with 12.36 in Jaynes.)</p>
<hr />
<p>However, as pointed out by Berger (p.86) there is a logical flaw in requiring <span class="math inline">\([g_*\pi](\theta)=\pi(\theta)\)</span>. In fact, often the prior <span class="math inline">\(\pi\)</span> in question is improper, and so is only determined up to a constant. Thus we can only require a weaker equality</p>
<p><span class="math display">\[[g_*\pi](\theta)=K(g)\pi(\theta)\]</span></p>
<p>for some <span class="math inline">\(g\)</span>-dependent scaling function <span class="math inline">\(K(g)\)</span>.</p>
<p>We will call priors satisfying this “weakly” invariant, and the ones with <span class="math inline">\(K(g)=1\)</span> “strictly” invariant.</p>
<p>In the differentiable case this leads to</p>
<p><span class="math display">\[K(g)\pi(\theta)=\pi(g^{-1}(\theta)) |J_{g^{-1}}(\theta)|,\]</span></p>
<p>which can have many solutions.</p>
<hr />
<p>Example 1 continued: Take <span class="math inline">\(\pi(\theta)=C s^\alpha\)</span>, <span class="math inline">\(K(\begin{pmatrix}a&b\\0&1 \end{pmatrix})=a^{-(\alpha+2)}\)</span>. All of these solve the above equation.</p>
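<p>This family of solutions is easy to verify numerically. A sketch (my own check, not from the text): for each <span class="math inline">\(\alpha\)</span>, push <span class="math inline">\(\pi(m,s)=s^\alpha\)</span> forward and compare with <span class="math inline">\(K(g)\pi\)</span>; the case <span class="math inline">\(\alpha=-2\)</span> recovers the strictly invariant prior <span class="math inline">\(C/s^2\)</span>:</p>
<pre><code>import math

def g_inv(theta, a, b):
    # inverse of (m, s) -&gt; (a m + b, a s)
    m, s = theta
    return ((m - b) / a, s / a)

def pushforward_prior(pi, a, b, theta):
    # [g_* pi](theta) = pi(g^{-1} theta) |J_{g^{-1}}|, with |J_{g^{-1}}| = 1/a^2
    return pi(*g_inv(theta, a, b)) / a ** 2

for alpha in [-2.0, -1.0, 0.5]:
    pi = lambda m, s, al=alpha: s ** al
    K = lambda a, al=alpha: a ** (-(al + 2))  # K(g) for g = (a, b)
    for (a, b) in [(2.0, 1.0), (0.5, -3.0)]:
        for theta in [(0.0, 1.0), (2.0, 0.7)]:
            assert math.isclose(pushforward_prior(pi, a, b, theta), K(a) * pi(*theta))</code></pre>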
<!---
We will further abuse notation and denote $g=\begin{pmatrix} m&s\\0&1\end{pmatrix}$ by $(m, s)$ as well. We also define $k$ by $k(g^{-1})=K(g)$. Now plugging in $g^{-1}=(m,s)$
$$k(m, s)\pi(m', s')= \pi(m's+m,ss' )\frac{1}{s^2}.$$
--->
<hr />
<p>The question of how to choose an “invariant prior” (even when the transformation group is known) is hence not settled. One method is to use the framework of “invariant decision problems”, which suggests that, when the action of <span class="math inline">\(G\)</span> on <span class="math inline">\(\Theta\)</span> is “simply transitive”, one should use the <strong>right invariant</strong> prior. The story goes as follows:</p>
<p>Suppose that the group <span class="math inline">\(G\)</span> acts on <span class="math inline">\(\Theta\)</span> in such a way that, picking some starting <span class="math inline">\(\theta_0\)</span>, we have for any <span class="math inline">\(\theta\in \Theta\)</span> a unique <span class="math inline">\(g_\theta\in G\)</span> with <span class="math inline">\(g_\theta \theta_0=\theta\)</span>. Thus after making this choice of <span class="math inline">\(\theta_0\)</span> we can identify each <span class="math inline">\(\theta\)</span> with that unique <span class="math inline">\(g_\theta=f(\theta)\)</span>. (Note that this <span class="math inline">\(f\)</span> is a “map of <span class="math inline">\(G\)</span>-sets”: <span class="math inline">\(f(g\theta)=g\cdot f(\theta)\)</span>, where <span class="math inline">\(\cdot\)</span> is the multiplication in <span class="math inline">\(G\)</span>.)</p>
<p>Now a distribution/measure over <span class="math inline">\(\Theta\)</span> is a measure over <span class="math inline">\(G\)</span>. “Strictly” invariant measures on <span class="math inline">\(\Theta\)</span> correspond to what’s known as “left-invariant” measures on <span class="math inline">\(G\)</span> (they are invariant under all left multiplication maps <span class="math inline">\(L_g:G\to G\)</span> sending <span class="math inline">\(g'\in G\)</span> to <span class="math inline">\(L_g(g')=gg'\)</span>). It is a theorem that such a measure is unique up to scaling. There are, however, more “weakly” invariant measures (as we have seen), namely those that are “invariant up to a scaling function <span class="math inline">\(K(g)\)</span>”. Among those, there is a measure, unique up to scaling, which is called <strong>right-invariant</strong>: it is invariant under all maps <span class="math inline">\(R_g:G\to G\)</span> sending <span class="math inline">\(g'\)</span> to <span class="math inline">\(R_g(g')=g'g\)</span> (the uniqueness up to scale, and the fact that right-invariant measures are “weakly” left invariant, are basic facts of the theory of Haar measures on groups).</p>
<p>We illustrate the result on the location-scale example.</p>
<p>(We postpone the discussion of <strong>why</strong> the right-invariant prior is the best; see section 6.6 in Berger.)</p>
<hr />
<p>Example 1 continued: Using <span class="math inline">\(\theta_0=(m_0=0, s_0=1)\)</span> the map <span class="math inline">\(f:\Theta \to G\)</span> sends <span class="math inline">\((m, s)\)</span> to <span class="math inline">\(\begin{pmatrix}s&m\\0&1\end{pmatrix}\)</span>.</p>
<p>The map <span class="math inline">\(R_g=R_{(a,b)}\)</span> sends <span class="math inline">\(\begin{pmatrix}s&m\\0&1\end{pmatrix}\)</span> to</p>
<p><span class="math inline">\(\begin{pmatrix}s&m\\0&1\end{pmatrix}\begin{pmatrix}a&b\\0&1\end{pmatrix}=\begin{pmatrix}as&bs+m\\0&1\end{pmatrix}\)</span>, and thus has Jacobian matrix in <span class="math inline">\((s, m)\)</span> coordinates equal to <span class="math inline">\(\begin{pmatrix}a&0\\b&1\end{pmatrix}\)</span>, and determinant <span class="math inline">\(|J_{R_g}|=a\)</span>, and <span class="math inline">\(|J_{R_{g^{-1}}}|=\frac{1}{a}\)</span>.</p>
<p>The invariance condition <span class="math inline">\([{R_g}_*\pi](g')=\pi(g')\)</span> is then</p>
<p><span class="math display">\[ \pi(g') = \pi(g'g^{-1}) \frac{1}{a}\]</span> and, setting <span class="math inline">\(\pi(e)=C\)</span> and <span class="math inline">\(g'=g\)</span> we get</p>
<p><span class="math inline">\(\pi(g)=\frac{C}{a}\)</span>.</p>
<p>This is in fact the “invariant prior” 12.27.</p>
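<p>A quick numeric sketch (my own check, not from the text) that <span class="math inline">\(\pi(g)=C/a\)</span>, identified via the <span class="math inline">\((s,m)\)</span> coordinates above, really is invariant under the right multiplication maps:</p>
<pre><code>import math

def right_mult_inv(gp, a, b):
    # inverse of R_(a,b): (s, m) -&gt; (a s, b s + m), in (s, m) coordinates
    s, m = gp
    return (s / a, m - b * s / a)

def pi(s, m):
    return 1.0 / s  # the candidate right-invariant prior (C = 1)

for (a, b) in [(2.0, 1.0), (0.5, -3.0)]:
    jac_inv = 1.0 / a  # |J_{R_{g^{-1}}}|
    for gp in [(1.0, 0.0), (0.7, 2.0)]:
        pushed = pi(*right_mult_inv(gp, a, b)) * jac_inv
        assert math.isclose(pushed, pi(*gp))</code></pre>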
<p>So, to repeat what we said in the beginning, 12.27 is in fact “best” in this problem, but not because 12.18 (which has multiple issues) is better than 12.30, but because the deduction of 12.36 from 12.30 is flawed, and a better analysis of 12.30 (using the framework of invariant decision rules) does indeed lead to 12.27.</p>
<hr />
<p>So, why right invariant priors as opposed to left invariant ones? I don’t feel that I understand the answer completely. One thing I can say is that, roughly speaking, this may be considered to arise as follows. Suppose that you have a function of parameter and data <span class="math inline">\(f(x, \theta)\)</span> invariant under the action, i.e. <span class="math inline">\(f(x, \theta)=f(g x, g\theta)\)</span> (say, a loss function of some decision procedure). Then integrating “over <span class="math inline">\(X\)</span>”, <span class="math inline">\(\int f(x, \theta_0) dx\)</span>, can be rewritten as integrating “over <span class="math inline">\(G\)</span>” via <span class="math inline">\(\int f(gx_0, \theta) dg\)</span>, then as <span class="math inline">\(\int f(x_0, g^{-1}\theta) dg\)</span> and then, finally, as integration “over <span class="math inline">\(\Theta\)</span>”, i.e. <span class="math inline">\(\int f(x_0, g^{-1}\theta) d\theta\)</span>. The <span class="math inline">\(X\)</span>-to-<span class="math inline">\(G\)</span> transition “preserves left-invariance”, but the <span class="math inline">\(G\)</span>-to-<span class="math inline">\(\Theta\)</span> transition involves a <span class="math inline">\(g^{-1}\)</span> and “moves left-invariance to right-invariance”.</p>
<hr />
<h3 id="comments-on-12.4.3">Comments on 12.4.3</h3>
<p>The derivation in 12.4.3 is suspect. Apart from the dubious justification of invariance through “total confusion”, should we not ask why it is that <span class="math inline">\(a=\frac{p(E|Sx)}{p(E|Fx)}\)</span> is the same for every member of this imaginary population of individuals?</p>
<p>Would it not be better to say that the group that is acting is the translation group acting on log likelihoods (aka “evidence”)? I.e. that in terms of the log-odds parameter <span class="math inline">\(l=\log \frac{\theta}{1-\theta}\)</span> the (left and right) invariant prior is uniform? That is, the invariant prior is <span class="math inline">\(C\, dl\)</span>, which is <span class="math inline">\(C\, dl =C \frac{d l}{d\theta} d \theta= C \frac{1-\theta}{\theta} \frac{1}{(1-\theta)^2} d\theta=\frac{C}{\theta (1-\theta)} d\theta\)</span>.</p>
<p>(See Example 8 in Section 3.4.3 in Berger and Kevin Van Horn’s <a href="http://ksvanhorn.com/bayes/jaynes/node14.html">page</a> for further options and discussion, though I find Van Horn’s limiting procedure unjustifiable as well.)</p>
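<p>The derivative <span class="math inline">\(\frac{dl}{d\theta}=\frac{1}{\theta(1-\theta)}\)</span> behind this change of variables is easy to confirm numerically (my own sketch, not from the text), via a central finite difference:</p>
<pre><code>import math

def logit(theta):
    return math.log(theta / (1.0 - theta))

h = 1e-7
for theta in [0.1, 0.5, 0.9]:
    numeric = (logit(theta + h) - logit(theta - h)) / (2 * h)
    exact = 1.0 / (theta * (1.0 - theta))
    assert math.isclose(numeric, exact, rel_tol=1e-5)</code></pre>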
<h3 id="from-12.48">12.50 from 12.48</h3>
<p>Plug <span class="math inline">\(\theta=1/2\)</span> into 12.48 to get <span class="math inline">\(af(\frac{a}{1+a})=\frac{(1+a)^2}{4} f(1/2)\)</span>.</p>
<p>If <span class="math inline">\(\theta=\frac{a}{1+a}\)</span> then <span class="math inline">\((1+a)(1-\theta)=1\)</span>, so <span class="math inline">\(a=\frac{\theta}{1-\theta}\)</span> and <span class="math inline">\((1+a)=\frac{1}{1-\theta}\)</span>; plugging this in (and writing <span class="math inline">\(C=f(1/2)/4\)</span>) we have</p>
<p><span class="math display">\[ \frac{\theta}{1-\theta} f(\theta)=\frac{C}{(1-\theta)^2}\]</span> <span class="math display">\[f(\theta)=\frac{C}{\theta(1-\theta)}.\]</span></p>
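<p>One can also verify numerically (my own sketch, not from the text) that <span class="math inline">\(f(\theta)=\frac{C}{\theta(1-\theta)}\)</span> satisfies the <span class="math inline">\(\theta=1/2\)</span> specialization used above:</p>
<pre><code>import math

C = 1.0
def f(theta):
    return C / (theta * (1.0 - theta))

# a * f(a/(1+a)) = ((1+a)^2 / 4) * f(1/2) for every a
for a in [0.1, 1.0, 3.0, 10.0]:
    lhs = a * f(a / (1.0 + a))
    rhs = (1.0 + a) ** 2 / 4.0 * f(0.5)
    assert math.isclose(lhs, rhs)</code></pre>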
<h3 id="comments-on-12.4.4">Comments on 12.4.4</h3>
<p>It certainly <strong>does</strong> violence to Bertrand’s paradox to rephrase it in terms of throwing straws. The point of this paradox from the probability theory point of view is that saying “at random” is meaningless – one has to provide a probability distribution; replacing the words “at random” by the “throwing straws” procedure makes this point moot – the issue then becomes not the fact that the probability is unspecified, but, rather, that it is specified via a “physical” procedure, and one has to deduce the probability distribution from this procedure. This is what Jaynes proceeds to do. But this is an entirely different matter!</p>
<h3 id="from-12.61">12.62 from 12.61</h3>
<p>Differentiating 12.61 with respect to <span class="math inline">\(a\)</span> gives</p>
<p><span class="math display">\[2af(ar)+a^2 r f'(ar)=2\pi f(r)\, aR^2 f(aR) \]</span></p>
<p>Plugging in <span class="math inline">\(a=1\)</span> we get <span class="math inline">\(2f(r)+rf'(r)=2\pi f(r) R^2f(R)\)</span>. This can be solved directly, but since <span class="math inline">\(2\pi R^2 f(R)\)</span> is a constant we just say <span class="math inline">\(rf'(r)=Af(r)\)</span> for some constant <span class="math inline">\(A\)</span>, which implies <span class="math inline">\(f(r)=Br^A\)</span> (either use logarithmic differentiation <span class="math inline">\((\ln f)'=\frac{A}{r}\)</span> and integrate, or observe this is the 1-D <a href="https://en.wikipedia.org/wiki/Homogeneous_function#Euler's_homogeneous_function_theorem">Euler equation for homogeneous functions</a> of degree <span class="math inline">\(A\)</span>). Now we plug back into 12.61:</p>
<p><span class="math display">\[ a^2 B(ar)^A=2\pi Br^A \int_0^{aR} B u^{A+1} du\]</span></p>
<p><span class="math display">\[a^{2+A} B r^A= 2\pi B r^A B\frac{a^{A+2} R^{A+2}}{A+2}\]</span></p>
<p><span class="math display">\[1=B\frac{2\pi R^{A+2}}{A+2}\]</span></p>
<p>Setting <span class="math inline">\(q=A+2\)</span> we get <span class="math inline">\(B=\frac{q}{2\pi R^q}\)</span>, and <span class="math display">\[f(r)=\frac{qr^{q-2}}{2\pi R^q}\]</span> i.e. 12.62. We also see that this function does satisfy 12.61.</p>
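<p>As a final sanity check (my own sketch, not from the text), one can verify numerically that <span class="math inline">\(f(r)=\frac{qr^{q-2}}{2\pi R^q}\)</span> satisfies the integral form of 12.61 used above, <span class="math inline">\(a^2 f(ar)=2\pi f(r)\int_0^{aR} u f(u)\, du\)</span>, using a simple midpoint rule for the integral:</p>
<pre><code>import math

def check(q, R, a, r, n=20000):
    f = lambda u: q * u ** (q - 2) / (2 * math.pi * R ** q)
    # midpoint rule for the integral of u*f(u) over [0, a R]
    h = a * R / n
    integral = sum((i + 0.5) * h * f((i + 0.5) * h) * h for i in range(n))
    lhs = a ** 2 * f(a * r)
    rhs = 2 * math.pi * f(r) * integral
    assert math.isclose(lhs, rhs, rel_tol=1e-6)

for a in [0.5, 2.0]:
    for r in [0.3, 0.9]:
        check(q=3, R=1.0, a=a, r=r)</code></pre>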
</body>
</html>