
Commit 44cc75c

🚀 Deploy updated DGM site (2025-11-30 16:19)
1 parent 9527971 commit 44cc75c

File tree: 10 files changed (+884, -148 lines)

Binary file not shown.

dgm-fall-2025/homework/index.html

Lines changed: 1 addition & 0 deletions
@@ -96,6 +96,7 @@ <h2 class="post-description"></h2>
 <li>HW3 (<a href="/dgm-fall-2025/assets/hw/hw3/STAT453_hw03.zip">zip</a>) due Friday, October 17th 11:59PM to Canvas.</li>
 <li>HW4 (<a href="https://colab.research.google.com/drive/1Jm2hrqbikyTC221moR9nfuaAoP638YJl?usp=sharing">Colab</a>) due Friday, November 21st 11:59PM to Canvas.</li>
 <li>HW5 (<a href="https://colab.research.google.com/drive/1A8y1FfcrSb5O0HqI5oLXc1gZet4n4JtM?usp=sharing">Colab</a>) due Sunday, December 7th 11:59PM to Canvas.</li>
+<li><a href="/dgm-fall-2025/assets/hw/Stat453_F2025_ExamStudyGuide.pdf">Exam Study Guide</a> released.</li>
 </ul>
 </article>
 </div>

dgm-fall-2025/notes/index.html

Lines changed: 11 additions & 11 deletions
@@ -93,6 +93,16 @@ <h2>The notes written by students and edited by instructors</h2>
 
 <ul class="post-list">
 
+<li >
+<p class="post-meta">November 29, 2025</p>
+<h2>
+<a class="post-title" href="/dgm-fall-2025/notes/lecture-22/"
+>Lecture 22</a
+>
+</h2>
+<p>Unsupervised Training of LLMs</p>
+</li>
+
 <li >
 <p class="post-meta">November 17, 2025</p>
 <h2>
@@ -173,7 +183,7 @@ <h2>
 <p>Improving Optimization</p>
 </li>
 
-<li >
+<li style="border-bottom: none;" >
 <p class="post-meta">October 13, 2025</p>
 <h2>
 <a class="post-title" href="/dgm-fall-2025/notes/lecture-10/"
@@ -183,16 +193,6 @@ <h2>
 <p>Regularization and Generalization</p>
 </li>
 
-<li style="border-bottom: none;" >
-<p class="post-meta">October 8, 2025</p>
-<h2>
-<a class="post-title" href="/dgm-fall-2025/notes/lecture-11/"
->Lecture 11</a
->
-</h2>
-<p>Normalization / Initialization</p>
-</li>
-
 </ul>
 

dgm-fall-2025/notes/lecture-15/index.html

Lines changed: 16 additions & 9 deletions
@@ -68,7 +68,7 @@
 "editors": [
 
 {
-"editor": "-"
+"editor": "Jeff Zhang"
 }
 
 ],
@@ -286,8 +286,9 @@ <h2 id="4-example-generative-model-naive-bayes">4. Example Generative Model: Nai
 </ul>
 </li>
 <li>Estimate
+<!-- - $\hat{\mu} , \hat{\sigma} = argmax_{\mu , \sigma} P(X \mid Y)$ -->
 <ul>
-<li>$\hat{\mu} , \hat{\sigma} = argmax_{\mu , \sigma} P(X \mid Y)$</li>
+<li>$\hat{\mu}, \hat{\sigma} = \arg\max_{\mu, \sigma} P(X \mid Y)$</li>
 </ul>
 </li>
 <li>Calculate
@@ -314,10 +315,12 @@ <h2 id="5-discriminative-vs-generative-models">5. Discriminative vs Generative M
 
 <ul>
 <li>Discriminative models optimize the conditional likelihood:
-$\hat{\theta_{disc}} = argmax_{\theta} P(Y \mid X; \theta) = argmax_{\theta} \frac{P(X \mid Y ; \theta) P(Y; \theta)}{P(X; \theta)}$</li>
+<!--$\hat{\theta_{disc}} = argmax_{\theta} P(Y \mid X; \theta) = argmax_{\theta} \frac{P(X \mid Y ; \theta) P(Y; \theta)}{P(X; \theta)}$-->
+$\widehat{\theta_{disc}} = argmax_{\theta} P(Y \mid X; \theta) = argmax_{\theta} \frac{P(X \mid Y ; \theta) P(Y; \theta)}{P(X; \theta)}$</li>
 <li>
 <p>Generative models optimize the joint likelihood:
-$\hat{\theta_{disc}} = argmax_{\theta} P(X, Y; \theta) = argmax_{\theta} P(X \mid Y ; \theta) P(Y; \theta)$</p>
+<!--$\hat{\theta_{disc}} = argmax_{\theta} P(X, Y; \theta) = argmax_{\theta} P(X \mid Y ; \theta) P(Y; \theta)$-->
+$\widehat{\theta_{disc}} = argmax_{\theta} P(X, Y; \theta) = argmax_{\theta} P(X \mid Y ; \theta) P(Y; \theta)$</p>
 
 <p>This means they are exactly the same optimization when $P(X; \theta)$ is invariant to $\theta$</p>
 </li>
@@ -329,17 +332,20 @@ <h2 id="5-discriminative-vs-generative-models">5. Discriminative vs Generative M
 
 <h2 id="6-logistic-regression-vs-naive-bayes">6. Logistic Regression vs Naive Bayes</h2>
 
-<h3 id="logistic-regression">Logistic Regression**</h3>
+<!--### Logistic Regression**-->
+<h3 id="logistic-regression">Logistic Regression</h3>
 <ul>
 <li><strong>Type</strong> : Discriminative</li>
 <li>It directly models the decision boundary: $P(Y \mid X; \theta)$</li>
 <li>It does <strong>not</strong> models how data is generated only the probability that a given (X) belongs to a certain class (Y )</li>
 <li>Defines : $P(Y \mid X; \theta) = \sigma(\theta^T X) = \frac{1}{1 + e^{-\theta^T X}}$</li>
 <li>Estimates:
+<!--Parameters are learned by maximizing the conditional likelihood: $\hat{\theta_{lr}} = argmax_{\theta} P(Y \mid X; \theta)$-->
 <ul>
-<li>Parameters are learned by maximizing the conditional likelihood: $\hat{\theta_{lr}} = argmax_{\theta} P(Y \mid X; \theta)$</li>
+<li>Parameters are learned by maximizing the conditional likelihood: $\widehat{\theta_{lr}} = argmax_{\theta} P(Y \mid X; \theta)$</li>
 <li>Or equivalently, by maximizing the <strong>log-likelihood</strong>:
-$\hat{\theta_{lr}} = \arg\max_{\theta} \sum_i \left[Y_i \log \sigma(\theta^{T} X_i) + (1 - Y_i) \log \left(1 - \sigma(\theta^{T} X_i)\right)\right]$</li>
+<!--$\hat{\theta_{lr}} = \arg\max_{\theta} \sum_i \left[Y_i \log \sigma(\theta^{T} X_i) + (1 - Y_i) \log \left(1 - \sigma(\theta^{T} X_i)\right)\right]$-->
+$\widehat{\theta_{lr}} = \arg\max_{\theta} \sum_i \left[Y_i \log \sigma(\theta^{T} X_i) + (1 - Y_i) \log \left(1 - \sigma(\theta^{T} X_i)\right)\right]$</li>
 </ul>
 </li>
 <li>Properties:
@@ -362,8 +368,9 @@ <h3 id="naive-bayes">Naive Bayes</h3>
 <li>Using Bayes’ rule, we can derive the posterior: $P(Y \mid X) = \frac{P(X \mid Y) \, P(Y)}{P(X)}$</li>
 </ul>
 </li>
-<li>Assumption: <strong>Conditional independence assumption</strong> $P(X \mid Y) = \prod_{j=1}^{d} P(X_j \mid Y)$</li>
-<li>Estimates: Parameters are learned by maximizing the <strong>joint likelihood</strong>: $\hat{\theta_{NB}} = \arg\max_{\theta} P(X, Y; \theta)$</li>
+<li>Assumption: <strong>Conditional independence assumption</strong> $P(X \mid Y) = \prod_{j=1}^{d} P(X_j \mid Y)$
+<!--- Estimates: Parameters are learned by maximizing the **joint likelihood**: $\hat{\theta_{NB}} = \arg\max_{\theta} P(X, Y; \theta)$--></li>
+<li>Estimates: Parameters are learned by maximizing the <strong>joint likelihood</strong>: $\widehat{\theta_{NB}} = \arg\max_{\theta} P(X, Y; \theta)$</li>
 <li>Properties:
 <ul>
 <li><strong>Higher asymptotic error</strong> : Because the independence assumption is not always true, it can produce biased estimates when data features are correlated</li>

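A minimal NumPy sketch of the two estimators contrasted in the lecture-15 notes above: gradient ascent on the logistic-regression conditional log-likelihood, and closed-form Gaussian naive Bayes fits of the joint likelihood. The function names, step size, and synthetic data are illustrative assumptions, not part of the committed page.

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, steps=2000):
    # Gradient ascent on sum_i [y_i log s(th^T x_i) + (1 - y_i) log(1 - s(th^T x_i))].
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))      # sigma(theta^T x)
        theta += lr * X.T @ (y - p) / len(y)      # gradient of the conditional log-likelihood
    return theta

def fit_gaussian_naive_bayes(X, y):
    # Closed-form MLEs of the joint likelihood: per-class means/stds plus class priors P(Y).
    return {c: (X[y == c].mean(axis=0),
                X[y == c].std(axis=0) + 1e-9,
                np.mean(y == c))
            for c in np.unique(y)}

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
theta_lr = fit_logistic_regression(np.hstack([X, np.ones((len(X), 1))]), y)  # bias column appended
nb_params = fit_gaussian_naive_bayes(X, y)
```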
dgm-fall-2025/notes/lecture-16/index.html

Lines changed: 58 additions & 35 deletions
@@ -72,7 +72,7 @@
 "editors": [
 
 {
-"editor": "Assistant Editor"
+"editor": "Jeff Zhang"
 }
 
 ],
@@ -153,7 +153,25 @@ <h1>Lecture 16 - Autoencoders and Variational Autoencoders (VAEs)</h1>
 
 <d-byline></d-byline>
 
-<d-article> <h2 id="1-introduction-and-overview">1. Introduction and Overview</h2>
+<d-article> <h2 id="todays-topics">Today’s Topics:</h2>
+<ul>
+<li><a href="#todays-topics">Today’s Topics:</a></li>
+<li><a href="#1-introduction-and-overview">1. Introduction and Overview</a></li>
+<li><a href="#2-discriminative-vs-generative-modeling">2. Discriminative vs. Generative Modeling</a></li>
+<li><a href="#3-deep-generative-models-dgms">3. Deep Generative Models (DGMs)</a></li>
+<li><a href="#4-autoencoders-concept-and-motivation">4. Autoencoders: Concept and Motivation</a></li>
+<li><a href="#5-architecture-of-an-autoencoder">5. Architecture of an Autoencoder</a></li>
+<li><a href="#6-autoencoders-vs-pca">6. Autoencoders vs PCA</a></li>
+<li><a href="#7-autoencoder-variants">7. Autoencoder Variants</a></li>
+<li><a href="#8-variational-autoencoders-vaes-theory-and-intuition">8. Variational Autoencoders (VAEs): Theory and Intuition</a></li>
+<li><a href="#9-the-reparameterization-trick">9. The Reparameterization Trick</a></li>
+<li><a href="#10-generating-new-samples-with-vaes">10. Generating New Samples with VAEs</a></li>
+<li><a href="#11-applications">11. Applications</a></li>
+<li><a href="#12-summary-and-takeaways">12. Summary and Takeaways</a></li>
+<li><a href="#references">References</a></li>
+</ul>
+
+<h2 id="1-introduction-and-overview">1. Introduction and Overview</h2>
 
 <p>This lecture focuses on <strong>Deep Generative Models (DGMs)</strong> — models designed to learn the underlying data distribution, enabling both prediction and generation of new samples.
 We move from <strong>discriminative models</strong>, which model $P(Y|X)$, to <strong>generative models</strong>, which model $P(X, Y)$ or $P(X)$.</p>
@@ -162,7 +180,7 @@ <h1>Lecture 16 - Autoencoders and Variational Autoencoders (VAEs)</h1>
 
 <hr />
 
-<h2 id="2-discriminative-vs-generative-modeling">2. Discriminative vs. Generative Modeling</h2>
+<h2 id="2-discriminative-vs-generative-modeling">2. Discriminative vs Generative Modeling</h2>
 
 <table>
 <thead>
@@ -171,32 +189,39 @@ <h2 id="2-discriminative-vs-generative-modeling">2. Discriminative vs. Generativ
 <th>Learns</th>
 <th>Objective</th>
 <th>Examples</th>
-<th> </th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><strong>Discriminative</strong></td>
-<td>$P(Y</td>
-<td>X)$</td>
+<td>$P(Y \mid X)$</td>
 <td>Classify or predict outcomes</td>
 <td>Logistic Regression, CNNs</td>
 </tr>
 <tr>
 <td><strong>Generative</strong></td>
-<td>$P(X, Y)$ or $P(X</td>
-<td>Y)$</td>
+<td>$P(X, Y)$ or $P(X \mid Y)$</td>
 <td>Model data generation process</td>
 <td>Autoencoders, VAEs, GANs</td>
 </tr>
 </tbody>
 </table>
 
+<!-- format modified by JZ -->
+
 <p>In generative models, <strong>latent variables</strong> $z$ represent hidden structure in the data, making the following computations challenging:</p>
 
-<p>\(p_\theta(x) = \int p_\theta(x, z) \, dz\)
-\(p_\theta(z|x) \propto p_\theta(x|z) p(z)\)</p>
+<!--$$ p_\theta(x) = \int p_\theta(x, z) \, dz $$
+$$ p_\theta(z|x) \propto p_\theta(x|z) p(z) $$ -->
+
+<d-math block="">
+p_\theta(x) = \int p_\theta(x, z)\, dz
+</d-math>
 
+<d-math block="">
+p_\theta(z \mid x) \propto p_\theta(x \mid z)\, p(z)
+</d-math>
+<!-- format modified by JZ -->
 <p>Because $z$ is unobserved, both the marginal likelihood and posterior inference are intractable in complex data, requiring <strong>approximate inference</strong>.</p>
 
 <hr />
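A rough illustration of why the marginal likelihood in the hunk above is hard to compute: a naive Monte Carlo estimate of p_theta(x) = ∫ p_theta(x, z) dz samples z from the prior and averages p_theta(x | z), which needs a prohibitive number of samples for realistic models. The toy Gaussian decoder below is an assumption for illustration only.

```python
import numpy as np

def log_p_x_given_z(x, z, W):
    # Toy Gaussian decoder: x | z ~ N(W z, I). Illustrative only.
    mean = W @ z
    return -0.5 * np.sum((x - mean) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)

def naive_log_marginal_likelihood(x, W, n_samples=10_000, latent_dim=2, seed=0):
    # log p(x) = log E_{z ~ p(z)} [ p(x | z) ], estimated from prior samples.
    rng = np.random.default_rng(seed)
    zs = rng.standard_normal((n_samples, latent_dim))
    log_ps = np.array([log_p_x_given_z(x, z, W) for z in zs])
    m = log_ps.max()                                   # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_ps - m)))

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])    # maps 2-D z to 3-D x
x = np.array([0.3, -1.2, 0.7])
print(naive_log_marginal_likelihood(x, W))             # high-variance estimate of log p(x)
```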
@@ -211,13 +236,14 @@ <h2 id="3-deep-generative-models-dgms">3. Deep Generative Models (DGMs)</h2>
 <li>Learn probabilistic mappings between $x$ and $z$.</li>
 <li>Use neural networks for non-linear transformations.</li>
 <li>Combine deep learning’s representational power with probabilistic reasoning.</li>
+<li>Latent variable $z$ capture hidden structure that explains high-dimensional observations $x$. <!-- add by JZ --></li>
 </ul>
 
 <hr />
 
 <h2 id="4-autoencoders-concept-and-motivation">4. Autoencoders: Concept and Motivation</h2>
 
-<p>An <strong>Autoencoder (AE)</strong> is an unsupervised neural network trained to reproduce its input.
+<p>An <strong>Autoencoder (AE)</strong> is an <strong>unsupervised</strong> (no labeled) neural network trained to reproduce its input.
 It compresses the input into a <strong>latent representation (code)</strong> and reconstructs the input from this compressed form.</p>
 
 <p><strong>Applications:</strong></p>
@@ -240,8 +266,12 @@ <h2 id="5-architecture-of-an-autoencoder">5. Architecture of an Autoencoder</h2>
 </ul>
 
 <p><strong>Training objective:</strong>
-\(L(x, \hat{x}) = ||x - \hat{x}||^2\)</p>
+<!--$$ L(x, \hat{x}) = ||x - \hat{x}||^2 $$--></p>
 
+<d-math block="">
+L(x, \hat{x}) = ||x - \hat{x}||^2
+</d-math>
+<!-- format modified by JZ -->
 <p>By minimizing reconstruction loss, the network learns to capture meaningful low-dimensional structure.</p>
 
 <hr />
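For context on the reconstruction objective in the hunk above, a minimal PyTorch sketch of an autoencoder trained with the squared-error loss L(x, x_hat) = ||x - x_hat||^2; layer sizes, optimizer settings, and the random stand-in batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))   # x -> code z
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))    # z -> reconstruction x_hat

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                        # stand-in batch; real data would come from a DataLoader
x_hat = model(x)
loss = ((x - x_hat) ** 2).sum(dim=1).mean()    # squared reconstruction error
loss.backward()
optimizer.step()
```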
@@ -279,42 +309,31 @@ <h3 id="74-variational-autoencoders-vaes">7.4 Variational Autoencoders (VAEs)</h
 
 <h2 id="8-variational-autoencoders-vaes-theory-and-intuition">8. Variational Autoencoders (VAEs): Theory and Intuition</h2>
 
+<!-- format of this section was modified by JZ to address render error. No content changed-->
 <p>VAEs model the data generation process as:</p>
 
 <ol>
 <li>Sample latent variable $z \sim p(z)$ (e.g., $\mathcal{N}(0, I)$).</li>
-<li>
-<table>
-<tbody>
-<tr>
-<td>Generate data $x$ from conditional distribution $p_\theta(x</td>
-<td>z)$.</td>
-</tr>
-</tbody>
-</table>
-</li>
+<li>Generate data $x$ from conditional distribution $p_\theta(x \mid z)$. <!-- format modified by JZ --></li>
 </ol>
 
-<table>
-<tbody>
-<tr>
-<td>The encoder approximates the posterior $q_\phi(z</td>
-<td>x)$ using neural networks, producing mean and variance parameters $(\mu, \sigma)$.</td>
-</tr>
-</tbody>
-</table>
+<p>The encoder approximates the posterior $q_\phi(z \mid x)$ using neural networks, producing mean and variance parameters $(\mu, \sigma)$.</p>
 
 <p>We cannot compute $\log p_\theta(x)$ directly because the marginalization over $z$ is intractable.<br />
-By introducing an approximate posterior $q_\phi(z|x)$ and applying Jensen’s inequality:</p>
+By introducing an approximate posterior $q_\phi(z \mid x)$ and applying Jensen’s inequality:</p>
 
-\[\log p_\theta(x) = \log \int p_\theta(x,z)dz
-\ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)||p(z))\]
+<d-math block="">
+\log p_\theta(x) = \log \int p_\theta(x,z)dz
+\ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)||p(z))
+</d-math>
 
 <p>This lower bound is the <strong>Evidence Lower Bound (ELBO)</strong>, which VAEs maximize during training.</p>
 
 <p><strong>Objective (Evidence Lower Bound - ELBO):</strong></p>
 
-\[\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))\]
+<d-math block="">
+\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))
+</d-math>
 
 <ul>
 <li>First term → reconstruction accuracy</li>
@@ -327,7 +346,11 @@ <h2 id="9-the-reparameterization-trick">9. The Reparameterization Trick</h2>
 
 <p>To allow gradients to flow through random sampling, we use:</p>
 
-\[z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
+<!-- format of this section was modified by JZ to address render error. No content changed-->
+
+<d-math block="">
+z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
+</d-math>
 
 <p>This separates stochasticity from the deterministic part, making training possible with backpropagation.</p>
 
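To make the ELBO and reparameterization terms in the hunks above concrete, a minimal PyTorch sketch of a VAE forward pass; the single-layer encoder/decoder and the Gaussian decoder likelihood are illustrative assumptions. The KL term uses the standard closed form for a diagonal Gaussian against N(0, I).

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(input_dim, 2 * latent_dim)  # outputs mu and log-variance
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        z = mu + std * eps                                    # reparameterization: z = mu + sigma * eps
        x_hat = self.decoder(z)
        recon = -((x - x_hat) ** 2).sum(dim=-1)               # log p(x|z) up to a constant (Gaussian decoder)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)  # D_KL(q(z|x) || N(0, I))
        elbo = recon - kl
        return (-elbo).mean()                                 # negative ELBO as the training loss

loss = VAE()(torch.rand(8, 784))
loss.backward()
```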
dgm-fall-2025/notes/lecture-20/index.html

Lines changed: 12 additions & 1 deletion
@@ -64,7 +64,7 @@
 "editors": [
 
 {
-"editor": ""
+"editor": "Xinyu Pan"
 }
 
 ],
@@ -150,6 +150,11 @@ <h1>Lecture 20</h1>
 <h2 id="the-attention-mechanism">The Attention Mechanism</h2>
 
 <p><strong>Motivation:</strong> Different parts of our input relate to different parts of our output. Sometimes these important relationships can be far apart, like in machine translation. Attention helps us dynamically calculate what is important.</p>
+<h3 id="why-attention-was-needed-long-range-dependency-problem">Why Attention Was Needed (Long-Range Dependency Problem)</h3>
+<p>RNNs compress an entire input sequence into a <strong>single hidden vector</strong>, which makes capturing long-range dependencies difficult.<br />
+During backpropagation, gradients must pass through many time steps, causing <strong>vanishing/exploding gradients</strong>.</p>
+
+<p>Attention solves this by <strong>directly referencing the entire input sequence</strong> when predicting each output, instead of relying on hidden states to store all information.</p>
 
 <ul>
 <li><strong>Origin:</strong> Originally from Natural Language Processing (NLP) and language translation.</li>
@@ -193,6 +198,12 @@ <h4 id="soft-attention-vs-rnn-for-image-captioning">Soft Attention vs. RNN for I
 <p><strong>Aside:</strong> CNNs were an example of <strong>Hard Attention</strong>. As the filter slides over the image, the part of the image inside the filter gets attention weight 1, and the rest gets weight 0.</p>
 </blockquote>
 
+<h3 id="why-attention-reduces-the-need-for-recurrence">Why Attention Reduces the Need for Recurrence</h3>
+<ul>
+<li>Attention repeatedly refers back to the input, so the hidden state no longer needs to store all global information.</li>
+<li>This insight led to eliminating recurrence entirely in the Transformer.</li>
+</ul>
+
 <hr />
 
 <h2 id="self-attention">Self Attention</h2>
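A minimal NumPy sketch of the scaled dot-product attention that the lecture-20 notes build toward; shapes and random inputs are illustrative assumptions only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the input positions
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 query positions, d_k = 8
K = rng.standard_normal((6, 8))   # 6 input positions
V = rng.standard_normal((6, 8))
out = scaled_dot_product_attention(Q, K, V)   # each output row is a weighted mix of all 6 inputs
```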
