
Commit 44cc75c

🚀 Deploy updated DGM site (2025-11-30 16:19)
1 parent 9527971 commit 44cc75c

File tree: 10 files changed (+884, -148 lines)

Binary file not shown.

dgm-fall-2025/homework/index.html

Lines changed: 1 addition & 0 deletions
@@ -96,6 +96,7 @@ <h2 class="post-description"></h2>
 <li>HW3 (<a href="/dgm-fall-2025/assets/hw/hw3/STAT453_hw03.zip">zip</a>) due Friday, October 17th 11:59PM to Canvas.</li>
 <li>HW4 (<a href="https://colab.research.google.com/drive/1Jm2hrqbikyTC221moR9nfuaAoP638YJl?usp=sharing">Colab</a>) due Friday, November 21st 11:59PM to Canvas.</li>
 <li>HW5 (<a href="https://colab.research.google.com/drive/1A8y1FfcrSb5O0HqI5oLXc1gZet4n4JtM?usp=sharing">Colab</a>) due Sunday, December 7th 11:59PM to Canvas.</li>
+<li><a href="/dgm-fall-2025/assets/hw/Stat453_F2025_ExamStudyGuide.pdf">Exam Study Guide</a> released.</li>
 </ul>
 </article>
 </div>

dgm-fall-2025/notes/index.html

Lines changed: 11 additions & 11 deletions
@@ -93,6 +93,16 @@ <h2>The notes written by students and edited by instructors</h2>
 
 <ul class="post-list">
 
+<li >
+<p class="post-meta">November 29, 2025</p>
+<h2>
+<a class="post-title" href="/dgm-fall-2025/notes/lecture-22/"
+>Lecture 22</a
+>
+</h2>
+<p>Unsupervised Training of LLMs</p>
+</li>
+
 <li >
 <p class="post-meta">November 17, 2025</p>
 <h2>
@@ -173,7 +183,7 @@ <h2>
 <p>Improving Optimization</p>
 </li>
 
-<li >
+<li style="border-bottom: none;" >
 <p class="post-meta">October 13, 2025</p>
 <h2>
 <a class="post-title" href="/dgm-fall-2025/notes/lecture-10/"
@@ -183,16 +193,6 @@ <h2>
 <p>Regularization and Generalization</p>
 </li>
 
-<li style="border-bottom: none;" >
-<p class="post-meta">October 8, 2025</p>
-<h2>
-<a class="post-title" href="/dgm-fall-2025/notes/lecture-11/"
->Lecture 11</a
->
-</h2>
-<p>Normalization / Initialization</p>
-</li>
-
 </ul>
 

dgm-fall-2025/notes/lecture-15/index.html

Lines changed: 16 additions & 9 deletions
@@ -68,7 +68,7 @@
 "editors": [
 
 {
-"editor": "-"
+"editor": "Jeff Zhang"
 }
 
 ],
@@ -286,8 +286,9 @@ <h2 id="4-example-generative-model-naive-bayes">4. Example Generative Model: Nai
 </ul>
 </li>
 <li>Estimate
+<!-- - $\hat{\mu} , \hat{\sigma} = argmax_{\mu , \sigma} P(X \mid Y)$ -->
 <ul>
-<li>$\hat{\mu} , \hat{\sigma} = argmax_{\mu , \sigma} P(X \mid Y)$</li>
+<li>$\hat{\mu}, \hat{\sigma} = \arg\max_{\mu, \sigma} P(X \mid Y)$</li>
 </ul>
 </li>
 <li>Calculate
@@ -314,10 +315,12 @@ <h2 id="5-discriminative-vs-generative-models">5. Discriminative vs Generative M
 
 <ul>
 <li>Discriminative models optimize the conditional likelihood:
-$\hat{\theta_{disc}} = argmax_{\theta} P(Y \mid X; \theta) = argmax_{\theta} \frac{P(X \mid Y ; \theta) P(Y; \theta)}{P(X; \theta)}$</li>
+<!--$\hat{\theta_{disc}} = argmax_{\theta} P(Y \mid X; \theta) = argmax_{\theta} \frac{P(X \mid Y ; \theta) P(Y; \theta)}{P(X; \theta)}$-->
+$\widehat{\theta_{disc}} = argmax_{\theta} P(Y \mid X; \theta) = argmax_{\theta} \frac{P(X \mid Y ; \theta) P(Y; \theta)}{P(X; \theta)}$</li>
 <li>
 <p>Generative models optimize the joint likelihood:
-$\hat{\theta_{disc}} = argmax_{\theta} P(X, Y; \theta) = argmax_{\theta} P(X \mid Y ; \theta) P(Y; \theta)$</p>
+<!--$\hat{\theta_{disc}} = argmax_{\theta} P(X, Y; \theta) = argmax_{\theta} P(X \mid Y ; \theta) P(Y; \theta)$-->
+$\widehat{\theta_{disc}} = argmax_{\theta} P(X, Y; \theta) = argmax_{\theta} P(X \mid Y ; \theta) P(Y; \theta)$</p>
 
 <p>This means they are exactly the same optimization when $P(X; \theta)$ is invariant to $\theta$</p>
 </li>
@@ -329,17 +332,20 @@ <h2 id="5-discriminative-vs-generative-models">5. Discriminative vs Generative M
 
 <h2 id="6-logistic-regression-vs-naive-bayes">6. Logistic Regression vs Naive Bayes</h2>
 
-<h3 id="logistic-regression">Logistic Regression**</h3>
+<!--### Logistic Regression**-->
+<h3 id="logistic-regression">Logistic Regression</h3>
 <ul>
 <li><strong>Type</strong> : Discriminative</li>
 <li>It directly models the decision boundary: $P(Y \mid X; \theta)$</li>
 <li>It does <strong>not</strong> models how data is generated only the probability that a given (X) belongs to a certain class (Y )</li>
 <li>Defines : $P(Y \mid X; \theta) = \sigma(\theta^T X) = \frac{1}{1 + e^{-\theta^T X}}$</li>
 <li>Estimates:
+<!--Parameters are learned by maximizing the conditional likelihood: $\hat{\theta_{lr}} = argmax_{\theta} P(Y \mid X; \theta)$-->
 <ul>
-<li>Parameters are learned by maximizing the conditional likelihood: $\hat{\theta_{lr}} = argmax_{\theta} P(Y \mid X; \theta)$</li>
+<li>Parameters are learned by maximizing the conditional likelihood: $\widehat{\theta_{lr}} = argmax_{\theta} P(Y \mid X; \theta)$</li>
 <li>Or equivalently, by maximizing the <strong>log-likelihood</strong>:
-$\hat{\theta_{lr}} = \arg\max_{\theta} \sum_i \left[Y_i \log \sigma(\theta^{T} X_i) + (1 - Y_i) \log \left(1 - \sigma(\theta^{T} X_i)\right)\right]$</li>
+<!--$\hat{\theta_{lr}} = \arg\max_{\theta} \sum_i \left[Y_i \log \sigma(\theta^{T} X_i) + (1 - Y_i) \log \left(1 - \sigma(\theta^{T} X_i)\right)\right]$-->
+$\widehat{\theta_{lr}} = \arg\max_{\theta} \sum_i \left[Y_i \log \sigma(\theta^{T} X_i) + (1 - Y_i) \log \left(1 - \sigma(\theta^{T} X_i)\right)\right]$</li>
 </ul>
 </li>
 <li>Properties:
@@ -362,8 +368,9 @@ <h3 id="naive-bayes">Naive Bayes</h3>
 <li>Using Bayes’ rule, we can derive the posterior: $P(Y \mid X) = \frac{P(X \mid Y) \, P(Y)}{P(X)}$</li>
 </ul>
 </li>
-<li>Assumption: <strong>Conditional independence assumption</strong> $P(X \mid Y) = \prod_{j=1}^{d} P(X_j \mid Y)$</li>
-<li>Estimates: Parameters are learned by maximizing the <strong>joint likelihood</strong>: $\hat{\theta_{NB}} = \arg\max_{\theta} P(X, Y; \theta)$</li>
+<li>Assumption: <strong>Conditional independence assumption</strong> $P(X \mid Y) = \prod_{j=1}^{d} P(X_j \mid Y)$
+<!--- Estimates: Parameters are learned by maximizing the **joint likelihood**: $\hat{\theta_{NB}} = \arg\max_{\theta} P(X, Y; \theta)$--></li>
+<li>Estimates: Parameters are learned by maximizing the <strong>joint likelihood</strong>: $\widehat{\theta_{NB}} = \arg\max_{\theta} P(X, Y; \theta)$</li>
 <li>Properties:
 <ul>
 <li><strong>Higher asymptotic error</strong> : Because the independence assumption is not always true, it can produce biased estimates when data features are correlated</li>

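A minimal NumPy sketch of the two estimators contrasted in the lecture-15 notes above: gradient ascent on the logistic-regression conditional log-likelihood, and closed-form Gaussian naive Bayes fits of the joint likelihood. The function names, step size, and synthetic data are illustrative assumptions, not part of the committed page.

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, steps=2000):
    # Gradient ascent on sum_i [y_i log s(th^T x_i) + (1 - y_i) log(1 - s(th^T x_i))].
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))      # sigma(theta^T x)
        theta += lr * X.T @ (y - p) / len(y)      # gradient of the conditional log-likelihood
    return theta

def fit_gaussian_naive_bayes(X, y):
    # Closed-form MLEs of the joint likelihood: per-class means/stds plus class priors P(Y).
    return {c: (X[y == c].mean(axis=0),
                X[y == c].std(axis=0) + 1e-9,
                np.mean(y == c))
            for c in np.unique(y)}

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
theta_lr = fit_logistic_regression(np.hstack([X, np.ones((len(X), 1))]), y)  # bias column appended
nb_params = fit_gaussian_naive_bayes(X, y)
```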
dgm-fall-2025/notes/lecture-16/index.html

Lines changed: 58 additions & 35 deletions
@@ -72,7 +72,7 @@
 "editors": [
 
 {
-"editor": "Assistant Editor"
+"editor": "Jeff Zhang"
 }
 
 ],
@@ -153,7 +153,25 @@ <h1>Lecture 16 - Autoencoders and Variational Autoencoders (VAEs)</h1>
 
 <d-byline></d-byline>
 
-<d-article> <h2 id="1-introduction-and-overview">1. Introduction and Overview</h2>
+<d-article> <h2 id="todays-topics">Today’s Topics:</h2>
+<ul>
+<li><a href="#todays-topics">Today’s Topics:</a></li>
+<li><a href="#1-introduction-and-overview">1. Introduction and Overview</a></li>
+<li><a href="#2-discriminative-vs-generative-modeling">2. Discriminative vs. Generative Modeling</a></li>
+<li><a href="#3-deep-generative-models-dgms">3. Deep Generative Models (DGMs)</a></li>
+<li><a href="#4-autoencoders-concept-and-motivation">4. Autoencoders: Concept and Motivation</a></li>
+<li><a href="#5-architecture-of-an-autoencoder">5. Architecture of an Autoencoder</a></li>
+<li><a href="#6-autoencoders-vs-pca">6. Autoencoders vs PCA</a></li>
+<li><a href="#7-autoencoder-variants">7. Autoencoder Variants</a></li>
+<li><a href="#8-variational-autoencoders-vaes-theory-and-intuition">8. Variational Autoencoders (VAEs): Theory and Intuition</a></li>
+<li><a href="#9-the-reparameterization-trick">9. The Reparameterization Trick</a></li>
+<li><a href="#10-generating-new-samples-with-vaes">10. Generating New Samples with VAEs</a></li>
+<li><a href="#11-applications">11. Applications</a></li>
+<li><a href="#12-summary-and-takeaways">12. Summary and Takeaways</a></li>
+<li><a href="#references">References</a></li>
+</ul>
+
+<h2 id="1-introduction-and-overview">1. Introduction and Overview</h2>
 
 <p>This lecture focuses on <strong>Deep Generative Models (DGMs)</strong> — models designed to learn the underlying data distribution, enabling both prediction and generation of new samples.
 We move from <strong>discriminative models</strong>, which model $P(Y|X)$, to <strong>generative models</strong>, which model $P(X, Y)$ or $P(X)$.</p>
@@ -162,7 +180,7 @@ <h1>Lecture 16 - Autoencoders and Variational Autoencoders (VAEs)</h1>
 
 <hr />
 
-<h2 id="2-discriminative-vs-generative-modeling">2. Discriminative vs. Generative Modeling</h2>
+<h2 id="2-discriminative-vs-generative-modeling">2. Discriminative vs Generative Modeling</h2>
 
 <table>
 <thead>
@@ -171,32 +189,39 @@ <h2 id="2-discriminative-vs-generative-modeling">2. Discriminative vs. Generativ
 <th>Learns</th>
 <th>Objective</th>
 <th>Examples</th>
-<th> </th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><strong>Discriminative</strong></td>
-<td>$P(Y</td>
-<td>X)$</td>
+<td>$P(Y \mid X)$</td>
 <td>Classify or predict outcomes</td>
 <td>Logistic Regression, CNNs</td>
 </tr>
 <tr>
 <td><strong>Generative</strong></td>
-<td>$P(X, Y)$ or $P(X</td>
-<td>Y)$</td>
+<td>$P(X, Y)$ or $P(X \mid Y)$</td>
 <td>Model data generation process</td>
 <td>Autoencoders, VAEs, GANs</td>
 </tr>
 </tbody>
 </table>
 
+<!-- format modified by JZ -->
+
 <p>In generative models, <strong>latent variables</strong> $z$ represent hidden structure in the data, making the following computations challenging:</p>
 
-<p>\(p_\theta(x) = \int p_\theta(x, z) \, dz\)
-\(p_\theta(z|x) \propto p_\theta(x|z) p(z)\)</p>
+<!--$$ p_\theta(x) = \int p_\theta(x, z) \, dz $$
+$$ p_\theta(z|x) \propto p_\theta(x|z) p(z) $$ -->
+
+<d-math block="">
+p_\theta(x) = \int p_\theta(x, z)\, dz
+</d-math>
 
+<d-math block="">
+p_\theta(z \mid x) \propto p_\theta(x \mid z)\, p(z)
+</d-math>
+<!-- format modified by JZ -->
 <p>Because $z$ is unobserved, both the marginal likelihood and posterior inference are intractable in complex data, requiring <strong>approximate inference</strong>.</p>
 
 <hr />
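A rough illustration of why the marginal likelihood in the hunk above is hard to compute: a naive Monte Carlo estimate of p_theta(x) = ∫ p_theta(x, z) dz samples z from the prior and averages p_theta(x | z), which needs a prohibitive number of samples for realistic models. The toy Gaussian decoder below is an assumption for illustration only.

```python
import numpy as np

def log_p_x_given_z(x, z, W):
    # Toy Gaussian decoder: x | z ~ N(W z, I). Illustrative only.
    mean = W @ z
    return -0.5 * np.sum((x - mean) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)

def naive_log_marginal_likelihood(x, W, n_samples=10_000, latent_dim=2, seed=0):
    # log p(x) = log E_{z ~ p(z)} [ p(x | z) ], estimated from prior samples.
    rng = np.random.default_rng(seed)
    zs = rng.standard_normal((n_samples, latent_dim))
    log_ps = np.array([log_p_x_given_z(x, z, W) for z in zs])
    m = log_ps.max()                                   # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_ps - m)))

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])    # maps 2-D z to 3-D x
x = np.array([0.3, -1.2, 0.7])
print(naive_log_marginal_likelihood(x, W))             # high-variance estimate of log p(x)
```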
@@ -211,13 +236,14 @@ <h2 id="3-deep-generative-models-dgms">3. Deep Generative Models (DGMs)</h2>
 <li>Learn probabilistic mappings between $x$ and $z$.</li>
 <li>Use neural networks for non-linear transformations.</li>
 <li>Combine deep learning’s representational power with probabilistic reasoning.</li>
+<li>Latent variable $z$ capture hidden structure that explains high-dimensional observations $x$. <!-- add by JZ --></li>
 </ul>
 
 <hr />
 
 <h2 id="4-autoencoders-concept-and-motivation">4. Autoencoders: Concept and Motivation</h2>
 
-<p>An <strong>Autoencoder (AE)</strong> is an unsupervised neural network trained to reproduce its input.
+<p>An <strong>Autoencoder (AE)</strong> is an <strong>unsupervised</strong> (no labeled) neural network trained to reproduce its input.
 It compresses the input into a <strong>latent representation (code)</strong> and reconstructs the input from this compressed form.</p>
 
 <p><strong>Applications:</strong></p>
@@ -240,8 +266,12 @@ <h2 id="5-architecture-of-an-autoencoder">5. Architecture of an Autoencoder</h2>
 </ul>
 
 <p><strong>Training objective:</strong>
-\(L(x, \hat{x}) = ||x - \hat{x}||^2\)</p>
+<!--$$ L(x, \hat{x}) = ||x - \hat{x}||^2 $$--></p>
 
+<d-math block="">
+L(x, \hat{x}) = ||x - \hat{x}||^2
+</d-math>
+<!-- format modified by JZ -->
 <p>By minimizing reconstruction loss, the network learns to capture meaningful low-dimensional structure.</p>
 
 <hr />
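For context on the reconstruction objective in the hunk above, a minimal PyTorch sketch of an autoencoder trained with the squared-error loss L(x, x_hat) = ||x - x_hat||^2; layer sizes, optimizer settings, and the random stand-in batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))   # x -> code z
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))    # z -> reconstruction x_hat

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                        # stand-in batch; real data would come from a DataLoader
x_hat = model(x)
loss = ((x - x_hat) ** 2).sum(dim=1).mean()    # squared reconstruction error
loss.backward()
optimizer.step()
```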
@@ -279,42 +309,31 @@ <h3 id="74-variational-autoencoders-vaes">7.4 Variational Autoencoders (VAEs)</h
 
 <h2 id="8-variational-autoencoders-vaes-theory-and-intuition">8. Variational Autoencoders (VAEs): Theory and Intuition</h2>
 
+<!-- format of this section was modified by JZ to address render error. No content changed-->
 <p>VAEs model the data generation process as:</p>
 
 <ol>
 <li>Sample latent variable $z \sim p(z)$ (e.g., $\mathcal{N}(0, I)$).</li>
-<li>
-<table>
-<tbody>
-<tr>
-<td>Generate data $x$ from conditional distribution $p_\theta(x</td>
-<td>z)$.</td>
-</tr>
-</tbody>
-</table>
-</li>
+<li>Generate data $x$ from conditional distribution $p_\theta(x \mid z)$. <!-- format modified by JZ --></li>
 </ol>
 
-<table>
-<tbody>
-<tr>
-<td>The encoder approximates the posterior $q_\phi(z</td>
-<td>x)$ using neural networks, producing mean and variance parameters $(\mu, \sigma)$.</td>
-</tr>
-</tbody>
-</table>
+<p>The encoder approximates the posterior $q_\phi(z \mid x)$ using neural networks, producing mean and variance parameters $(\mu, \sigma)$.</p>
 
 <p>We cannot compute $\log p_\theta(x)$ directly because the marginalization over $z$ is intractable.<br />
-By introducing an approximate posterior $q_\phi(z|x)$ and applying Jensen’s inequality:</p>
+By introducing an approximate posterior $q_\phi(z \mid x)$ and applying Jensen’s inequality:</p>
 
-\[\log p_\theta(x) = \log \int p_\theta(x,z)dz
-\ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)||p(z))\]
+<d-math block="">
+\log p_\theta(x) = \log \int p_\theta(x,z)dz
+\ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)||p(z))
+</d-math>
 
 <p>This lower bound is the <strong>Evidence Lower Bound (ELBO)</strong>, which VAEs maximize during training.</p>
 
 <p><strong>Objective (Evidence Lower Bound - ELBO):</strong></p>
 
-\[\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))\]
+<d-math block="">
+\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))
+</d-math>
 
 <ul>
 <li>First term → reconstruction accuracy</li>
@@ -327,7 +346,11 @@ <h2 id="9-the-reparameterization-trick">9. The Reparameterization Trick</h2>
 
 <p>To allow gradients to flow through random sampling, we use:</p>
 
-\[z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
+<!-- format of this section was modified by JZ to address render error. No content changed-->
+
+<d-math block="">
+z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
+</d-math>
 
 <p>This separates stochasticity from the deterministic part, making training possible with backpropagation.</p>
 
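To make the ELBO and reparameterization terms in the hunks above concrete, a minimal PyTorch sketch of a VAE forward pass; the single-layer encoder/decoder and the Gaussian decoder likelihood are illustrative assumptions. The KL term uses the standard closed form for a diagonal Gaussian against N(0, I).

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(input_dim, 2 * latent_dim)  # outputs mu and log-variance
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        z = mu + std * eps                                    # reparameterization: z = mu + sigma * eps
        x_hat = self.decoder(z)
        recon = -((x - x_hat) ** 2).sum(dim=-1)               # log p(x|z) up to a constant (Gaussian decoder)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)  # D_KL(q(z|x) || N(0, I))
        elbo = recon - kl
        return (-elbo).mean()                                 # negative ELBO as the training loss

loss = VAE()(torch.rand(8, 784))
loss.backward()
```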
dgm-fall-2025/notes/lecture-20/index.html

Lines changed: 12 additions & 1 deletion
@@ -64,7 +64,7 @@
 "editors": [
 
 {
-"editor": ""
+"editor": "Xinyu Pan"
 }
 
 ],
@@ -150,6 +150,11 @@ <h1>Lecture 20</h1>
 <h2 id="the-attention-mechanism">The Attention Mechanism</h2>
 
 <p><strong>Motivation:</strong> Different parts of our input relate to different parts of our output. Sometimes these important relationships can be far apart, like in machine translation. Attention helps us dynamically calculate what is important.</p>
+<h3 id="why-attention-was-needed-long-range-dependency-problem">Why Attention Was Needed (Long-Range Dependency Problem)</h3>
+<p>RNNs compress an entire input sequence into a <strong>single hidden vector</strong>, which makes capturing long-range dependencies difficult.<br />
+During backpropagation, gradients must pass through many time steps, causing <strong>vanishing/exploding gradients</strong>.</p>
+
+<p>Attention solves this by <strong>directly referencing the entire input sequence</strong> when predicting each output, instead of relying on hidden states to store all information.</p>
 
 <ul>
 <li><strong>Origin:</strong> Originally from Natural Language Processing (NLP) and language translation.</li>
@@ -193,6 +198,12 @@ <h4 id="soft-attention-vs-rnn-for-image-captioning">Soft Attention vs. RNN for I
 <p><strong>Aside:</strong> CNNs were an example of <strong>Hard Attention</strong>. As the filter slides over the image, the part of the image inside the filter gets attention weight 1, and the rest gets weight 0.</p>
 </blockquote>
 
+<h3 id="why-attention-reduces-the-need-for-recurrence">Why Attention Reduces the Need for Recurrence</h3>
+<ul>
+<li>Attention repeatedly refers back to the input, so the hidden state no longer needs to store all global information.</li>
+<li>This insight led to eliminating recurrence entirely in the Transformer.</li>
+</ul>
+
 <hr />
 
 <h2 id="self-attention">Self Attention</h2>
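A minimal NumPy sketch of the scaled dot-product attention that the lecture-20 notes build toward; shapes and random inputs are illustrative assumptions only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the input positions
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 query positions, d_k = 8
K = rng.standard_normal((6, 8))   # 6 input positions
V = rng.standard_normal((6, 8))
out = scaled_dot_product_attention(Q, K, V)   # each output row is a weighted mix of all 6 inputs
```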
