diff --git a/chapters/part1/ch01_information_compression.tex b/chapters/part1/ch01_information_compression.tex
index d76fc65..d97d3e5 100644
--- a/chapters/part1/ch01_information_compression.tex
+++ b/chapters/part1/ch01_information_compression.tex
@@ -62,7 +62,7 @@ \subsection{Why Log Loss Is Codelength}
 Note: We are discussing optimal codelengths in theory. In practice, you would use arithmetic coding to actually achieve these codelengths when transmitting data. Arithmetic coding can asymptotically achieve the Shannon limit, encoding data using $H(p) + o(1)$ bits on average.
 
-An interesting variant is \textbf{universal coding}: designing codes that work well even when you do not know the true distribution $p$ in advance. Universal codes adapt to the data they see, achieving near optimal compression without requiring perfect prior knowledge. For instance, the Krichevsky-Trofimov estimator achieves regret of $O(\log n)$ when coding binary sequences, meaning the extra codelength compared to knowing $p$ from the start grows only logarithmically with the amount of data. This connects to online learning: as you see more data, your code (and your model) improves, approaching optimality asymptotically.
+An interesting variant is \textbf{universal coding}: designing codes that work well even when you do not know the true distribution $p$ in advance. Universal codes adapt to the data they see, achieving near optimal compression without requiring perfect prior knowledge. For instance, the Krichevsky-Trofimov estimator achieves regret of $O(\log n)$ when coding binary sequences, meaning the extra codelength compared to knowing $p$ from the start grows only logarithmically with the amount of data. This connects to machine learning: as you see more data, your code (and your model) improves, approaching optimality asymptotically.
 
 \subsection{Connection to Maximum Likelihood}
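The edited paragraph cites the Krichevsky-Trofimov estimator's $O(\log n)$ regret without demonstrating it. A minimal sketch of the claim (not from the chapter; the function names are illustrative, and the KT rule used is the standard add-one-half predictor $P(1 \mid a\ \text{ones},\ b\ \text{zeros}) = (a + \tfrac12)/(a + b + 1)$):

```python
import math

def kt_codelength(bits):
    """Total codelength (in bits) the Krichevsky-Trofimov sequential
    estimator assigns to a binary sequence, coding each symbol with
    P(1 | a ones, b zeros seen so far) = (a + 0.5) / (a + b + 1)."""
    ones = zeros = 0
    total = 0.0
    for b in bits:
        p_one = (ones + 0.5) / (ones + zeros + 1.0)
        total += -math.log2(p_one if b else 1.0 - p_one)
        ones += b
        zeros += 1 - b
    return total

def hindsight_codelength(bits):
    """Codelength achievable knowing the empirical bias in advance:
    n * H(p_hat), the Shannon limit for the best static Bernoulli code."""
    n = len(bits)
    p = sum(bits) / n
    if p in (0.0, 1.0):
        return 0.0
    return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

# The gap ("regret") between KT and the hindsight-optimal static code
# grows only logarithmically: roughly (1/2) * log2(n) + O(1) bits.
bits = [1, 0, 1, 1, 0, 1, 1, 1] * 100   # 800 symbols, empirical bias 0.75
regret = kt_codelength(bits) - hindsight_codelength(bits)
```

For these 800 symbols the regret is a handful of bits, consistent with the $\tfrac12 \log_2 n + O(1)$ behavior the paragraph describes: the per-symbol overhead of not knowing $p$ vanishes as $n$ grows.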