chapters/part1/ch01_information_compression.tex (2 changes: 1 addition & 1 deletion)
@@ -62,7 +62,7 @@ \subsection{Why Log Loss Is Codelength}

Note: We are discussing optimal codelengths in theory. In practice, you would use arithmetic coding to actually achieve these codelengths when transmitting data. Arithmetic coding can asymptotically achieve the Shannon limit, encoding data using $H(p) + o(1)$ bits per symbol on average.

-An interesting variant is \textbf{universal coding}: designing codes that work well even when you do not know the true distribution $p$ in advance. Universal codes adapt to the data they see, achieving near-optimal compression without requiring perfect prior knowledge. For instance, the Krichevsky-Trofimov estimator achieves regret of $O(\log n)$ when coding binary sequences, meaning the extra codelength compared to knowing $p$ from the start grows only logarithmically with the amount of data. This connects to online learning: as you see more data, your code (and your model) improves, approaching optimality asymptotically.
+An interesting variant is \textbf{universal coding}: designing codes that work well even when you do not know the true distribution $p$ in advance. Universal codes adapt to the data they see, achieving near-optimal compression without requiring perfect prior knowledge. For instance, the Krichevsky-Trofimov estimator achieves regret of $O(\log n)$ when coding binary sequences, meaning the extra codelength compared to knowing $p$ from the start grows only logarithmically with the amount of data. This connects to machine learning: as you see more data, your code (and your model) improves, approaching optimality asymptotically.

\subsection{Connection to Maximum Likelihood}
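To make the arithmetic-coding note in this hunk concrete, here is a minimal Python sketch showing that the ideal codelength $-\log_2 p(x)$, summed over an i.i.d. sample, averages out to the entropy $H(p)$, the per-symbol limit that arithmetic coding approaches. The three-symbol distribution, seed, and names are toy assumptions, not from the chapter.

```python
import math
import random

# Hypothetical toy source: three symbols with an assumed distribution p.
p = {"a": 0.5, "b": 0.25, "c": 0.25}

def entropy(dist):
    """Shannon entropy H(p) in bits per symbol."""
    return -sum(q * math.log2(q) for q in dist.values())

random.seed(0)
n = 100_000
sample = random.choices(list(p), weights=list(p.values()), k=n)

# Ideal total codelength: sum of -log2 p(x) over the sample.
# An arithmetic coder achieves this total to within about 2 bits,
# which is the o(1) per-symbol overhead mentioned in the note.
total_bits = sum(-math.log2(p[x]) for x in sample)

print(f"H(p)           = {entropy(p):.4f} bits/symbol")
print(f"total_bits / n = {total_bits / n:.4f} bits/symbol")
```

The two printed numbers agree to a few decimal places, which is exactly the "log loss is codelength" identity the subsection is named for.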

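The Krichevsky-Trofimov estimator in the changed paragraph predicts the next bit with probability $(c_1 + 1/2)/(n + 1)$, where $c_1$ counts the ones seen so far. Here is a minimal sketch of its $O(\log n)$ regret claim; the Bernoulli parameter, seed, and helper names (`kt_codelength`, `optimal_codelength`) are illustrative, not from the chapter.

```python
import math
import random

def kt_codelength(bits):
    """Total codelength in bits when coding sequentially with the
    Krichevsky-Trofimov estimator: P(next = 1) = (ones + 1/2) / (n + 1)."""
    ones = zeros = 0
    total = 0.0
    for b in bits:
        p_one = (ones + 0.5) / (ones + zeros + 1)
        total += -math.log2(p_one if b else 1 - p_one)
        ones += b
        zeros += 1 - b
    return total

def optimal_codelength(bits, p):
    """Codelength if the true Bernoulli(p) source were known in advance."""
    ones = sum(bits)
    n = len(bits)
    return -ones * math.log2(p) - (n - ones) * math.log2(1 - p)

random.seed(0)
p_true = 0.3
for n in (100, 1_000, 10_000):
    bits = [random.random() < p_true for _ in range(n)]
    regret = kt_codelength(bits) - optimal_codelength(bits, p_true)
    # Extra codelength vs. knowing p from the start: grows like 0.5*log2(n).
    print(f"n={n:>6}  regret={regret:6.2f} bits  0.5*log2(n)={0.5 * math.log2(n):5.2f}")
```

The printed regret grows roughly like $\tfrac{1}{2}\log_2 n$ while $n$ grows by factors of ten, which is the logarithmic overhead the paragraph describes.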