Skip to content

RFC: Grammar-Aware Literal Huffman Encoding for Small Structured Payloads (JSON / UTF-8) #4577

@kylesoskin

Description

@kylesoskin

Is your feature request related to a problem? Please describe.

For small, structured payloads (e.g., JSON encoded in UTF-8), zstd’s compression ratio is often dominated by literal entropy rather than LZ77 matches. In these cases:

  • Literal runs are short
  • Substrings rarely repeat
  • Huffman coding operates over the full 256-symbol alphabet
  • Compression can be weak or may occasionally expand the data

This pattern is common for configuration files, API payloads, telemetry events, NDJSON streams, and similar workloads where data is highly structured but not repetitive at the byte level.

The issue is not that zstd performs poorly in general, but that there remains exploitable, local entropy in structured literals that standard Huffman cannot capture without repetition, training, or global assumptions. In addition, many real-world payloads are mixed: structured regions embedded within otherwise unstructured text, where global format assumptions do not hold.


Describe the solution you'd like

I’m proposing (for discussion) a grammar-aware literal encoding mode that reduces literal entropy by constraining the set of valid symbols locally at each byte position.

The core idea:

  • Track lightweight, deterministic grammars (initially JSON + UTF-8) during literal analysis
  • At each byte position, determine the set of symbols that are actually valid given the current grammar state
  • Group bytes by grammar state and encode each group using its own Huffman table over a reduced alphabet
  • Apply grammar constraints opportunistically:
    • If a grammar becomes invalid, its constraints are dropped automatically
    • If grammar validity later resumes, constraints are re-applied
    • When no constraints apply, encoding falls back to standard Huffman behavior

This removes the need for a global “bailout” decision and allows structured and unstructured regions to coexist safely within a single payload.

Key properties:

  • Operates strictly at the byte level (no tokenization, no schema knowledge)
  • Does not require training or prior knowledge of the input
  • Grammar state is reconstructed implicitly during decode (no per-byte metadata)
  • Fully opt-in and heuristic-gated (no effect when not beneficial)
  • Binary-safe and robust to malformed or mixed inputs

A standalone prototype validates feasibility and compression gains. On a small corpus of real-world JSON files (3–128 KB), this approach reduces literal size by approximately 16–30% relative to standard Huffman, with no regressions when the heuristic declines to apply.

While the current implementation is isolated, the design is compatible with zstd’s existing literal histogram and entropy coding model.


Describe alternatives you've considered

  • zstd dictionary training
    Highly effective when representative training data is available, but often impractical for ad-hoc, one-off, or mixed-schema workloads.

  • Higher compression levels
    Improve match finding, but do not address literal entropy when repetition is low or literals dominate.

  • Format-specific compressors (e.g., JSON-aware tools)
    Typically rely on tokenization or schema awareness, are brittle on malformed input, and cannot be transparently composed into a general-purpose compressor like zstd.

  • Arithmetic coding
    Explored during prototyping, but Huffman was retained for simplicity, determinism, and alignment with zstd’s existing infrastructure.


Additional context

A working prototype has been implemented outside the main zstd pipeline to validate:

  • Measurable compression gains on structured UTF-8 text
  • Correct round-trip behavior
  • Graceful, local fallback on malformed, partially structured, or non-JSON / non-UTF-8 inputs
  • Bounded overhead via heuristics, bucket limits, and minimum-size thresholds

The literal sub-format is versioned and self-describing, and no changes to the zstd frame format are required.

I wanted to start with a design discussion before proposing any pipeline or default-behavior changes. If there is interest, I’m happy to follow up with detailed design notes, benchmarks, and a full RFC-style proposal.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions