Dremel Demo

You can play around with a demo here ()

1. Overview

This project provides a Python reference implementation of the Dremel columnar storage algorithm, as described in the original Google paper ("Dremel: Interactive Analysis of Web-Scale Datasets").

It demonstrates the core concepts of:

Record Shredding: Decomposing nested and repeated records (JSON-like) into columnar data with Repetition and Definition levels.
Record Assembly: Reconstructing original records from columnar data using Finite State Machines (FSMs).

The implementation includes:

shred.py: Logic for shredding records into columns.
assembly.py: Logic for assembling records from columns.
fsm.py: Construction of the FSM used for efficient record assembly.
schema.py: Schema definition and parsing helpers.

2. Example Usage

from schema import parse_schema
from shred import shred_records
from assembly import assemble_records

# 1. Define Schema
# The schema is defined as a list of paths.
# [*] denotes a repeated field.
schema_strings = ["doc.title", "doc.links[*].url"]
schema = parse_schema(schema_strings)

# 2. Input Records
records = [
    {
        "doc": {
            "title": "Dremel Paper",
            "links": [{"url": "http://google.com"}, {"url": "http://cs.stanford.edu"}]
        }
    },
    {
        "doc": {
            "title": "Another Doc"
            # links is missing (optional)
        }
    }
]

# 3. Shred Records (Write to Columns)
# Returns a dict mapping ColumnDescriptor -> list of (value, repetition_level, definition_level)
shredded_columns = shred_records(schema, records)

# 4. Assemble Records (Read from Columns)
# Reconstructs the records from the columnar data
assembled_records = assemble_records(schema, shredded_columns)
print(assembled_records)

3. Assumptions / Limitations

In-Memory: This is a demonstration implementation and operates entirely in memory. It is not intended for production-scale data processing.
Schema Support: Supports nested and repeated fields (groups) and primitive types (integers, strings). All fields are optional.
Validation: Assumes input records strictly conform to the provided schema.
Performance: Optimized for clarity and correctness over raw speed.

4. TODO

Support lossless encoding and decoding of empty sub-messages.
Support lazy shredding of sparse records via writer versions (optimization for sparse data).

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
README.md		README.md
app.py		app.py
assembly.py		assembly.py
assembly_test.py		assembly_test.py
example_usage.py		example_usage.py
fsm.py		fsm.py
fsm_test.py		fsm_test.py
paper_schema.py		paper_schema.py
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements.txt		requirements.txt
schema.py		schema.py
schema_test.py		schema_test.py
shred.py		shred.py
shred_test.py		shred_test.py
test_utils.py		test_utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dremel Demo

1. Overview

2. Example Usage

3. Assumptions / Limitations

4. TODO

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Dremel Demo

1. Overview

2. Example Usage

3. Assumptions / Limitations

4. TODO

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages