This repository contains an L-Store-style columnar storage engine in Python. It demonstrates the core L-Store ideas (base vs. tail pages, indirection, schema encoding, merge/compaction, a page directory, and a bufferpool). The code is organized as a small DB engine plus demo/test harnesses that exercise typical operations (insert/select/update/delete/aggregates/range queries).
- Data is stored per column (each column uses fixed-size pages).
- Each logical record has a base record (stable copy) and subsequent tail records for updates.
- A page directory maps RIDs to the corresponding page range, base/tail page index, row offset and status flags.
- Indirection column points from a base record to the latest tail record (if any).
- Schema encoding uses a bitmask to describe which columns are present in a tail record.
- A merge (compaction) process periodically consolidates tail records back into base pages.
- A bufferpool manages in-memory page frames and writes pages to/from disk.
This makes it easy to model update-heavy workloads without rewriting base pages on every update and enables column-wise storage/access for analytical queries.
- Base page: Stores the original or consolidated version of tuples for a contiguous RID range. Base pages are relatively stable.
- Tail page: Stores updates (deltas) to columns for certain tuples. Multiple updates can be chained in tail pages.
- Indirection column: For each base RID, holds a pointer (RID) to the latest tail record (if updated). If no update, it points to itself.
- Schema encoding: Integer bitmask telling which columns were updated in a tail record (so non-updated columns are omitted there).
- Page directory: Maps a logical RID to physical location: page range index, base/tail page index, offset, and whether it's base or tail.
- Merge / compaction: Reads tail pages, applies latest updates to base records and writes back consolidated base pages, resetting relevant metadata.
- Bufferpool: In-memory cache of loaded page frames (pages for each column), with dirty-bit and basic eviction/commit behavior.
Lstore_database-main/
├─ README.md # original brief README
├─ __main__.py # demo/main runner (benchmarks)
├─ demo.py # interactive test/demo harness
├─ m1_* / m2_* / m3_* # milestone/tester scripts
└─ lstore/
├─ __init__.py
├─ config.py # page/sizing constants and special values
├─ page.py # Page model + disk read/write helpers
├─ bufferpool.py # Bufferpool and Frame classes
├─ db.py # Database wrapper (open/create/close)
├─ table.py # Table, page range, allocation and merge logic
├─ record.py # Record representation
├─ query.py # User-facing Query API (insert/select/update/delete/agg)
├─ index.py # Index wrapper (uses BTrees.IOBTree)
├─ transaction.py # Transaction skeleton
├─ transaction_worker.py # (worker thread example)
└─ btreetest.py # B-tree basic tests/usage
`lstore/config.py` defines engine-wide constants:

- `PAGE_SIZE` (4096 bytes)
- `PAGE_RECORD_SIZE` (8 bytes)
- `BASE_PAGE_COUNT` (number of base pages per page range)
- `META_COLUMN_COUNT` (number of metadata columns: indirection, RID, base_rid, timestamp, schema)
- `ENTRIES_PER_PAGE` (computed from `PAGE_SIZE` and `PAGE_RECORD_SIZE`)
- `SPECIAL_NULL_VALUE` (sentinel value)
These constants determine page layout and serialization rules.
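For concreteness, the layout arithmetic might look like the sketch below. The values assigned to `BASE_PAGE_COUNT` and `SPECIAL_NULL_VALUE` here are assumptions for illustration, not the repository's actual numbers.

```python
# Hypothetical sketch of lstore/config.py-style constants.
PAGE_SIZE = 4096          # bytes per physical page
PAGE_RECORD_SIZE = 8      # bytes per record slot (fixed-size cells)
BASE_PAGE_COUNT = 16      # base pages per page range (assumed value)
META_COLUMN_COUNT = 5     # indirection, RID, base_rid, timestamp, schema

# Derived: how many fixed-size slots fit in one page.
ENTRIES_PER_PAGE = PAGE_SIZE // PAGE_RECORD_SIZE  # 512 slots per page

# Sentinel meaning "no value" in an 8-byte cell (assumed encoding).
SPECIAL_NULL_VALUE = 2**64 - 1
```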
`lstore/page.py` implements a small page abstraction that:

- Maintains a `bytearray` of size `PAGE_SIZE`.
- Supports `write(value, row)` and `read(row)`, storing a fixed-size 8-byte value per row.
- Provides `read_from_disk` / `write_to_disk` helpers that read/write the pages for all columns of a page frame.

Note: every record slot is a fixed 8-byte cell holding an integer or encoded value.
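A minimal sketch of such a fixed-slot page, using `struct` to pack the 8-byte cells (an illustration of the idea, not the repository's exact implementation):

```python
import struct

PAGE_SIZE = 4096
PAGE_RECORD_SIZE = 8

class Page:
    """Sketch of a fixed-slot page: one 8-byte integer cell per row."""
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)

    def write(self, value, row):
        # Pack one unsigned 64-bit integer into the 8-byte slot for `row`.
        struct.pack_into('<Q', self.data, row * PAGE_RECORD_SIZE, value)

    def read(self, row):
        # Unpack the 8-byte slot back into an integer.
        return struct.unpack_from('<Q', self.data, row * PAGE_RECORD_SIZE)[0]

p = Page()
p.write(12345, 7)
print(p.read(7))  # -> 12345
```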
`lstore/bufferpool.py` provides the caching layer:

- `Frame`: stores the in-memory pages for all columns of a page (a "page frame").
- `Bufferpool`: loads page frames from disk on demand, keeps a mapping of frames, supports a dirty bit and a pin (`Lock`), and commits frames back to disk.
- Tracks frames with `frame_dir`, mapping `(table_name, page_range, page_index, is_base) -> frame_index`.

This organizes I/O and caching; eviction/replacement is deliberately simple and oriented toward educational clarity.
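The `frame_dir` lookup can be sketched as follows; the shapes here are illustrative and deliberately omit eviction, pinning, and real disk I/O:

```python
class Frame:
    """One in-memory page frame: the pages for every column of a page."""
    def __init__(self, pages):
        self.pages = pages      # list of bytearrays, one per column
        self.dirty = False      # set when any page is modified

class Bufferpool:
    """Sketch of the frame_dir lookup described above (simplified)."""
    def __init__(self):
        self.frames = []
        # (table_name, page_range, page_index, is_base) -> frame index
        self.frame_dir = {}

    def get_frame(self, table, page_range, page_index, is_base, num_columns):
        key = (table, page_range, page_index, is_base)
        if key not in self.frame_dir:
            # On a miss, a real bufferpool would read the pages from disk.
            pages = [bytearray(4096) for _ in range(num_columns)]
            self.frames.append(Frame(pages))
            self.frame_dir[key] = len(self.frames) - 1
        return self.frames[self.frame_dir[key]]
```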
`lstore/table.py` handles table storage:

- Manages a table's on-disk layout under the `table_path/` directories.
- Maintains `page_directory` and `page_ranges`, allocating base/tail pages and growing tail pages as needed.
- Implements `merge(rid)`, which runs consolidation: reads base and tail records, applies the latest column values, writes consolidated base pages, and resets metadata appropriately.
- `write_new_rec` writes a brand-new base record (on insert); helpers read a record by RID.
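The consolidation semantics can be illustrated with a small sketch. The `(schema_bitmask, values)` tuple shape and the full-width value lists with `None` for absent columns are assumptions for illustration, not the repository's on-disk format:

```python
def consolidate(base_values, tail_chain):
    """Sketch of merge semantics: apply tail updates (oldest first) onto a
    base record. Bit i of schema means column i is present in that tail
    record; absent columns are None in this illustrative shape."""
    merged = list(base_values)
    for schema, values in tail_chain:
        for col in range(len(merged)):
            if schema & (1 << col):
                merged[col] = values[col]
    return merged

# Base record (10, 20, 30); two updates: col 1 -> 21, then col 2 -> 35.
print(consolidate([10, 20, 30], [(0b010, [None, 21, None]),
                                 (0b100, [None, None, 35])]))
# -> [10, 21, 35]
```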
`lstore/index.py`:

- Thin wrapper around an on-disk/in-memory B-tree (uses `BTrees.IOBTree.IOBTree`).
- Stores one B-tree per column (created on demand).
- Provides `locate(column, value)` and range-locate helpers.
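A sketch of the per-column index API. The real code uses `BTrees.IOBTree.IOBTree`; a plain dict of value -> set-of-RIDs stands in here so the sketch needs no third-party package, and the method names mirror the README's description rather than the repository's exact signatures:

```python
class Index:
    """Illustrative per-column index with lazy creation."""
    def __init__(self, num_columns):
        self.indices = [None] * num_columns   # one index per column, lazy

    def _tree(self, column):
        if self.indices[column] is None:
            self.indices[column] = {}         # stand-in for an IOBTree
        return self.indices[column]

    def insert(self, column, value, rid):
        # Map an indexed value to the set of RIDs holding it.
        self._tree(column).setdefault(value, set()).add(rid)

    def locate(self, column, value):
        # All RIDs whose `column` equals `value`.
        return sorted(self._tree(column).get(value, set()))

    def locate_range(self, column, lo, hi):
        # All RIDs whose `column` falls in [lo, hi].
        tree = self._tree(column)
        return sorted(r for v, rids in tree.items()
                      if lo <= v <= hi for r in rids)
```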
`lstore/record.py` defines a `Record` object that holds metadata (`meta_data = [INDIRECTION, RID, base_rid, timestamp, schema_encoding]`) and the data columns.
`lstore/query.py`:

- Public API used by clients and demo scripts: `insert()`, `select()`, `update()`, `delete()`, `sum()`, `count()`, `average()`, `range_query()`, `increment()`, etc.
- Coordinates with `Table` and `Index` for record-level operations and uses schema encoding/indirection for updates.
`lstore/db.py` provides a `Database` wrapper with `open(path)`, `create_table(name, num_columns, key)`, `get_table(name)`, and `close()`, which persists metadata (page directory, indices) and commits bufferpool frames.
`lstore/transaction.py` contains a basic transaction skeleton that aggregates queries and attempts to run them with lightweight locking semantics. The concurrency model is intentionally simple (teaching-focused).
- Python 3.x
- `BTrees` module (part of the ZODB packages):

```sh
pip install BTrees
# or
python3 -m pip install BTrees
```
From the repo root:

```sh
python3 demo.py
python3 __main__.py
```

There are many tester scripts included. For example:

```sh
python3 m1_tester.py
python3 m2_tester_part1.py
python3 m3_tester_part_1.py
```

These scripts exercise insert/select/update/delete and aggregate queries, which are useful for regression checks.
```python
# minimal_example.py
from lstore.db import Database
from lstore.query import Query

db = Database()
db.open('./main_db')  # bufferpool created and DB root initialized
grades = db.create_table('Grades', 5, 0)  # num_columns=5, key column index=0
q = Query(grades)

# Insert one record: column 0 is the key, followed by four data columns
q.insert(1001, 10, 20, 30, 40)

# Select by key
rows = q.select(1001, 0, [1, 1, 1, 1, 1])  # returns Record objects
print(rows[0].columns)

# Update: pass None for columns you don't want to change
q.update(1001, None, 21, None, None, None)

# Aggregate
sum_ = q.sum(1000, 2000, 1)  # sum column 1 over key range [1000..2000]

db.close()
```

Notes about API shapes:

- `Query.insert(*columns)` — one value per column; the key lives at the column index specified when creating the table.
- `Query.select(key, column_index, query_columns)` — `query_columns` is a list of booleans (or 0/1) specifying which columns you want returned.
- `Query.update(key, *columns)` — pass `None` for columns you don't want to change.
- Pages are fixed-size byte arrays. Each record slot stores a fixed 8-byte integer value.
- Each column has a separate `.bin` file per page (columnar layout). When a page frame is loaded, it includes `num_columns + META_COLUMN_COUNT` pages (metadata columns are stored alongside data columns).
- Tables are split into page ranges, each containing `BASE_PAGE_COUNT` base pages and an extendable tail-page folder.
- On allocation, the engine creates base pages up front and allocates tail pages as required.
- `update()` writes a new tail record containing only the changed column values and a schema-encoding bitmask describing which columns are present.
- The base record's indirection column is updated to point to the new tail RID.
- Read logic follows the indirection pointer and checks the schema encoding to reconstruct the latest values.
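The update and read paths above can be sketched end to end. The dict-based record shapes (`'indirection'`, `'columns'`, `'prev'`, etc.) are illustrative only, not the repository's actual structures:

```python
def update_record(base, tail_store, changed):
    """Write a tail record with only the changed columns plus a schema
    bitmask, then repoint the base record's indirection at the new tail
    RID. `changed` maps column index -> new value."""
    schema = 0
    for col in changed:
        schema |= 1 << col
    tail_rid = len(tail_store)
    tail_store.append({'prev': base['indirection'],  # chain to older tail
                       'schema': schema,
                       'values': dict(changed)})
    base['indirection'] = tail_rid
    return tail_rid

def read_latest(base, tail_store, num_columns):
    """Walk the tail chain newest to oldest, taking each column from the
    first tail record whose schema bit covers it, else from the base."""
    result = [None] * num_columns
    rid = base['indirection']
    while rid is not None:
        rec = tail_store[rid]
        for col in range(num_columns):
            if result[col] is None and rec['schema'] & (1 << col):
                result[col] = rec['values'][col]
        rid = rec['prev']
    for col in range(num_columns):
        if result[col] is None:
            result[col] = base['columns'][col]
    return result
```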
- `table.merge(rid)` consolidates updates from tail pages for a page range into base pages: it copies the latest values into base pages and resets indirection/schema accordingly.
- A separate, simple merge bufferpool isolates merge I/O.
- The index is a B-tree per column (created lazily); the B-tree implementation comes from `BTrees.IOBTree.IOBTree`.
- The project includes a simple `Transaction` class that groups queries into a single logical unit, plus a query lock for table operations.
- `Bufferpool.Frame` includes a pin (`threading.Lock`) to avoid conflicting concurrent modifications while a frame is in use.
- Locking is coarse-grained and teaching-focused, not a production-grade MVCC implementation. The transaction worker skeleton (`transaction_worker.py`) demonstrates how transactions might be scheduled.
- `m1_tester.py`, `m1_extended_main.py`, `m1_extended_tester.py`: milestone 1 (insert/select basics).
- `m2_tester_part1.py`, `m2_tester_part2.py`, `m2_extended_*.py`: updates and merges.
- `m3_tester*`: more complex workloads.
You can run these scripts to validate engine behavior.
- Fork the repo.
- Open a branch for your feature/fix.
- Add tests and update README or docs.
- Open a PR describing the change.
- Natheethorn Teacharuangchit
- Michael Shaw
- Stuart Feng
- Henry Chou
- Eric Wang