This folder contains a fully modular workflow for the Market Data Storage and Querying with SQLite3 and Parquet assignment. Each concern lives in its own module so you can mix and match components depending on whether you are demonstrating ingestion, SQL analytics, Parquet analytics, or cross-format comparisons.
| File / Dir | Purpose |
|---|---|
| `data_loader.py` | Loads the raw CSVs, validates ticker references, and returns ready-to-ingest DataFrames. |
| `sqlite_storage.py` | Creates the SQLite schema, inserts data, and exposes the four SQL query tasks. |
| `parquet_storage.py` | Materializes a ticker-partitioned Parquet dataset and implements the three Parquet analytics tasks plus storage comparisons. |
| `assignment10.py` | Command-line entry point that orchestrates ingestion, querying, and reporting. |
| `market_data.db` | SQLite3 database generated by the script (deleted and recreated when running with `--rebuild`). |
| `market_data/` | Parquet directory partitioned by ticker for fast symbol-specific scans. |
| `query_tasks.md` | Narrative of SQL and Parquet query outputs along with observed runtimes. |
| `comparison.md` | Discussion of SQLite vs. Parquet storage characteristics and recommended use cases. |
| `tests/` | Unit tests covering ingestion and representative queries for both storage layers. |
Source CSVs (`market_data_multi.csv`, `tickers.csv`) plus the schema definition remain unchanged and live alongside these modules.
```shell
cd assignment-10
python3 assignment10.py --rebuild
```

- `--rebuild` drops the existing SQLite database and Parquet directory before re-loading from the CSVs. Omit the flag to reuse prior artifacts and skip the expensive load step.
- Use `--results /path/to/file.json` to change the location of the JSON summary (`results.json` by default).
Dependencies: pandas (already available) and pyarrow (install once via `pip3 install pyarrow`). No other services are needed.
- TSLA windowed extract – Pulls every Tesla trade between 2025-11-17 and 2025-11-18 and captures the query latency for the storage comparison.
- Average daily volume per ticker – Aggregates each ticker’s average volume by trade date.
- Top 3 tickers by total return – Uses first open and last close prices over the sample period to rank tickers.
- First/last trade price per day – Returns closing prices for the first and last minute bar per ticker per day.
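The TSLA windowed extract above can be sketched in a few lines of Python. This is an illustrative version only: the table name (`market_data` here) and column names are assumptions, since the real schema lives in the assignment's schema definition.

```python
import sqlite3
import time

def tsla_window(db_path="market_data.db"):
    """Hypothetical sketch of the TSLA windowed extract with latency capture."""
    query = """
        SELECT *
        FROM market_data              -- assumed table name; see the schema file
        WHERE ticker = 'TSLA'
          AND timestamp BETWEEN '2025-11-17' AND '2025-11-18 23:59:59'
        ORDER BY timestamp
    """
    start = time.perf_counter()
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(query).fetchall()
    # Elapsed wall-clock time feeds the SQLite vs. Parquet comparison.
    return rows, time.perf_counter() - start
```

Timing the `fetchall()` inside the measurement window matters here: SQLite executes lazily, so timing only `execute()` would understate the real query cost.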
- AAPL 5-minute rolling mean – Computes rolling averages over the `close` column using the partitioned Parquet dataset.
- 5-day rolling volatility – Resamples to daily closes, calculates daily returns, and builds a rolling standard deviation per ticker.
- SQLite vs Parquet comparison – Compares runtime and file sizes for the TSLA window query between the two storage formats.
Open `results.json` after running the script to see the latest row counts, timings, and storage comparisons. Re-run the script whenever you need fresh metrics or want to regenerate the SQLite/Parquet artifacts from the source CSVs.
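A quick way to inspect the summary is to load it with the standard library. The key names below are illustrative, not the script's actual output layout, which is whatever `assignment10.py` emits.

```python
import json

def load_results(path="results.json"):
    """Load the JSON summary written by assignment10.py."""
    with open(path) as fh:
        return json.load(fh)

# Example: print every top-level metric in the summary.
if __name__ == "__main__":
    for key, value in load_results().items():
        print(f"{key}: {value}")
```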