Datalake

A personal data lake for managing audio recordings, transcripts, and screenshots with fast SQLite-based queries and organized filesystem storage.

Features

  • SQLite Database: Zero-server-overhead queries with full-text search
  • Filesystem Organization: Human-readable date-based structure (YYYY/MM/DD)
  • Fast Access: Optimized for <10ms queries with proper indexes
  • Full-Text Search: Search transcripts using SQLite FTS5
  • Comprehensive Logging: Structured logging with timestamps for all operations
  • Tested: 35+ comprehensive tests ensuring reliability
  • Containerized: Docker support with isolated environment variables

Quick Start (Docker - Recommended)

The datalake runs as a Docker container with both a REST API (for network access) and CLI tools.

Automated Setup (via local-bootstrapping)

On Laptop (where database runs):

cd ~/Programs/local-bootstrapping
./scripts/setup-datalake.sh

This will:

  • Set up the Datalake API service on port 8766
  • Configure auto-start on login
  • Enable self-healing (automatic restart on crash)
  • Run health checks every 5 minutes

The API will be available at:

  • Local: http://localhost:8766
  • Network: http://192.168.29.137:8766 (or your laptop IP)
  • Tailscale: http://100.103.8.87:8766

Manual Setup

Prerequisites

  • Docker
  • Docker Compose

Setup

  1. Build the container:
docker-compose build
  2. Initialize the database:
./docker-init.sh
  3. Start the API server:
docker-compose up -d

The API will be running on port 8766.

  4. Ingest audio files:
./docker-ingest.sh path/to/audio.wav "meeting,important"
  5. Query data (CLI):
./docker-query.sh
  6. Query data (API):
curl http://localhost:8766/api/v1/audio
curl http://localhost:8766/api/v1/stats
  7. Run tests:
./docker-test.sh

Docker Helper Scripts

  • docker-init.sh - Initialize database and directory structure
  • docker-ingest.sh <file> [tags] - Ingest audio files into the datalake
  • docker-query.sh - Open interactive query interface
  • docker-test.sh [args] - Run pytest tests in container

Docker Environment Variables

The container uses isolated environment variables (defined in Dockerfile and docker-compose.yml):

DATA_DIR=/data              # Persistent data storage (Docker volume)
DB_FILE=/data/datalake.db   # SQLite database location
LOG_DIR=/app/logs           # Log files (Docker volume)
PROJECT_ROOT=/app           # Application root
SCHEMA_FILE=/app/schema.sql # Database schema

Docker Volumes

Three persistent volumes are created:

  • datalake-data - Audio, transcripts, screenshots, and database
  • datalake-logs - Application logs
  • datalake-db - Reserved for future use

To inspect volumes:

docker volume ls | grep datalake
docker volume inspect datalake-data

To backup data:

docker run --rm -v datalake-data:/data -v $(pwd):/backup alpine tar czf /backup/datalake-backup.tar.gz /data

Directory Structure

~/Programs/datalake/
├── README.md                 # This file
├── schema.sql                # Database schema
├── datalake.db              # SQLite database
├── pyproject.toml           # Python project configuration
├── uv.lock                  # Dependency lock file
├── scripts/
│   ├── init.sh              # Initialize database and directories
│   ├── ingest-audio.sh      # Ingest audio files
│   └── query.sh             # Interactive query interface
├── tests/                   # Comprehensive test suite
│   ├── conftest.py
│   ├── test_database.py
│   ├── test_ingestion.py
│   ├── test_init_script.py
│   └── test_queries.py
├── logs/                    # Log files
│   ├── init.log
│   ├── ingest-audio.log
│   └── query.log
└── data/                    # Data storage (or symlink to SSD)
    ├── audio/YYYY/MM/DD/*.wav
    ├── transcripts/YYYY/MM/DD/*.txt
    └── screenshots/YYYY/MM/DD/*.png
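
The YYYY/MM/DD layout is simple enough to compute anywhere. A minimal Python sketch (illustrative only — the actual ingestion logic lives in the Bash scripts, and dest_path is a hypothetical helper):

from datetime import date
from pathlib import Path

def dest_path(data_dir: str, kind: str, filename: str) -> Path:
    # Build data/<kind>/YYYY/MM/DD/<filename> for today's date
    today = date.today()
    return Path(data_dir) / kind / f"{today:%Y}" / f"{today:%m}" / f"{today:%d}" / filename

print(dest_path("data", "audio", "recording.wav"))
# e.g. data/audio/2026/01/10/recording.wav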

Database Schema

Tables

audio

Stores audio file metadata.

Column | Type | Description
id | INTEGER | Primary key
file_path | TEXT | Relative path from data directory
filename | TEXT | Current filename
original_filename | TEXT | Original filename before ingestion
duration_seconds | REAL | Audio duration in seconds
format | TEXT | Audio format (wav, mp3, flac, etc.)
sample_rate | INTEGER | Sample rate in Hz
channels | INTEGER | Number of audio channels
size_bytes | INTEGER | File size in bytes
tags | TEXT | Comma-separated tags
created_at | TEXT | ISO 8601 timestamp
ingested_at | TEXT | When file was added to datalake
metadata | TEXT | Additional JSON metadata

transcripts

Stores transcript files and content.

Column | Type | Description
id | INTEGER | Primary key
file_path | TEXT | Relative path from data directory
filename | TEXT | Transcript filename
audio_id | INTEGER | Foreign key to audio table
content | TEXT | Full transcript text
word_count | INTEGER | Number of words
language | TEXT | Language code (e.g., 'en')
confidence | REAL | Transcription confidence (0.0-1.0)
provider | TEXT | Provider (e.g., 'assemblyai', 'whisper')
size_bytes | INTEGER | File size in bytes
tags | TEXT | Comma-separated tags
created_at | TEXT | ISO 8601 timestamp
ingested_at | TEXT | When file was added to datalake
metadata | TEXT | Additional JSON metadata

screenshots

Stores screenshot metadata.

Column | Type | Description
id | INTEGER | Primary key
file_path | TEXT | Relative path from data directory
filename | TEXT | Screenshot filename
width | INTEGER | Image width in pixels
height | INTEGER | Image height in pixels
format | TEXT | Image format (png, jpg, etc.)
size_bytes | INTEGER | File size in bytes
tags | TEXT | Comma-separated tags
created_at | TEXT | ISO 8601 timestamp
ingested_at | TEXT | When file was added to datalake
metadata | TEXT | Additional JSON metadata

transcripts_fts

FTS5 virtual table for full-text search on transcripts.
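
The FTS table is queryable from Python as well as from the shell. A sketch, assuming the transcript content is indexed as column 0 and the FTS rowid maps to transcripts.id, consistent with the query examples later in this README:

import sqlite3

conn = sqlite3.connect("datalake.db")
rows = conn.execute(
    "SELECT t.filename, snippet(transcripts_fts, 0, '[', ']', '...', 10) "
    "FROM transcripts_fts "
    "JOIN transcripts t ON t.id = transcripts_fts.rowid "
    "WHERE transcripts_fts MATCH ?",
    ("meeting",),
).fetchall()
for filename, excerpt in rows:
    print(filename, excerpt)
conn.close()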

Indexes

Optimized indexes for common queries:

  • idx_audio_created_at: Fast queries by date
  • idx_audio_tags: Fast tag filtering
  • idx_audio_format: Filter by audio format
  • idx_transcripts_created_at: Fast queries by date
  • idx_transcripts_audio_id: Join optimization
  • idx_transcripts_tags: Fast tag filtering
  • idx_screenshots_created_at: Fast queries by date
  • idx_screenshots_tags: Fast tag filtering
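
To verify a query is actually hitting one of these indexes, SQLite's EXPLAIN QUERY PLAN is handy (a small Python sketch; the exact plan wording varies across SQLite versions):

import sqlite3

conn = sqlite3.connect("datalake.db")
for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT filename FROM audio WHERE created_at > ?",
    ("2026-01-01",),
):
    print(row)  # expect a line mentioning idx_audio_created_at
conn.close()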

REST API Access

The datalake runs a FastAPI server on port 8766 providing network access to the database.

API Endpoints

Base URL: http://localhost:8766 (or laptop IP for network access)

Interactive Documentation:

  • Swagger UI: http://localhost:8766/docs
  • ReDoc: http://localhost:8766/redoc

Core Endpoints:

# Health check
GET /health

# List audio files (with pagination and filtering)
GET /api/v1/audio?limit=10&offset=0&tags=meeting

# Get specific audio file
GET /api/v1/audio/{id}

# List transcripts
GET /api/v1/transcripts?limit=10&offset=0

# Get specific transcript
GET /api/v1/transcripts/{id}

# Search transcripts (FTS5 full-text search)
GET /api/v1/search/transcripts?q=search_term&limit=10

# Get database statistics
GET /api/v1/stats

# List screenshots
GET /api/v1/screenshots?limit=10&offset=0

Example Usage:

# From desktop, access laptop database
curl http://100.103.8.87:8766/api/v1/audio | jq

# Search transcripts
curl "http://100.103.8.87:8766/api/v1/search/transcripts?q=meeting" | jq

# Get stats
curl http://100.103.8.87:8766/api/v1/stats | jq
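
The same endpoints work from any HTTP client, not just curl. A minimal Python sketch using only the standard library (the response shapes are assumptions — consult /docs for the real schema):

import json
from urllib.request import urlopen

BASE = "http://localhost:8766"

# Database statistics
with urlopen(f"{BASE}/api/v1/stats") as resp:
    print(json.load(resp))

# Full-text search over transcripts
with urlopen(f"{BASE}/api/v1/search/transcripts?q=meeting&limit=5") as resp:
    print(json.load(resp))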

Network Access

From Desktop to Laptop:

  • Tailscale: http://100.103.8.87:8766
  • Local Network: http://192.168.29.137:8766 (find IP with ip addr)

CORS: Enabled for all origins (suitable for local network use)

Managing the Service

If set up via local-bootstrapping (auto-start enabled):

# Check status
systemctl --user status datalake.service

# Restart service
systemctl --user restart datalake.service

# Stop service
systemctl --user stop datalake.service

# Start service
systemctl --user start datalake.service

# View service logs
journalctl --user -u datalake.service -f

# View container logs
cd ~/Programs/datalake
docker-compose logs -f

# Disable auto-start
systemctl --user disable datalake.service

Manual Docker management:

# Start server
docker-compose up -d

# Stop server
docker-compose down

# Restart server
docker-compose restart

# View logs
docker-compose logs -f

# Rebuild and restart
docker-compose build && docker-compose up -d

Web UI

The datalake includes a Flask-based web interface on port 5050 for browsing data and monitoring.

Starting the Web UI

Automated (via systemd):

# Enable and start
systemctl --user enable --now datalake-web.service

# Check status
systemctl --user status datalake-web.service

# View logs
journalctl --user -u datalake-web.service -f

Manual:

cd ~/Programs/datalake
uv run python3 -m web.app --port 5050

Web UI Endpoints

URL | Description
http://localhost:5050/ | Home page with stats
http://localhost:5050/sessions | Browse Claude sessions
http://localhost:5050/voice | Voice typing sessions
http://localhost:5050/chatgpt | ChatGPT conversations
http://localhost:5050/search | Full-text search
http://localhost:5050/memory | Memory monitoring dashboard
http://localhost:5050/memory/events | Memory event timeline

Memory Monitoring

Monitor Claude Code memory usage with real-time charts and alerts.

Overview

The memory monitoring system collects RAM metrics from Claude processes and stores them in the datalake for visualization and analysis.

Data sources:

  • /var/log/claude-memory/metrics.jsonl - Time-series RAM data (every 10 seconds)
  • /var/log/claude-memory/events.jsonl - Memory events (warnings, kills, restarts)

Database Tables

-- Memory metrics (time-series RAM data)
memory_metrics (
    pid, session_id, rss_mb, memory_rate_mb_min,
    timestamp, source_device
)

-- Memory events (warnings, kills, restarts)
memory_events (
    event_type, pid, severity, message,
    details, timestamp
)
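
The bundled parser (parsers.memory_parser, described below) performs the actual ingestion; purely as an illustration of the data flow, a stripped-down Python version might look like this (the jsonl field names are assumptions inferred from the table sketch above):

import json
import sqlite3

def ingest_metrics(jsonl_path: str, db_path: str, device: str) -> None:
    # One memory_metrics row per line of metrics.jsonl
    conn = sqlite3.connect(db_path)
    with open(jsonl_path) as f:
        for line in f:
            m = json.loads(line)
            conn.execute(
                "INSERT INTO memory_metrics "
                "(pid, session_id, rss_mb, memory_rate_mb_min, timestamp, source_device) "
                "VALUES (?, ?, ?, ?, ?, ?)",
                (m["pid"], m["session_id"], m["rss_mb"],
                 m["memory_rate_mb_min"], m["timestamp"], device),
            )
    conn.commit()
    conn.close()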

Dashboard Features

Memory Dashboard (/memory):

  • Real-time RAM usage chart (per PID)
  • Rate of change graph (MB/min)
  • Active sessions list with current RAM
  • Low-memory mode toggle per session

Event Timeline (/memory/events):

  • Chronological event list
  • Filter by event type
  • Color-coded by severity

API Endpoints

Flask Web UI (port 5050):

# Get chart data (for AJAX updates)
GET /api/memory/chart-data

# List active sessions
GET /api/memory/sessions

# Toggle low-memory mode for a session
POST /api/memory/sessions/{pid}/low-memory-mode
# Body: {"enabled": true}

FastAPI REST (port 8766):

# Get today's metrics
GET /api/v1/memory/metrics/today

# Get metrics for date range
GET /api/v1/memory/metrics/range?start=2026-01-01&end=2026-01-14

# Get today's events
GET /api/v1/memory/events/today

# Get events for date range
GET /api/v1/memory/events/range?start=2026-01-01&end=2026-01-14

# List sessions
GET /api/v1/memory/sessions

# Toggle low-memory mode
POST /api/v1/memory/sessions/{pid}/low-memory-mode?enabled=true
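
For example, toggling low-memory mode from Python with only the standard library (a sketch; the empty-body POST and JSON response are assumptions — check /docs for the exact contract):

import json
from urllib.request import Request, urlopen

# Hypothetical PID 1234; endpoint taken from the list above
req = Request(
    "http://localhost:8766/api/v1/memory/sessions/1234/low-memory-mode?enabled=true",
    method="POST",
)
with urlopen(req) as resp:
    print(json.load(resp))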

Ingesting Memory Data

From desktop (via SSH):

# One-time ingestion
./scripts/ingest-memory.sh --once

# Watch mode (continuous)
./scripts/ingest-memory.sh --watch

# Check connectivity
./scripts/ingest-memory.sh --check

Direct parser usage:

# Parse local log files
uv run python3 -m parsers.memory_parser --log-dir /var/log/claude-memory

# Parse from specific device
uv run python3 -m parsers.memory_parser --device desktop

# Test mode (dry run)
uv run python3 -m parsers.memory_parser --test

Low-Memory Mode

When enabled for a session, a control file is created at:

/var/log/claude-memory/low-memory-mode/{pid}

The Claude memory monitoring hook reads this file and injects low-memory guidance.
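
From the hook's perspective, the check is just a file-existence test. A sketch of the idea (not the hook's actual implementation):

import os

def low_memory_mode_enabled(pid: int) -> bool:
    # True when the per-PID control file exists
    return os.path.exists(f"/var/log/claude-memory/low-memory-mode/{pid}")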


Getting Started (CLI Tools)

Prerequisites

  • Bash shell
  • SQLite3
  • Python 3.13+ (for running tests)
  • uv (Python package manager)
  • ffprobe (optional, for audio metadata extraction)

Installation

  1. Clone or navigate to the datalake directory:
cd ~/Programs/datalake
  2. Initialize the database and directory structure:
./scripts/init.sh

This will:

  • Create the SQLite database from schema.sql
  • Set up the data directory structure
  • Set proper file permissions (755 for directories, 644 for files)
  • Create log directories

Basic Usage

Ingest Audio Files

# Basic ingestion
./scripts/ingest-audio.sh path/to/audio.wav

# With tags
./scripts/ingest-audio.sh recording.wav "meeting,work,important"

The script will:

  1. Copy the file to data/audio/YYYY/MM/DD/
  2. Extract metadata using ffprobe (if available)
  3. Insert a record into the SQLite database
  4. Log all operations
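
In Python terms, that workflow looks roughly like the sketch below (illustrative only — ingest-audio.sh is the real implementation; this assumes the remaining audio columns are nullable and omits logging):

import shutil
import sqlite3
import subprocess
from datetime import date, datetime
from pathlib import Path

def ingest(src: str, tags: str = "") -> None:
    # 1. Copy the file into the date-based tree
    today = date.today()
    dest_dir = Path("data/audio") / f"{today:%Y}" / f"{today:%m}" / f"{today:%d}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(src).name
    shutil.copy2(src, dest)

    # 2. Extract duration with ffprobe, if it is installed
    duration = None
    if shutil.which("ffprobe"):
        out = subprocess.run(
            ["ffprobe", "-v", "quiet", "-show_entries", "format=duration",
             "-of", "csv=p=0", str(dest)],
            capture_output=True, text=True,
        ).stdout.strip()
        duration = float(out) if out else None

    # 3. Insert a record (subset of columns; assumes the rest are nullable)
    conn = sqlite3.connect("datalake.db")
    conn.execute(
        "INSERT INTO audio (file_path, filename, tags, duration_seconds, created_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (str(dest.relative_to("data")), dest.name, tags, duration,
         datetime.now().astimezone().isoformat()),
    )
    conn.commit()
    conn.close()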

Query Data

./scripts/query.sh

This opens an interactive menu with options:

  1. List recent audio (last 10)
  2. List recent transcripts (last 10)
  3. List recent screenshots (last 10)
  4. Search transcripts (full-text)
  5. Show audio by tags
  6. Show statistics
  7. Custom SQL query
  8. Open SQLite shell

Direct SQLite Queries

# List all audio files
sqlite3 datalake.db "SELECT filename, duration_seconds, tags FROM audio ORDER BY created_at DESC LIMIT 10;"

# Search transcripts
sqlite3 datalake.db "SELECT t.filename, snippet(transcripts_fts, 0, '>>>', '<<<', '...', 40) FROM transcripts t JOIN transcripts_fts ON t.id = transcripts_fts.rowid WHERE transcripts_fts MATCH 'search terms';"

# Get statistics
sqlite3 datalake.db "SELECT COUNT(*) as total, ROUND(SUM(duration_seconds)/60, 2) as minutes FROM audio;"

Running Tests

The project includes a comprehensive test suite with 35+ tests.

# Run all tests
uv run pytest -v

# Run specific test file
uv run pytest tests/test_database.py -v

# Run with coverage
uv run pytest --cov=. --cov-report=html

Test coverage includes:

  • Database schema validation
  • Audio ingestion workflow
  • Query functionality
  • Error handling
  • File permissions
  • Logging verification
  • Full-text search
  • Data integrity constraints
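
A schema-validation test, for instance, can be very small. An illustrative sketch in the style of the suite (not copied from tests/test_database.py):

import sqlite3

def test_core_tables_exist(tmp_path):
    # Build a fresh database from the schema (assumes cwd is the repo root)
    db = sqlite3.connect(tmp_path / "test.db")
    db.executescript(open("schema.sql").read())
    tables = {row[0] for row in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    assert {"audio", "transcripts", "screenshots"} <= tables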

Logging

All operations are logged with structured timestamps:

[2026-01-10T12:00:00+05:30] [INFO] Starting ingestion of: audio.wav
[2026-01-10T12:00:00+05:30] [INFO] File copied successfully
[2026-01-10T12:00:00+05:30] [INFO] Record inserted successfully with ID: 42

Log files are stored in logs/:

  • init.log: Initialization operations
  • ingest-audio.log: Audio ingestion operations
  • query.log: Query operations
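
If you ever emit logs from a Python component and want the same format, a matching logging config is short (a sketch; the shell scripts produce this format on their own):

import logging

logging.basicConfig(
    format="[%(asctime)s] [%(levelname)s] %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S%z",
    level=logging.INFO,
)
logging.info("Starting ingestion of: audio.wav")
# [2026-01-10T12:00:00+0530] [INFO] Starting ingestion of: audio.wav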

Data Migration

From omarchy-voice-typing

If you have existing data in ~/Programs/recordings/ and ~/Programs/transcripts/, you can migrate it:

# Initialize datalake first
./scripts/init.sh

# Migrate audio files
for file in ~/Programs/recordings/*.wav; do
    ./scripts/ingest-audio.sh "$file" "migrated,voice-typing"
done

# TODO: Create similar script for transcripts and screenshots

Manual Migration Script

Create a migration script:

#!/usr/bin/env bash
# migrate.sh - Migrate existing data to datalake

SOURCE_AUDIO=~/Programs/recordings
SOURCE_TRANSCRIPTS=~/Programs/transcripts

# Migrate audio
echo "Migrating audio files..."
for audio in "$SOURCE_AUDIO"/*.wav; do
    [ -f "$audio" ] || continue
    echo "Ingesting: $audio"
    ./scripts/ingest-audio.sh "$audio" "migrated"
done

# TODO: Add transcript and screenshot migration
echo "Migration complete!"

Environment Variables

Scripts support environment variables for flexibility:

  • DATA_DIR: Override data directory location (default: ./data)
  • DB_FILE: Override database file path (default: ./datalake.db)
  • LOG_DIR: Override log directory (default: ./logs)
  • PROJECT_ROOT: Override project root (default: script parent directory)
  • SCHEMA_FILE: Override schema file path (default: ./schema.sql)

Example:

# Use custom data directory on SSD
export DATA_DIR=/mnt/data-ssd/datalake
./scripts/init.sh

Future Enhancements

  • Docker Container: Containerized deployment with isolated environment ✅
  • REST API: FastAPI wrapper for HTTP access (port 8766) ✅
  • Web UI: Flask dashboard for browsing and searching (port 5050) ✅
  • Memory Monitoring: Claude Code RAM tracking with charts ✅
  • Vector Search: Add sqlite-vss extension for semantic search
  • Screenshot Ingestion: Script for screenshot ingestion with metadata
  • Transcript Ingestion: Script for transcript ingestion with word count
  • Automatic Cleanup: Script for managing old files and log rotation
  • Backup/Restore: Automated backup scripts with compression
  • Loki Integration: Centralized logging with Grafana Loki
  • S3 Backend: Optional S3-compatible storage backend
  • Encryption: At-rest encryption for sensitive recordings

Contributing

When making changes:

  1. Update relevant tests in tests/
  2. Run the full test suite: uv run pytest -v
  3. Ensure all tests pass
  4. Update this README if adding features
  5. Follow the coding style in existing scripts

License

Personal project - see owner information in CLAUDE.md
