A personal data lake for managing audio recordings, transcripts, and screenshots with fast SQLite-based queries and organized filesystem storage.
- SQLite Database: Zero-server-overhead queries with full-text search
- Filesystem Organization: Human-readable date-based structure (YYYY/MM/DD)
- Fast Access: Optimized for <10ms queries with proper indexes
- Full-Text Search: Search transcripts using SQLite FTS5
- Comprehensive Logging: Structured logging with timestamps for all operations
- Tested: 35+ comprehensive tests ensuring reliability
- Containerized: Docker support with isolated environment variables
The datalake runs as a Docker container with both a REST API (for network access) and CLI tools.
On Laptop (where database runs):
```bash
cd ~/Programs/local-bootstrapping
./scripts/setup-datalake.sh
```

This will:
- Set up the Datalake API service on port 8766
- Configure auto-start on login
- Enable self-healing (restarts if crashes)
- Run health checks every 5 minutes
The API will be available at:
- Local: `http://localhost:8766`
- Network: `http://192.168.29.137:8766` (or your laptop IP)
- Tailscale: `http://100.103.8.87:8766`
- Docker
- Docker Compose
- Build the container:

  ```bash
  docker-compose build
  ```

- Initialize the database:

  ```bash
  ./docker-init.sh
  ```

- Start the API server:

  ```bash
  docker-compose up -d
  ```

The API will be running on port 8766.
- Ingest audio files:

  ```bash
  ./docker-ingest.sh path/to/audio.wav "meeting,important"
  ```

- Query data (CLI):

  ```bash
  ./docker-query.sh
  ```

- Query data (API):

  ```bash
  curl http://localhost:8766/api/v1/audio
  curl http://localhost:8766/api/v1/stats
  ```

- Run tests:

  ```bash
  ./docker-test.sh
  ```

Helper scripts:
- `docker-init.sh` - Initialize database and directory structure
- `docker-ingest.sh <file> [tags]` - Ingest audio files into the datalake
- `docker-query.sh` - Open interactive query interface
- `docker-test.sh [args]` - Run pytest tests in container
The container uses isolated environment variables (defined in Dockerfile and docker-compose.yml):
```bash
DATA_DIR=/data              # Persistent data storage (Docker volume)
DB_FILE=/data/datalake.db   # SQLite database location
LOG_DIR=/app/logs           # Log files (Docker volume)
PROJECT_ROOT=/app           # Application root
SCHEMA_FILE=/app/schema.sql # Database schema
```

Three persistent volumes are created:
- `datalake-data` - Audio, transcripts, screenshots, and database
- `datalake-logs` - Application logs
- `datalake-db` - Reserved for future use
To inspect volumes:
```bash
docker volume ls | grep datalake
docker volume inspect datalake-data
```

To backup data:

```bash
docker run --rm -v datalake-data:/data -v $(pwd):/backup alpine tar czf /backup/datalake-backup.tar.gz /data
```

```
~/Programs/datalake/
├── README.md           # This file
├── schema.sql          # Database schema
├── datalake.db         # SQLite database
├── pyproject.toml      # Python project configuration
├── uv.lock             # Dependency lock file
├── scripts/
│   ├── init.sh         # Initialize database and directories
│   ├── ingest-audio.sh # Ingest audio files
│   └── query.sh        # Interactive query interface
├── tests/              # Comprehensive test suite
│   ├── conftest.py
│   ├── test_database.py
│   ├── test_ingestion.py
│   ├── test_init_script.py
│   └── test_queries.py
├── logs/               # Log files
│   ├── init.log
│   ├── ingest-audio.log
│   └── query.log
└── data/               # Data storage (or symlink to SSD)
    ├── audio/YYYY/MM/DD/*.wav
    ├── transcripts/YYYY/MM/DD/*.txt
    └── screenshots/YYYY/MM/DD/*.png
```
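The dated layout under `data/` can be sketched as a small path helper (illustrative only; the real ingestion scripts may construct paths differently):

```python
from datetime import date
from pathlib import Path

def dated_path(root: str, kind: str, filename: str, d: date) -> Path:
    """Build the data/<kind>/YYYY/MM/DD/<filename> destination path."""
    return Path(root) / kind / f"{d:%Y/%m/%d}" / filename

print(dated_path("data", "audio", "rec.wav", date(2026, 1, 10)))
# data/audio/2026/01/10/rec.wav
```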
Stores audio file metadata.
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| file_path | TEXT | Relative path from data directory |
| filename | TEXT | Current filename |
| original_filename | TEXT | Original filename before ingestion |
| duration_seconds | REAL | Audio duration in seconds |
| format | TEXT | Audio format (wav, mp3, flac, etc.) |
| sample_rate | INTEGER | Sample rate in Hz |
| channels | INTEGER | Number of audio channels |
| size_bytes | INTEGER | File size in bytes |
| tags | TEXT | Comma-separated tags |
| created_at | TEXT | ISO 8601 timestamp |
| ingested_at | TEXT | When file was added to datalake |
| metadata | TEXT | Additional JSON metadata |
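Since tags are stored as a comma-separated string, exact-tag filters need delimiter padding so that `meeting` doesn't match `pre-meeting`. A minimal sketch against a pared-down version of this table (a column subset, not the project's `schema.sql`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE audio (
    id INTEGER PRIMARY KEY, filename TEXT, tags TEXT, created_at TEXT)""")
con.execute("INSERT INTO audio (filename, tags, created_at) VALUES (?, ?, ?)",
            ("standup.wav", "meeting,work", "2026-01-10T12:00:00+05:30"))

# Wrap both sides in commas so LIKE matches whole tags only.
rows = con.execute(
    "SELECT filename FROM audio "
    "WHERE ',' || tags || ',' LIKE ? ORDER BY created_at DESC",
    ("%,meeting,%",),
).fetchall()
print(rows)  # [('standup.wav',)]
```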
Stores transcript files and content.
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| file_path | TEXT | Relative path from data directory |
| filename | TEXT | Transcript filename |
| audio_id | INTEGER | Foreign key to audio table |
| content | TEXT | Full transcript text |
| word_count | INTEGER | Number of words |
| language | TEXT | Language code (e.g., 'en') |
| confidence | REAL | Transcription confidence (0.0-1.0) |
| provider | TEXT | Provider (e.g., 'assemblyai', 'whisper') |
| size_bytes | INTEGER | File size in bytes |
| tags | TEXT | Comma-separated tags |
| created_at | TEXT | ISO 8601 timestamp |
| ingested_at | TEXT | When file was added to datalake |
| metadata | TEXT | Additional JSON metadata |
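Because `audio_id` links each transcript to its recording, the two tables join directly. An illustrative sketch using cut-down versions of both tables (column subset assumed, not the real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE audio (id INTEGER PRIMARY KEY, filename TEXT);
CREATE TABLE transcripts (
    id INTEGER PRIMARY KEY, filename TEXT, audio_id INTEGER,
    word_count INTEGER, FOREIGN KEY (audio_id) REFERENCES audio(id));
INSERT INTO audio (id, filename) VALUES (1, 'standup.wav');
INSERT INTO transcripts (filename, audio_id, word_count)
    VALUES ('standup.txt', 1, 250);
""")

# Join each transcript back to its source recording.
row = con.execute("""
    SELECT a.filename, t.filename, t.word_count
    FROM transcripts t JOIN audio a ON t.audio_id = a.id
""").fetchone()
print(row)  # ('standup.wav', 'standup.txt', 250)
```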
Stores screenshot metadata.
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| file_path | TEXT | Relative path from data directory |
| filename | TEXT | Screenshot filename |
| width | INTEGER | Image width in pixels |
| height | INTEGER | Image height in pixels |
| format | TEXT | Image format (png, jpg, etc.) |
| size_bytes | INTEGER | File size in bytes |
| tags | TEXT | Comma-separated tags |
| created_at | TEXT | ISO 8601 timestamp |
| ingested_at | TEXT | When file was added to datalake |
| metadata | TEXT | Additional JSON metadata |
FTS5 virtual table for full-text search on transcripts.
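The README doesn't show the FTS declaration itself; one plausible shape is an external-content FTS5 table keyed to `transcripts` (an assumption — the actual `schema.sql` may declare it differently):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE transcripts (id INTEGER PRIMARY KEY, content TEXT);
-- External-content FTS index: text lives in transcripts, FTS stores only the index.
CREATE VIRTUAL TABLE transcripts_fts USING fts5(
    content, content='transcripts', content_rowid='id');
INSERT INTO transcripts (content) VALUES ('notes from the weekly meeting');
-- External-content tables must be populated (or kept in sync via triggers).
INSERT INTO transcripts_fts(rowid, content)
    SELECT id, content FROM transcripts;
""")
hits = con.execute(
    "SELECT rowid FROM transcripts_fts WHERE transcripts_fts MATCH 'meeting'"
).fetchall()
print(hits)  # [(1,)]
```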
Optimized indexes for common queries:
- `idx_audio_created_at`: Fast queries by date
- `idx_audio_tags`: Fast tag filtering
- `idx_audio_format`: Filter by audio format
- `idx_transcripts_created_at`: Fast queries by date
- `idx_transcripts_audio_id`: Join optimization
- `idx_transcripts_tags`: Fast tag filtering
- `idx_screenshots_created_at`: Fast queries by date
- `idx_screenshots_tags`: Fast tag filtering
The datalake runs a FastAPI server on port 8766 providing network access to the database.
Base URL: http://localhost:8766 (or laptop IP for network access)
Interactive Documentation:
- Swagger UI: `http://localhost:8766/docs`
- ReDoc: `http://localhost:8766/redoc`
Core Endpoints:
```
# Health check
GET /health

# List audio files (with pagination and filtering)
GET /api/v1/audio?limit=10&offset=0&tags=meeting

# Get specific audio file
GET /api/v1/audio/{id}

# List transcripts
GET /api/v1/transcripts?limit=10&offset=0

# Get specific transcript
GET /api/v1/transcripts/{id}

# Search transcripts (FTS5 full-text search)
GET /api/v1/search/transcripts?q=search_term&limit=10

# Get database statistics
GET /api/v1/stats

# List screenshots
GET /api/v1/screenshots?limit=10&offset=0
```

Example Usage:
```bash
# From desktop, access laptop database
curl http://100.103.8.87:8766/api/v1/audio | jq

# Search transcripts
curl "http://100.103.8.87:8766/api/v1/search/transcripts?q=meeting" | jq

# Get stats
curl http://100.103.8.87:8766/api/v1/stats | jq
```

From Desktop to Laptop:
- Tailscale: `http://100.103.8.87:8766`
- Local Network: `http://192.168.29.137:8766` (find IP with `ip addr`)
CORS: Enabled for all origins (suitable for local network use)
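The `limit`/`offset` parameters above support client-side pagination. A hypothetical client sketch — the fetch callable is injected so it runs without a live server; a real client would issue `GET /api/v1/audio?limit=…&offset=…` inside it:

```python
from typing import Callable, List

def list_all_audio(fetch: Callable[[int, int], List], limit: int = 10) -> List:
    """Collect every record by paging until a short page signals the end."""
    items, offset = [], 0
    while True:
        page = fetch(limit, offset)
        items.extend(page)
        if len(page) < limit:  # short page: no more records
            return items
        offset += limit

# Usage with a stub standing in for the HTTP call:
records = list(range(23))
fake_fetch = lambda limit, offset: records[offset:offset + limit]
print(len(list_all_audio(fake_fetch)))  # 23
```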
If set up via local-bootstrapping (auto-start enabled):
```bash
# Check status
systemctl --user status datalake.service

# Restart service
systemctl --user restart datalake.service

# Stop service
systemctl --user stop datalake.service

# Start service
systemctl --user start datalake.service

# View service logs
journalctl --user -u datalake.service -f

# View container logs
cd ~/Programs/datalake
docker-compose logs -f

# Disable auto-start
systemctl --user disable datalake.service
```

Manual Docker management:
```bash
# Start server
docker-compose up -d

# Stop server
docker-compose down

# Restart server
docker-compose restart

# View logs
docker-compose logs -f

# Rebuild and restart
docker-compose build && docker-compose up -d
```

The datalake includes a Flask-based web interface on port 5050 for browsing data and monitoring.
Automated (via systemd):
```bash
# Enable and start
systemctl --user enable --now datalake-web.service

# Check status
systemctl --user status datalake-web.service

# View logs
journalctl --user -u datalake-web.service -f
```

Manual:

```bash
cd ~/Programs/datalake
uv run python3 -m web.app --port 5050
```

| URL | Description |
|---|---|
| `http://localhost:5050/` | Home page with stats |
| `http://localhost:5050/sessions` | Browse Claude sessions |
| `http://localhost:5050/voice` | Voice typing sessions |
| `http://localhost:5050/chatgpt` | ChatGPT conversations |
| `http://localhost:5050/search` | Full-text search |
| `http://localhost:5050/memory` | Memory monitoring dashboard |
| `http://localhost:5050/memory/events` | Memory event timeline |
Monitor Claude Code memory usage with real-time charts and alerts.
The memory monitoring system collects RAM metrics from Claude processes and stores them in the datalake for visualization and analysis.
Data sources:
- `/var/log/claude-memory/metrics.jsonl` - Time-series RAM data (every 10 seconds)
- `/var/log/claude-memory/events.jsonl` - Memory events (warnings, kills, restarts)
```sql
-- Memory metrics (time-series RAM data)
memory_metrics (
  pid, session_id, rss_mb, memory_rate_mb_min,
  timestamp, source_device
)

-- Memory events (warnings, kills, restarts)
memory_events (
  event_type, pid, severity, message,
  details, timestamp
)
```

Memory Dashboard (/memory):
- Real-time RAM usage chart (per PID)
- Rate of change graph (MB/min)
- Active sessions list with current RAM
- Low-memory mode toggle per session
Event Timeline (/memory/events):
- Chronological event list
- Filter by event type
- Color-coded by severity
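The metrics source is line-delimited JSON. A sketch of turning one `metrics.jsonl` record into a `memory_metrics` row — the field names come from the schema summary above, but the exact JSONL layout and sample values here are assumptions:

```python
import json

# Hypothetical sample line; real records may carry more fields.
line = ('{"pid": 4242, "session_id": "abc123", "rss_mb": 812.5, '
        '"memory_rate_mb_min": 3.2, "timestamp": "2026-01-14T10:00:00+05:30"}')

def to_metric_row(raw: str, source_device: str = "laptop") -> tuple:
    """Map one JSONL record onto the memory_metrics column order."""
    rec = json.loads(raw)
    return (rec["pid"], rec["session_id"], rec["rss_mb"],
            rec["memory_rate_mb_min"], rec["timestamp"], source_device)

print(to_metric_row(line))
```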
Flask Web UI (port 5050):
```
# Get chart data (for AJAX updates)
GET /api/memory/chart-data

# List active sessions
GET /api/memory/sessions

# Toggle low-memory mode for a session
POST /api/memory/sessions/{pid}/low-memory-mode
# Body: {"enabled": true}
```

FastAPI REST (port 8766):
```
# Get today's metrics
GET /api/v1/memory/metrics/today

# Get metrics for date range
GET /api/v1/memory/metrics/range?start=2026-01-01&end=2026-01-14

# Get today's events
GET /api/v1/memory/events/today

# Get events for date range
GET /api/v1/memory/events/range?start=2026-01-01&end=2026-01-14

# List sessions
GET /api/v1/memory/sessions

# Toggle low-memory mode
POST /api/v1/memory/sessions/{pid}/low-memory-mode?enabled=true
```

From desktop (via SSH):
```bash
# One-time ingestion
./scripts/ingest-memory.sh --once

# Watch mode (continuous)
./scripts/ingest-memory.sh --watch

# Check connectivity
./scripts/ingest-memory.sh --check
```

Direct parser usage:
```bash
# Parse local log files
uv run python3 -m parsers.memory_parser --log-dir /var/log/claude-memory

# Parse from specific device
uv run python3 -m parsers.memory_parser --device desktop

# Test mode (dry run)
uv run python3 -m parsers.memory_parser --test
```

When enabled for a session, a control file is created at:

```
/var/log/claude-memory/low-memory-mode/{pid}
```
The Claude memory monitoring hook reads this file and injects low-memory guidance.
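A sketch of the check such a hook could perform — the function name and the `base` override parameter are illustrative, not the hook's actual interface:

```python
from pathlib import Path
import tempfile

def low_memory_enabled(pid: int,
                       base: str = "/var/log/claude-memory/low-memory-mode") -> bool:
    """True when a per-PID control file exists under the control directory."""
    return (Path(base) / str(pid)).is_file()

# Demonstrate against a temporary directory instead of /var/log:
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1234").touch()
    print(low_memory_enabled(1234, tmp), low_memory_enabled(9999, tmp))
    # True False
```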
- Bash shell
- SQLite3
- Python 3.13+ (for running tests)
- `uv` (Python package manager)
- `ffprobe` (optional, for audio metadata extraction)
- Clone or navigate to the datalake directory:
  ```bash
  cd ~/Programs/datalake
  ```

- Initialize the database and directory structure:

  ```bash
  ./scripts/init.sh
  ```

This will:
- Create the SQLite database from `schema.sql`
- Set up the data directory structure
- Set proper file permissions (755 for directories, 644 for files)
- Create log directories
```bash
# Basic ingestion
./scripts/ingest-audio.sh path/to/audio.wav

# With tags
./scripts/ingest-audio.sh recording.wav "meeting,work,important"
```

The script will:
- Copy the file to `data/audio/YYYY/MM/DD/`
- Extract metadata using `ffprobe` (if available)
- Insert a record into the SQLite database
- Log all operations
```bash
./scripts/query.sh
```

This opens an interactive menu with options:
- List recent audio (last 10)
- List recent transcripts (last 10)
- List recent screenshots (last 10)
- Search transcripts (full-text)
- Show audio by tags
- Show statistics
- Custom SQL query
- Open SQLite shell
```bash
# List all audio files
sqlite3 datalake.db "SELECT filename, duration_seconds, tags FROM audio ORDER BY created_at DESC LIMIT 10;"

# Search transcripts
sqlite3 datalake.db "SELECT t.filename, snippet(transcripts_fts, 0, '>>>', '<<<', '...', 40) FROM transcripts t JOIN transcripts_fts ON t.id = transcripts_fts.rowid WHERE transcripts_fts MATCH 'search terms';"

# Get statistics
sqlite3 datalake.db "SELECT COUNT(*) as total, ROUND(SUM(duration_seconds)/60, 2) as minutes FROM audio;"
```

The project includes a comprehensive test suite with 35+ tests.
```bash
# Run all tests
uv run pytest -v

# Run specific test file
uv run pytest tests/test_database.py -v

# Run with coverage
uv run pytest --cov=. --cov-report=html
```

Test coverage includes:
- Database schema validation
- Audio ingestion workflow
- Query functionality
- Error handling
- File permissions
- Logging verification
- Full-text search
- Data integrity constraints
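In the same spirit, a self-contained sketch of a data-integrity check — illustrative only, not taken from `tests/` — verifying that SQLite rejects a transcript pointing at a nonexistent recording once foreign keys are enabled:

```python
import sqlite3

def test_foreign_keys_enforced():
    con = sqlite3.connect(":memory:")
    # SQLite leaves FK enforcement off unless explicitly enabled.
    con.execute("PRAGMA foreign_keys = ON")
    con.executescript("""
        CREATE TABLE audio (id INTEGER PRIMARY KEY);
        CREATE TABLE transcripts (
            id INTEGER PRIMARY KEY,
            audio_id INTEGER REFERENCES audio(id));
    """)
    try:
        con.execute("INSERT INTO transcripts (audio_id) VALUES (999)")
        raise AssertionError("expected IntegrityError for dangling audio_id")
    except sqlite3.IntegrityError:
        pass  # constraint correctly enforced

test_foreign_keys_enforced()
```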
All operations are logged with structured timestamps:
```
[2026-01-10T12:00:00+05:30] [INFO] Starting ingestion of: audio.wav
[2026-01-10T12:00:00+05:30] [INFO] File copied successfully
[2026-01-10T12:00:00+05:30] [INFO] Record inserted successfully with ID: 42
```
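The bracketed format can be reproduced with a tiny helper; this Python sketch is illustrative only (the actual scripts log from shell):

```python
from datetime import datetime, timezone, timedelta

def log_line(level: str, msg: str, now: datetime) -> str:
    """Format one entry as [timestamp] [LEVEL] message."""
    return f"[{now.isoformat()}] [{level}] {msg}"

ist = timezone(timedelta(hours=5, minutes=30))
print(log_line("INFO", "Starting ingestion of: audio.wav",
               datetime(2026, 1, 10, 12, 0, 0, tzinfo=ist)))
# [2026-01-10T12:00:00+05:30] [INFO] Starting ingestion of: audio.wav
```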
Log files are stored in `logs/`:
- `init.log`: Initialization operations
- `ingest-audio.log`: Audio ingestion operations
- `query.log`: Query operations
If you have existing data in ~/Programs/recordings/ and ~/Programs/transcripts/, you can migrate it:
```bash
# Initialize datalake first
./scripts/init.sh

# Migrate audio files
for file in ~/Programs/recordings/*.wav; do
  ./scripts/ingest-audio.sh "$file" "migrated,voice-typing"
done

# TODO: Create similar script for transcripts and screenshots
```

Create a migration script:
```bash
#!/usr/bin/env bash
# migrate.sh - Migrate existing data to datalake

SOURCE_AUDIO=~/Programs/recordings
SOURCE_TRANSCRIPTS=~/Programs/transcripts

# Migrate audio
echo "Migrating audio files..."
for audio in "$SOURCE_AUDIO"/*.wav; do
  [ -f "$audio" ] || continue
  echo "Ingesting: $audio"
  ./scripts/ingest-audio.sh "$audio" "migrated"
done

# TODO: Add transcript and screenshot migration
echo "Migration complete!"
```

Scripts support environment variables for flexibility:
- `DATA_DIR`: Override data directory location (default: `./data`)
- `DB_FILE`: Override database file path (default: `./datalake.db`)
- `LOG_DIR`: Override log directory (default: `./logs`)
- `PROJECT_ROOT`: Override project root (default: script parent directory)
- `SCHEMA_FILE`: Override schema file path (default: `./schema.sql`)
Example:
```bash
# Use custom data directory on SSD
export DATA_DIR=/mnt/data-ssd/datalake
./scripts/init.sh
```

- Docker Container: Containerized deployment with isolated environment ✅
- REST API: FastAPI wrapper for HTTP access (port 8766) ✅
- Web UI: Flask dashboard for browsing and searching (port 5050) ✅
- Memory Monitoring: Claude Code RAM tracking with charts ✅
- Vector Search: Add `sqlite-vss` extension for semantic search
- Screenshot Ingestion: Script for screenshot ingestion with metadata
- Transcript Ingestion: Script for transcript ingestion with word count
- Automatic Cleanup: Script for managing old files and log rotation
- Backup/Restore: Automated backup scripts with compression
- Loki Integration: Centralized logging with Grafana Loki
- S3 Backend: Optional S3-compatible storage backend
- Encryption: At-rest encryption for sensitive recordings
When making changes:
- Update relevant tests in `tests/`
- Run the full test suite: `uv run pytest -v`
- Ensure all tests pass
- Update this README if adding features
- Follow the coding style in existing scripts
Personal project - see owner information in CLAUDE.md
- GitHub: prabhanshu11
- Email: mail.prabhanshu@gmail.com