A robust Python-based tool for organizing and maintaining a digital file vault with intelligent file processing, metadata extraction, and consistent naming conventions.
- Intelligent File Processing
  - Specialized processors for images, videos, PDFs, and text files
  - Automatic file type detection and categorization
  - Consistent file naming across all formats
  - Metadata extraction and preservation
  - Duplicate detection using XXH64 checksums
- Hierarchical Organization
  - Configurable date-based directory structure (none/year/month/day)
  - Type-based categorization (images, videos, documents, notes)
  - Smart subcategorization (e.g., scanned vs. regular documents)
- Metadata Management
  - EXIF data extraction from images
  - Video technical metadata (resolution, fps, HDR)
  - PDF metadata and document type detection
  - Text file analysis (word count, frontmatter parsing)
  - Preservation of creation dates and timestamps
- Database Integration
  - SQLite database for file tracking
  - Checksum-based duplicate prevention
  - Quick file lookup and metadata querying
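The checksum-based duplicate prevention listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual schema: the table name `files`, its columns, and the helpers `checksum`, `open_db`, and `register_file` are hypothetical, and `hashlib.blake2b` from the standard library stands in for XXH64 so the sketch runs without third-party packages.

```python
import hashlib
import sqlite3


def checksum(data: bytes) -> str:
    # Stand-in for XXH64 (assumption): an 8-byte BLAKE2b digest from the stdlib.
    return hashlib.blake2b(data, digest_size=8).hexdigest()


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Hypothetical schema: the UNIQUE constraint rejects repeated checksums.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "checksum TEXT UNIQUE NOT NULL, "
        "vault_path TEXT NOT NULL)"
    )
    return conn


def register_file(conn: sqlite3.Connection, data: bytes, vault_path: str) -> bool:
    """Return True if the file was new, False if its checksum already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO files (checksum, vault_path) VALUES (?, ?)",
                (checksum(data), vault_path),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Letting the database enforce uniqueness keeps the check and the insert in a single statement, so two identical files cannot both be recorded even if they arrive back to back.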
- Clone this repository
- Create a virtual environment:
    python -m venv venv
- Activate the virtual environment:
  - On Unix/MacOS:
      source venv/bin/activate
  - On Windows:
      .\venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Python 3.8 or higher
- External dependencies:
  - exiftool for image metadata extraction
  - ffprobe (part of ffmpeg) for video metadata extraction
Edit config.py to customize:
- Input directory (INBOX_DIR)
- Vault directory (VAULT_DIR)
- Date hierarchy level:
  - DATE_HIERARCHY_NONE: flat structure
  - DATE_HIERARCHY_YEAR: year folders
  - DATE_HIERARCHY_MONTH: year/month folders
  - DATE_HIERARCHY_DAY: year/month/day folders
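To make the four hierarchy levels concrete, here is a small sketch of how a target folder might be derived from a file's date. The numeric values given to the constants, the zero-padded folder names, and the helper `dated_dir` are assumptions for illustration; the real definitions live in config.py and may differ.

```python
from datetime import datetime
from pathlib import Path

# Assumed values; the actual constants in config.py may be defined differently.
DATE_HIERARCHY_NONE = 0
DATE_HIERARCHY_YEAR = 1
DATE_HIERARCHY_MONTH = 2
DATE_HIERARCHY_DAY = 3


def dated_dir(base: Path, when: datetime, level: int) -> Path:
    """Append year, month, and day folders to base, up to the chosen level."""
    parts = [f"{when.year:04d}", f"{when.month:02d}", f"{when.day:02d}"]
    # Slicing by level keeps 0, 1, 2, or 3 date components.
    return base.joinpath(*parts[:level])
```

For example, `dated_dir(Path("vault/images"), datetime(2024, 3, 9), DATE_HIERARCHY_MONTH)` would yield `vault/images/2024/03` under these assumptions.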
- Regular formats (stored in images/): jpg, jpeg, png, gif, bmp, webp, tiff, tif
- RAW formats (stored in photos/raw/): heic, arw (Sony), cr2 (Canon), nef (Nikon), raf (Fuji), dng, raw
Naming format:
d[YYYYMMDD]-t[HHMMSS]-s[BYTES]-r[WIDTHxHEIGHT][-sc######][-MAKE-MODEL].ext
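As an illustration of the image naming format, the sketch below assembles a name from already-extracted metadata. It assumes the square brackets are placeholders rather than literal characters, that the trailing bracketed parts are optional, and that the extension is lowercased. The helper `image_name` and its parameters are hypothetical; the meaning of the `sc` value is not documented here, so it is treated as an opaque number padded to six digits.

```python
from datetime import datetime
from typing import Optional


def image_name(taken: datetime, size: int, width: int, height: int, ext: str,
               sc: Optional[int] = None, make: Optional[str] = None,
               model: Optional[str] = None) -> str:
    """Build a name following d...-t...-s...-r...[-sc...][-MAKE-MODEL].ext."""
    parts = [
        taken.strftime("d%Y%m%d"),   # date component
        taken.strftime("t%H%M%S"),   # time component
        f"s{size}",                  # file size in bytes
        f"r{width}x{height}",        # pixel dimensions
    ]
    if sc is not None:
        parts.append(f"sc{sc:06d}")  # optional six-digit value (meaning assumed opaque)
    if make and model:
        parts.append(f"{make}-{model}")  # optional camera make and model
    return "-".join(parts) + "." + ext.lstrip(".").lower()
```

Under these assumptions, a 123456-byte, 4000x3000 JPEG taken on 2024-03-09 at 14:05:07 would be named `d20240309-t140507-s123456-r4000x3000.jpg`.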
- Formats: mp4, mov, avi, mkv, wmv, flv, webm, mpg, mpeg, mts, m2ts, m4v, 3gp
Naming format:
d[YYYYMMDD]-t[HHMMSS]-s[BYTES]-[RESOLUTION]-[FPS]-[DURATION][-HDR].ext
- Categories: ebooks, scanned documents, regular PDFs
Naming format:
[scanned-]d[YYYYMMDD]-t[HHMMSS]-s[BYTES]-p[PAGES][-TITLE/SCANNER].pdf
- Formats: txt, md
- Support for YAML frontmatter in markdown
- Automatic title extraction from frontmatter or content
Naming format:
d[YYYYMMDD]-t[HHMMSS]-s[BYTES]-w[WORDCOUNT][-TITLE].ext
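The frontmatter and title handling for text files might look something like the sketch below. It is deliberately simplified: it only reads flat `key: value` lines between `---` delimiters and is not a YAML parser. The helpers `split_frontmatter`, `extract_title`, and `word_count` are hypothetical, and both the fallback to the first non-empty body line and the exclusion of frontmatter from the word count are assumptions about how the tool behaves.

```python
from typing import Dict, Optional, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a leading '---' delimited block from the body.

    Only flat 'key: value' lines are handled; this is not a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            meta = {}
            for line in lines[1:i]:
                key, sep, value = line.partition(":")
                if sep:
                    meta[key.strip()] = value.strip().strip("'\"")
            return meta, "\n".join(lines[i + 1:])
    return {}, text  # no closing delimiter: treat everything as body


def extract_title(text: str) -> Optional[str]:
    meta, body = split_frontmatter(text)
    if meta.get("title"):
        return meta["title"]
    # Assumed fallback: first non-empty body line, minus leading '#' markers.
    for line in body.splitlines():
        if line.strip():
            return line.lstrip("#").strip() or None
    return None


def word_count(text: str) -> int:
    # Assumption: frontmatter is excluded from the count.
    _, body = split_frontmatter(text)
    return len(body.split())
```

A real implementation would more likely hand the frontmatter block to a YAML library so that nested keys, lists, and multi-line values are parsed correctly.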
# Process new files in inbox
python vault_builder.py
# Clean up inbox (remove empty dirs and hidden files)
python inbox_cleaner.py

Files are organized into these main directories:
- images/: Regular image files (jpg, png, etc.)
- photos/raw/: RAW photo files (arw, cr2, nef, etc.)
- videos/: All video files
- documents/: PDFs and other documents
  - documents/scanned/: Scanned documents
  - documents/ebooks/: Detected ebooks
- notes/: Text and markdown files
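The mapping from file type to top-level directory can be sketched as a simple extension lookup. The extension sets are taken from the format lists in this document; the helper `target_dir` and its `pdf_kind` argument are hypothetical, since how the tool decides that a PDF is scanned or an ebook is not described here.

```python
from pathlib import PurePosixPath
from typing import Optional

IMAGE_EXTS = {"jpg", "jpeg", "png", "gif", "bmp", "webp", "tiff", "tif"}
RAW_EXTS = {"heic", "arw", "cr2", "nef", "raf", "dng", "raw"}
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "wmv", "flv", "webm",
              "mpg", "mpeg", "mts", "m2ts", "m4v", "3gp"}
TEXT_EXTS = {"txt", "md"}

# Hypothetical labels for the PDF categories named in this document.
PDF_DIRS = {"scanned": "documents/scanned", "ebook": "documents/ebooks"}


def target_dir(filename: str, pdf_kind: str = "regular") -> Optional[str]:
    """Return the vault subdirectory for a file, or None if unsupported."""
    ext = PurePosixPath(filename).suffix.lstrip(".").lower()
    if ext in IMAGE_EXTS:
        return "images"
    if ext in RAW_EXTS:
        return "photos/raw"
    if ext in VIDEO_EXTS:
        return "videos"
    if ext == "pdf":
        return PDF_DIRS.get(pdf_kind, "documents")
    if ext in TEXT_EXTS:
        return "notes"
    return None
```

Returning `None` for unknown extensions leaves the decision about unsupported files (skip, log, or leave in the inbox) to the caller.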
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request