photo-search

A command-line tool for content-based photo search over Google Drive. Instead of relying on filenames or manual tagging, photo-search uses AI vision models to understand what's in each photo, then builds a searchable index that supports both keyword and semantic queries.

How It Works

photo-search operates in two phases: indexing and searching. During indexing, photos are downloaded from Google Drive, analyzed by an AI vision model to produce text descriptions, and then embedded, together with the raw image data, into a shared vector space. During searching, a user's natural-language query is matched against the index using text search, semantic similarity, or both.

The Indexing Pipeline

Indexing is the process of building a local database of photo metadata, AI-generated descriptions, and vector embeddings. The pipeline processes each photo through several stages:

1. Discovery and Download

The indexer connects to Google Drive using gog (a Google Drive CLI tool) and lists folders at the Drive root. By default, it looks for folders whose names match a set of known patterns (meadow, pool, garden, stream, snow, lily, pre-meadow), though you can target a specific folder with the --folder option. Within each folder, it enumerates all .jpg, .jpeg, and .png files, handling pagination for large folders (up to 500 photos per folder).

Each photo is downloaded to a temporary directory (/tmp/photo_search_tmp) for processing and deleted after indexing is complete.

Incremental indexing: The indexer is incremental by default. It tracks which photos have already been indexed and their last-modified timestamps from Google Drive. When you add new photos to a Drive folder and re-run photo-search index, only the new or modified photos go through the captioning and embedding pipeline -- previously indexed photos are skipped entirely. This makes re-indexing efficient, since the expensive API calls (Claude vision, Gemini embeddings) are only made for photos the system hasn't seen before. The --full flag overrides this behavior and re-processes everything from scratch.
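The incremental check can be sketched as a timestamp comparison against the local database. A minimal illustration, assuming a photos table with drive_file_id and drive_modified_time columns (the column names here are illustrative, not necessarily the actual schema):

```python
import sqlite3

def needs_indexing(db: sqlite3.Connection, file_id: str, drive_mtime: str) -> bool:
    """Return True if the photo is new or has changed on Drive since it
    was last indexed -- the incremental-indexing skip test."""
    row = db.execute(
        "SELECT drive_modified_time FROM photos WHERE drive_file_id = ?",
        (file_id,),
    ).fetchone()
    if row is None:                 # never indexed before
        return True
    return row[0] != drive_mtime    # re-process only if Drive's timestamp changed
```

Photos for which this returns False skip the Claude and Gemini calls entirely; --full simply bypasses the check.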

2. Generating Text Descriptions with Claude Vision

Each downloaded photo is sent to Claude Haiku (claude-haiku-4-5) via the Anthropic API for visual identification. The prompt asks Claude to:

  • Identify the subject in 1-2 sentences, with emphasis on species identification for plants and flowers (both common and scientific names)
  • Describe key visual features (color, shape, growth habit)
  • Note the setting (meadow, pond, garden, etc.)
  • Produce 5-10 comma-separated keyword tags

The result is parsed into two fields: a caption (the descriptive sentences) and tags (the comma-separated keywords). For example, a photo might produce:

Caption: A cluster of bright orange butterfly weed (Asclepias tuberosa) in full bloom, growing in a sunny meadow with native grasses in the background.

Tags: butterfly weed, asclepias tuberosa, orange flowers, wildflower, meadow, native plant, milkweed, summer
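The split into the two fields might be implemented along these lines, assuming Claude is prompted to label its reply with "Caption:" and "Tags:" lines (the exact prompt format is an assumption):

```python
def parse_description(response_text: str) -> tuple[str, str]:
    """Split Claude's reply into a caption string and a comma-separated
    tag string. Assumes a 'Tags:' line terminates the reply (illustrative)."""
    caption_lines: list[str] = []
    tags = ""
    for line in response_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("tags:"):
            tags = stripped[len("tags:"):].strip()
        elif stripped:
            caption_lines.append(stripped.removeprefix("Caption:").strip())
    return " ".join(caption_lines), tags
```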

If a photo exceeds the API size limit (~3.75 MB before base64 encoding), ImageMagick is used to resize it down, first to 2048x2048 at quality 80, then to 1600x1600 at quality 70 if still too large.
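The two-stage fallback could be driven by a small helper like the following. The magick command assumes ImageMagick 7 (version 6 uses convert); the function names are illustrative:

```python
import os
import subprocess

MAX_BYTES = int(3.75 * 1024 * 1024)  # approximate Claude API limit, pre-base64

# (geometry, JPEG quality) fallback stages described above
RESIZE_STAGES = [("2048x2048", "80"), ("1600x1600", "70")]

def resize_cmd(src: str, dst: str, geometry: str, quality: str) -> list[str]:
    """Build an ImageMagick resize command. The trailing '>' in the
    geometry means 'shrink only if larger' -- small images pass through."""
    return ["magick", src, "-resize", f"{geometry}>", "-quality", quality, dst]

def shrink_for_api(src: str, dst: str) -> None:
    """Apply each stage in turn until the output fits under MAX_BYTES."""
    for geometry, quality in RESIZE_STAGES:
        subprocess.run(resize_cmd(src, dst, geometry, quality), check=True)
        if os.path.getsize(dst) <= MAX_BYTES:
            return
```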

3. Creating Multimodal Embeddings with Gemini

This is the key step that makes semantic search possible. The indexer uses Gemini's embedding model (gemini-embedding-2-preview) to create a 3072-dimensional vector embedding for each photo. Critically, the embedding is not generated from the image alone or the text alone -- it is a combined multimodal embedding that fuses both the raw image data and the Claude-generated text description into a single vector.

The embed_photo function constructs a multipart request to the Gemini embedding API containing two parts:

  1. The image itself -- the raw photo bytes sent as an inline blob with the appropriate MIME type
  2. The text description -- the caption produced by Claude in the previous step

Gemini processes both parts together and returns a single embedding vector that captures the semantic meaning of the visual content and the textual description simultaneously. This combined approach is more powerful than embedding either modality alone:

  • The image contributes visual features that text can't fully describe (color palettes, spatial composition, lighting)
  • The text description contributes precise identification (species names, setting descriptions) and structured tags that the image embedding alone might not capture

The resulting 3072-dimensional float32 vector is serialized to bytes and stored directly in the SQLite database as a blob.
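A sketch of what embed_photo might assemble, and of the float32 round trip into SQLite. The part layout follows the general shape of Gemini's REST content parts, but the exact field names accepted by the embedding endpoint are assumptions here:

```python
import base64
import numpy as np

def build_embed_request(image_bytes: bytes, mime_type: str, caption: str) -> dict:
    """Assemble a two-part embedding payload: the raw image as an inline
    blob plus Claude's caption as text (field names are illustrative)."""
    return {
        "content": {
            "parts": [
                {"inline_data": {
                    "mime_type": mime_type,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
                {"text": caption},
            ]
        }
    }

def vector_to_blob(values: list[float]) -> bytes:
    """Serialize the returned embedding to a float32 blob for SQLite."""
    return np.asarray(values, dtype=np.float32).tobytes()

def blob_to_vector(blob: bytes) -> np.ndarray:
    """Load a stored blob back into a numpy vector at search time."""
    return np.frombuffer(blob, dtype=np.float32)
```

At 4 bytes per float32, each 3072-dimensional embedding occupies 12 KB in the database row.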

4. Storage in SQLite

All indexed data is stored in a single local SQLite database (photo_index.db). This one database serves both search modes: it holds the text data that powers FTS5 keyword search and the embedding vectors that power semantic similarity search. Each photo record in the photos table includes:

  • Google Drive file ID, filename, folder name, and folder ID
  • The AI-generated caption and tags (stored as text columns, used for FTS5 keyword search)
  • The multimodal embedding (stored as a raw float32 blob in the same row, loaded into numpy at search time for cosine similarity computation)
  • The embedding model identifier (recorded as gemini-embedding-2-preview+claude to reflect the combined pipeline)
  • Timestamps for when it was indexed and when it was last modified on Drive

The database also maintains an FTS5 full-text search index -- a virtual table over the caption, tags, filename, and folder name columns. SQLite triggers automatically keep the FTS index in sync whenever photos are inserted, updated, or deleted. This means a single photo-search index run populates everything needed for both text and embedding search in one pass.
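The schema might look roughly like this. Column names are illustrative, and only the insert trigger is shown (the update and delete triggers follow the same pattern):

```python
import sqlite3

SCHEMA = """
CREATE TABLE photos (
    id        INTEGER PRIMARY KEY,
    filename  TEXT,
    folder    TEXT,
    caption   TEXT,
    tags      TEXT,
    embedding BLOB
);
-- External-content FTS5 index over the searchable text columns.
CREATE VIRTUAL TABLE photos_fts USING fts5(
    caption, tags, filename, folder, content='photos', content_rowid='id'
);
-- Keep the FTS index in sync on insert (update/delete triggers are analogous).
CREATE TRIGGER photos_ai AFTER INSERT ON photos BEGIN
    INSERT INTO photos_fts(rowid, caption, tags, filename, folder)
    VALUES (new.id, new.caption, new.tags, new.filename, new.folder);
END;
"""

def open_index(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db
```

With this layout a single INSERT into photos is enough to make the photo visible to both search modes.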

The Search Process

When you run a search query, photo-search can find relevant photos in three ways:

Text Search (FTS5)

Text search uses SQLite's FTS5 engine to perform keyword matching against the caption, tags, filename, and folder name of every indexed photo. The user's query is split into individual terms, and each term is matched using OR logic -- a photo matches if any of the query terms appear in any of the indexed text fields. Results are ranked by FTS5's built-in BM25 relevance scoring.

This mode is fast and works well for exact-match queries like specific species names or tags.

photo-search search "ironweed" --mode text
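The OR-joined MATCH string could be built like this (a sketch; quoting each term keeps FTS5 from interpreting it as query syntax):

```python
def build_match_query(query: str) -> str:
    """Split the user's query into whitespace-separated terms joined with
    OR, so a photo matches if any term appears in any indexed field."""
    terms = [t.replace('"', "") for t in query.split()]
    return " OR ".join(f'"{t}"' for t in terms if t)
```

The resulting string is passed to a statement such as SELECT rowid FROM photos_fts WHERE photos_fts MATCH ? ORDER BY rank, where FTS5's rank column applies BM25 scoring by default.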

Embedding Search (Semantic Similarity)

Embedding search finds photos that are semantically similar to the query, even if the exact words don't appear in the caption or tags.

The process works as follows:

  1. The user's query text is sent to the same Gemini embedding model (gemini-embedding-2-preview) that was used during indexing, producing a 3072-dimensional query vector.
  2. All stored photo embeddings are loaded from the SQLite database into Python memory.
  3. Cosine similarity is computed in Python using numpy between the query vector and every photo's embedding vector -- this is a brute-force linear scan, not a database-level vector index.
  4. Results are sorted by similarity score (highest first) and the top matches are returned.

It's important to note that while SQLite stores the embedding vectors as blobs, it does not perform the similarity computation. Unlike text search where FTS5 handles the matching and ranking entirely within SQLite, embedding search uses SQLite only as storage. The vectors are loaded into Python and compared using numpy's dot product and norm operations. This brute-force approach is simple and works well for collections of hundreds to low thousands of photos. For significantly larger collections, the sqlite-vec extension could add approximate nearest neighbor indexing directly within SQLite, but it is not needed at the current scale.
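The brute-force scan amounts to a few lines of numpy, sketched here with illustrative names:

```python
import numpy as np

def top_matches(
    query_vec: np.ndarray,
    photos: list[tuple[str, np.ndarray]],   # (drive_file_id, embedding) pairs
    limit: int = 5,
) -> list[tuple[str, float]]:
    """Linear scan: cosine similarity between the query vector and every
    stored embedding, returned best-first."""
    q = query_vec / np.linalg.norm(query_vec)
    scored = [
        (file_id, float(np.dot(q, emb / np.linalg.norm(emb))))
        for file_id, emb in photos
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```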

Because the photo embeddings were created from a combination of image data and text description, and the query embedding is created from text using the same model, the Gemini embedding model acts as a bridge between the two modalities. A query like "tall purple wildflower in a meadow" will match photos whose combined visual+textual embedding is close in the shared vector space, even if those exact words don't appear in the stored caption.

photo-search search "tall purple wildflower in a meadow" --mode embedding

Combined Search (Default)

The default search mode runs embedding search first to find the most semantically relevant results. If the embedding search returns fewer results than the requested limit, it supplements with text search results (deduplicating by Drive file ID). This gives you the best of both approaches: semantic understanding with keyword fallback.

photo-search search "purple coneflower near the pond"
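The merge step described above can be sketched as follows (result dicts keyed by a hypothetical file_id field):

```python
def combined_search(
    embedding_hits: list[dict],
    text_hits: list[dict],
    limit: int,
) -> list[dict]:
    """Embedding results first; top up with text-search results,
    deduplicating by Drive file ID."""
    results = list(embedding_hits[:limit])
    seen = {hit["file_id"] for hit in results}
    for hit in text_hits:
        if len(results) >= limit:
            break
        if hit["file_id"] not in seen:
            results.append(hit)
            seen.add(hit["file_id"])
    return results
```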

Prerequisites

  • Python 3.11+
  • gog -- a Google Drive CLI tool, installed separately and configured with a Google account
  • API keys for both Anthropic (Claude) and Google (Gemini), stored in environment variables or in ~/.secrets
  • ImageMagick (optional) -- used to resize oversized images before sending to the Claude API

Required Environment Variables

Variable               Purpose
ANTHROPIC_API_KEY      Claude API access for vision captioning
GEMINI_API_KEY         Gemini API access for embeddings
GOG_PERSONAL_ACCOUNT   Google account identifier used by gog

These can be set in ~/.secrets or ~/claude-sessions/.claude/accounts.env -- the tool auto-loads both files on startup.

Installation

python -m venv .venv
source .venv/bin/activate
pip install -e .

Usage

Index your photos

# Index all photos in recognized folders (incremental)
photo-search index

# Index only folders matching a keyword
photo-search index --folder meadow

# Force re-index everything
photo-search index --full

# Quiet mode (JSON stats output only)
photo-search index --quiet

Search

# Default combined search
photo-search search "purple wildflower"

# Limit results
photo-search search "butterfly weed" --limit 10

# Text-only search (fast, keyword matching)
photo-search search "asclepias" --mode text

# Embedding-only search (semantic similarity)
photo-search search "orange flowers in a sunny field" --mode embedding

# JSON output
photo-search search "goldenrod" --json

Other Commands

# View index statistics
photo-search stats

# Re-generate caption for a specific photo
photo-search recaption <drive_file_id>
