Saru2003/SubZam

Shazam for Subtitles

A layered subtitle retrieval system in Go that handles noisy user input—typos, phonetic errors, and paraphrasing—by combining SimHash, phonetic hashing, and semantic embeddings.
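One way to picture the layering is a cascade that tries each matcher in turn and stops at the first hit. The sketch below is illustrative only: the `Layer` type, the `search` function, the ordering of layers, and the toy index are all assumptions, not code from this repository.

```go
package main

import (
	"fmt"
	"strings"
)

// Layer is a hypothetical retrieval stage (e.g. SimHash, phonetic, semantic).
// Lookup returns a matching chunk ID and whether anything was found.
type Layer struct {
	Name   string
	Lookup func(query string) (string, bool)
}

// search walks the layers in order and returns the first match together
// with the name of the layer that produced it.
func search(query string, layers []Layer) (match, layer string, ok bool) {
	for _, l := range layers {
		if m, found := l.Lookup(query); found {
			return m, l.Name, true
		}
	}
	return "", "", false
}

func main() {
	// Toy index standing in for the real stores; the chunk ID is made up.
	index := map[string]string{"may the force be with you": "chunk-1"}

	layers := []Layer{
		{Name: "exact", Lookup: func(q string) (string, bool) {
			m, ok := index[q]
			return m, ok
		}},
		{Name: "normalized", Lookup: func(q string) (string, bool) {
			m, ok := index[strings.ToLower(strings.TrimSpace(q))]
			return m, ok
		}},
	}

	match, layer, ok := search("  May the Force be with you ", layers)
	fmt.Println(match, layer, ok)
}
```

Ordering the layers from cheapest to most expensive would keep most queries away from the embedding lookup, though whether the project short-circuits this way or merges results from all layers is not stated here.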


Features

  • SimHash for quick fuzzy matching on subtitle chunks.
  • Phonetic hashing (Double Metaphone + Levenshtein) for sound-alike queries.
  • Semantic search via OpenAI's text-embedding-ada-002 embeddings stored in PostgreSQL with pgvector.
  • Sliding window chunking to preserve full context across multi-line quotes.
  • Built with Go, Redis, and PostgreSQL.
  • Tuned for the current corpus size; LSH, BK-trees, or FAISS are options for scaling as the corpus grows.
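As a rough illustration of the first and fourth features, the sketch below chunks subtitle lines with a sliding window and fingerprints each chunk with a 64-bit SimHash. It is a minimal sketch under stated assumptions, not the repository's code: the function names, whitespace tokenization, FNV-1a token hash, and the window size and step are all choices made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// slidingWindows joins consecutive subtitle lines into overlapping chunks of
// `size` lines, advancing `step` lines at a time. With step > 1, trailing
// lines that do not fill a whole window are dropped in this simple version.
func slidingWindows(lines []string, size, step int) []string {
	if size <= 0 || step <= 0 || len(lines) == 0 {
		return nil
	}
	if len(lines) <= size {
		return []string{strings.Join(lines, " ")}
	}
	var chunks []string
	for start := 0; start+size <= len(lines); start += step {
		chunks = append(chunks, strings.Join(lines[start:start+size], " "))
	}
	return chunks
}

// simHash builds a 64-bit fingerprint: every token votes +1 or -1 on each bit
// position according to its own hash, and the sign of the total sets the bit.
func simHash(text string) uint64 {
	var weights [64]int
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				weights[i]++
			} else {
				weights[i]--
			}
		}
	}
	var fp uint64
	for i, w := range weights {
		if w > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// hammingDistance counts the differing bits between two fingerprints.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	lines := []string{"frankly my dear", "i don't give a damn", "after all", "tomorrow is another day"}
	query := simHash("frankly my deer i dont give a damn") // noisy user input
	for _, chunk := range slidingWindows(lines, 2, 1) {
		fmt.Printf("%2d bits apart: %q\n", hammingDistance(query, simHash(chunk)), chunk)
	}
}
```

Chunks that share most of their tokens with the query tend to land at a smaller Hamming distance, so a lookup can keep only the fingerprints within some threshold. That threshold is a tuning parameter and is not specified here.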

Learn More

Read the full technical write-up, with visuals, trade-offs, and real examples: https://medium.com/@sarvesh20123/building-a-robust-subtitle-search-system-with-simhash-phonetic-hashing-and-embeddings-67437e5864b1

Tech Stack

  • Language: Go
  • In-memory index: Redis
  • Semantic storage: PostgreSQL + pgvector
  • Embeddings: OpenAI text-embedding-ada-002
