shingles

Fast, privacy-aware client-side near-duplicate detection in the browser
A lightweight Rust library compiled to WebAssembly (via wasm-bindgen) for computing MinHash signatures and Locality-Sensitive Hashing (LSH) indices to find similar/duplicate texts entirely in the browser, with optional salting for privacy-preserving similarity checks.

No server needed. Runs offline. Perfect for PWAs, note-taking apps, plagiarism checkers, private document comparison, or client-side data cleaning.


✨ Features

  • MinHash signatures for estimating Jaccard similarity between texts
  • Locality-Sensitive Hashing (LSH) with banding for fast candidate duplicate detection
  • Privacy-preserving mode: optional per-document salt for blind similarity comparison (same salt → comparable signatures; different salt → no leakage)
  • Efficient shingling (k-grams on bytes, UTF-8 safe)
  • Zero-copy slices where possible for fast WASM ↔ JS interop
  • Small footprint: the WASM binary stays tiny (~500KB–1MB gzipped)
  • Offline-first: works in browsers, PWAs, even on low-end devices
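Shingling itself is simple to picture. Here is a plain-JS sketch of byte-level k-grams and the Jaccard similarity they support — illustrative only, not the library's Rust internals:

```javascript
// Illustrative: split a string into overlapping k-byte windows ("shingles").
// Like the library, we shingle UTF-8 bytes, via TextEncoder.
function byteShingles(text, k) {
  const bytes = new TextEncoder().encode(text);
  const shingles = new Set();
  for (let i = 0; i + k <= bytes.length; i++) {
    // Join byte values so each shingle gets a stable string key in the Set
    shingles.add(bytes.slice(i, i + k).join(","));
  }
  return shingles;
}

const a = byteShingles("the quick brown fox", 5);
const b = byteShingles("the quick brown cat", 5);

// Jaccard similarity = |intersection| / |union| of the two shingle sets
const inter = [...a].filter(s => b.has(s)).length;
const union = new Set([...a, ...b]).size;
console.log((inter / union).toFixed(2)); // → 0.67
```

MinHash exists precisely so you never have to materialize and intersect these sets for every document pair; it compresses each set into a fixed-size signature.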

Installation

Via npm (recommended for web projects)

npm install shingles
# or
yarn add shingles
# or
pnpm add shingles

Usage (JavaScript / TypeScript)

shingles is designed to be used in a background thread (Web Worker) for large datasets, but the API is simple enough for main-thread use on smaller collections.

import init, { Deduplicator } from 'shingles';

async function main() {
  await init(); 

  // 128 hashes divided into 16 bands (Threshold ~0.7-0.8)
  // use a BigInt for the seed (n suffix required in JS)
  const seed = 42n;
  const dedup = new Deduplicator(128, 16, seed); 

  const privacySalt = "user-secret-key-123";

  // add documents
  const docs = [
    { id: 1, text: "The quick brown fox jumps over the lazy dog." },
    { id: 2, text: "The quick brown fox leaps over the lazy dog!" }, // near dupe
    { id: 3, text: "I love coding in Rust and WebAssembly." }        // different
  ];

  for (const doc of docs) {
    const shingless = Deduplicator.get_shingless(doc.text, 5);
    const signature = dedup.compute_minhash(shingless, privacySalt);
    dedup.add_document(doc.id, signature);
  }

  // query for near-duplicates
  const queryText = "A quick brown fox jumps over a lazy dog.";
  const queryshingless = Deduplicator.get_shingless(queryText, 5);
  const querySig = dedup.compute_minhash(queryshingless, privacySalt);

  // LSH returns candidates in near-constant time instead of scanning all N documents
  const candidates = dedup.query_candidates(querySig);
  console.log("Candidate IDs:", candidates); // Output: [1, 2]

  // fine-grained similarity check
  // compare the query against the most likely candidate (ID: 1)
  const sig1 = dedup.compute_minhash(Deduplicator.get_shingless(docs[0].text, 5), privacySalt);
  const similarity = dedup.estimate_jaccard(querySig, sig1);
  
  console.log(`Estimated Similarity: ${(similarity * 100).toFixed(2)}%`);
}

main();
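Under the hood, a MinHash signature stores, for each of many seeded hash functions, the minimum hash value over a document's shingles; the fraction of positions where two signatures agree estimates Jaccard similarity. A dependency-free sketch of that idea (illustrative only — not the library's Rust implementation):

```javascript
// A simple seeded string hash stands in for the independent hash
// functions MinHash requires (not cryptographic, just well-mixed).
function hashStr(s, seed) {
  let h = seed ^ 0x85ebca6b;
  for (let i = 0; i < s.length; i++) {
    h = Math.imul(h ^ s.charCodeAt(i), 2654435761);
    h = (h << 13) | (h >>> 19); // rotate to mix bits
  }
  h ^= h >>> 16; // final avalanche
  h = Math.imul(h, 0x85ebca6b);
  return (h ^ (h >>> 13)) >>> 0;
}

// Signature = element-wise minimum of each hash function over all shingles.
function minhash(shingles, numHashes) {
  const sig = new Array(numHashes).fill(0xffffffff);
  for (const sh of shingles) {
    for (let j = 0; j < numHashes; j++) {
      const h = hashStr(sh, (j * 0x9e3779b1) | 0); // spread the seeds out
      if (h < sig[j]) sig[j] = h;
    }
  }
  return sig;
}

// Fraction of matching positions ≈ Jaccard similarity of the shingle sets.
function estimateJaccard(sigA, sigB) {
  let same = 0;
  for (let j = 0; j < sigA.length; j++) if (sigA[j] === sigB[j]) same++;
  return same / sigA.length;
}

const A = new Set(Array.from({ length: 20 }, (_, i) => `s${i}`));      // s0..s19
const B = new Set(Array.from({ length: 20 }, (_, i) => `s${i + 10}`)); // s10..s29
// True Jaccard = 10 / 30 ≈ 0.33; with 128 hashes the estimate lands nearby
console.log(estimateJaccard(minhash(A, 128), minhash(B, 128)));
```

The key property: each position of the signature matches with probability equal to the Jaccard similarity, so comparing two 128-entry signatures replaces comparing two arbitrarily large shingle sets.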

Why Privacy Salt?

When two parties use the same secret salt, their MinHash signatures become comparable without revealing the original text.
Different salts produce completely incomparable signatures, making this ideal for secure, private similarity checks (e.g., two users comparing notes without ever exchanging plaintext).
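The mechanism can be sketched as hashing salt + shingle rather than the shingle alone — illustrative only, not the library's exact scheme:

```javascript
// Illustrative: folding a secret salt into the shingle hash. Two parties
// produce comparable MinHash minima only when their salts match.
function saltedHash(salt, shingle, seed) {
  let h = seed >>> 0;
  const input = salt + "\u0000" + shingle; // separator avoids boundary ambiguity
  for (let i = 0; i < input.length; i++) {
    h = Math.imul(h ^ input.charCodeAt(i), 2654435761) >>> 0;
  }
  return h;
}

const sameA = saltedHash("secret-1", "quick", 7);
const sameB = saltedHash("secret-1", "quick", 7);
const other = saltedHash("secret-2", "quick", 7);
console.log(sameA === sameB); // true: same salt, same shingle → identical hash
console.log(sameA === other); // false: different salt → unrelated value
```

Since every hash feeding the MinHash minima depends on the salt, an observer holding only a signature (and no salt) learns nothing useful about the underlying shingles.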

Performance

  • Tested on 1,000–5,000 short documents: signature computation + indexing < 500ms (browser, mid-range device)
  • Scales to 10k+ docs with Web Workers (recommended for heavy use)
  • Much faster than pure JS alternatives for large texts

Tuning the "Sensitivity" (S-Curve)

The relationship between Bands, Rows, and Similarity follows an S-curve. By adjusting the parameters in new Deduplicator(hashes, bands, seed), you control how "fuzzy" the matching is.

Goal        Hashes  Bands  Threshold (T)
Strict      128     8      ≈ 0.88
Balanced    128     16     ≈ 0.71
Permissive  128     32     ≈ 0.42
Formula: $$T \approx (1/b)^{1/r}$$ where b is the number of bands and r = hashes / b is the number of rows per band.
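With 128 hashes and b bands there are r = 128 / b rows per band, and the approximate thresholds fall straight out of the formula — a quick sanity check in plain JS:

```javascript
// LSH threshold estimate: the similarity at which two documents become
// likely to share at least one band. T ≈ (1/b)^(1/r), r = hashes / bands.
function lshThreshold(numHashes, bands) {
  const rows = numHashes / bands;
  return Math.pow(1 / bands, 1 / rows);
}

console.log(lshThreshold(128, 8).toFixed(2));  // strict     → 0.88
console.log(lshThreshold(128, 16).toFixed(2)); // balanced   → 0.71
console.log(lshThreshold(128, 32).toFixed(2)); // permissive → 0.42
```

More bands (fewer rows per band) lower the threshold: each band is easier to match, so fuzzier pairs surface as candidates.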

Building from Source

# Install wasm-pack if needed
cargo install wasm-pack

# Build for web target
wasm-pack build --target web

Roadmap

  • Web Worker helpers for non-blocking heavy operations
  • Exact similarity (Jaccard on shingles) only on candidates
  • Streaming/large-text support
  • Advanced PSI extensions for stronger privacy
  • TypeScript types & better docs

Contributing

Contributions welcome! Feel free to open issues or PRs.

  1. Fork the repo
  2. cargo build / wasm-pack build
  3. Make changes
  4. Submit PR

License

MIT © Preetham Pemmasani


If you use this in a project, star the repo and let me know — I'd love to hear about it!
Follow me on X: @ppmpreetham
