Skip to content

Latest commit

 

History

History
212 lines (156 loc) · 8.69 KB

File metadata and controls

212 lines (156 loc) · 8.69 KB

Sentence Transformer Problem Areas Analysis

Research Documentation for Academic Publication

Generated from experimental analysis using sentence-transformers/all-MiniLM-L6-v2


Executive Summary

This analysis demonstrates three critical limitations in sentence transformer models when using cosine similarity for semantic understanding. The experiments reveal fundamental issues that affect real-world applications including search systems, recommendation engines, and semantic matching tasks.

Key Findings:

  • Paraphrase variants maintain similar pairwise cosine similarity but produce inconsistent neighbor rankings
  • Semantically unrelated pairs can exhibit identical cosine similarities
  • Mathematical transformations can preserve some relationships while destroying others unpredictably

Case #2: Paraphrase Variants - Neighbor Ranking Changes

Problem Statement

Semantically equivalent paraphrased sentences maintain similar pairwise cosine similarity but show different neighbor rankings in a fixed corpus, affecting practical search results.

Experimental Setup

  • Model: sentence-transformers/all-MiniLM-L6-v2
  • Corpus Size: 31 documents across 6 domains (safety, finance, sports, AI, commerce, health)
  • Top-K Neighbors: 10
  • Evaluation Metrics: Jaccard similarity, Rank-Biased Overlap (RBO)

Test Cases

Original Pair:

  • A1: "Safety fences reduce the risk of worker injury around machines."
  • A2: "Protective barriers shield employees from accidents near industrial robots."

Paraphrased Pair:

  • A1': "By installing safety fences around machines, the risk of worker injury is reduced."
  • A2': "Employees are protected from accidents around industrial robots by protective barriers."

Results Summary

Metric Original Pair Paraphrased Pair Difference
Pairwise Cosine Similarity 0.6485 0.6107 0.0377
Jaccard Overlap (A1 vs A1') - - 1.000
Jaccard Overlap (A2 vs A2') - - 0.818
RBO Score (A1 vs A1') - - 0.949
RBO Score (A2 vs A2') - - 0.990

Sample Neighbor Rankings

Top 5 Neighbors for A1 (Original):

  1. [1.000] Safety fences reduce the risk of worker injury around machines...
  2. [0.578] Protective barriers around robot cells prevent accidents...
  3. [0.464] OSHA 1910.212 requires machine guarding to protect operators...
  4. [0.307] Light curtains can stop hazardous motion when a person enters...
  5. [0.271] CE marking indicates conformity with health and safety standards...

Top 5 Neighbors for A1' (Paraphrased):

  1. [0.947] Safety fences reduce the risk of worker injury around machines...
  2. [0.529] Protective barriers around robot cells prevent accidents...
  3. [0.460] OSHA 1910.212 requires machine guarding to protect operators...
  4. [0.286] Light curtains can stop hazardous motion when a person enters...
  5. [0.250] Regular exercise reduces the risk of cardiovascular disease...

Research Implications

  • Search Inconsistency: Equivalent queries produce different result rankings
  • User Experience Impact: Paraphrased searches yield inconsistent results
  • System Reliability: Semantic search systems show unstable behavior

Case #4: Different Pairs with Same Cosine Similarity

Problem Statement

Semantically unrelated pairs can exhibit identical cosine similarities, demonstrating that cosine similarity alone is insufficient for semantic understanding.

Experimental Setup

  • Candidate Pool: 30 diverse phrases (words, short sentences)
  • Total Comparisons: 435 pairwise similarities computed
  • Cosine Tolerance: ±0.02
  • Lexical Overlap Filter: <0.3 to ensure semantic differences

Results Summary

  • Total Similar Pairs Found: 7,307 pairs with approximately identical cosine similarities
  • Semantic Diversity: Pairs span completely different domains and concepts
  • Lexical Independence: Low lexical overlap confirms semantic differences

Sample Results (3 Random Examples)

Example 1:

  • Pair A: 'automobile' ↔ 'An eagle soared over the valley.'
  • Pair B: 'couch' ↔ 'A sparrow perched on the fence.'
  • Cosine A: 0.0642
  • Cosine B: 0.0642
  • Difference: 0.0000
  • Max Lexical Overlap: 0.000

Example 2:

  • Pair A: 'couch' ↔ 'The dog barked at the mailman.'
  • Pair B: 'He tightened the screw with a screwdriver.' ↔ 'The physician assessed the symptoms.'
  • Cosine A: 0.0015
  • Cosine B: 0.0015
  • Difference: 0.0000
  • Max Lexical Overlap: 0.100

Example 3:

  • Pair A: 'An eagle soared over the valley.' ↔ 'The automobile was parked in the garage.'
  • Pair B: 'The physician assessed the symptoms.' ↔ 'He was joyful during the celebration.'
  • Cosine A: 0.0448
  • Cosine B: 0.0448
  • Difference: 0.0000
  • Max Lexical Overlap: 0.111

Research Implications

  • Semantic Ambiguity: Cosine similarity fails to distinguish semantic relationships
  • False Positives: Unrelated content appears semantically similar
  • Evaluation Limitations: Single-metric evaluation insufficient for semantic tasks

Case #6: Diagonal Scaling with Compensation

Problem Statement

Mathematical transformations can preserve query-document dot-product rankings while significantly altering document-document cosine similarities, revealing instability in similarity measures.

Experimental Setup

  • Queries: 4 diverse search queries
  • Documents: 8 documents across multiple domains
  • Transformation: Diagonal scaling with compensation (U→U·D, V→V·D⁻¹)
  • Embedding Dimension: 384
  • Scaling Range: [0.673, 1.480]

Results Summary

Metric Value
Query→Document Rankings Preserved True (100%)
Top-K Overlap (All Queries) 8/8 (Perfect)
Document-Document Cosine Change (Frobenius Norm) 0.1687
Maximum Cosine Similarity Change 0.0426
Mean Cosine Similarity Change 0.0166

Sample Query Results

Query 1: "what are machine safety fences used for"

  • Top Result: [0.814] Safety fences reduce the risk of worker injury around machines...
  • Ranking Stability: Perfect preservation across transformation

Query 2: "how to protect employees near industrial robots"

  • Top Result: [0.547] Safety fences reduce the risk of worker injury around machines...
  • Second Result: [0.531] Protective barriers around robot cells prevent accidents...

Query 4: "vector search and semantic similarity"

  • Top Result: [0.732] Vector databases accelerate semantic search over large corpora...
  • Second Result: [0.429] Contrastive learning aligns similar sentences in embedding space...

Research Implications

  • Ranking Fragility: Document relationships unstable under transformations
  • Evaluation Robustness: Need for transformation-invariant evaluation metrics
  • System Reliability: Embedding modifications can have unpredictable effects

Overall Research Conclusions

Critical Limitations Identified

  1. Paraphrase Sensitivity

    • Semantically equivalent expressions produce inconsistent neighbor rankings
    • Impact: Unreliable search and retrieval systems
  2. Cosine Ambiguity

    • Unrelated semantic pairs exhibit identical similarity scores
    • Impact: False semantic matches in applications
  3. Transformation Instability

    • Mathematical operations preserve some relationships while destroying others
    • Impact: Fragile similarity measures under system modifications

Recommendations for Future Research

  1. Multi-Metric Evaluation

    • Combine cosine similarity with additional semantic measures
    • Develop robust evaluation frameworks beyond pairwise comparisons
  2. Paraphrase Robustness Testing

    • Systematic evaluation across paraphrase variants
    • Development of paraphrase-invariant similarity measures
  3. Transformation Stability Analysis

    • Test embedding stability under various mathematical transformations
    • Design transformation-robust similarity metrics
  4. Semantic Evaluation Beyond Cosine

    • Explore alternative similarity measures (e.g., Earth Mover's Distance, Wasserstein)
    • Investigate contextual and compositional similarity approaches

Technical Specifications

  • Model: sentence-transformers/all-MiniLM-L6-v2
  • Framework: Sentence Transformers, scikit-learn
  • Evaluation Metrics: Cosine similarity, Jaccard index, Rank-Biased Overlap
  • Random Seed: 42 (for reproducibility)
  • Corpus Domains: Safety/Industry, Finance, Sports, Web/AI, Commerce, Health

This analysis provides empirical evidence for fundamental limitations in current sentence transformer approaches to semantic similarity, offering concrete examples for academic research and system development considerations.