Code collection for comparative analysis of text similarity algorithms for detecting near duplicates in JavaDoc software documentation.
similarity: similarity algorithms.helpers: functions for dataset generation and parsing, caching and similarity algorithm exhaustion.datasets: source project files and generated datasets.cache: cached similarities for the test dataset.pipeline: segmentation and normalization functions.metrics: functions for algorithm evaluation.stats: evaluation results.visualization: functions for result visualization.
- Cosine distance (via term frequency vectors)
- Longest common subsequence (token-wise)
- Levenshtein distance
- Locality sensitive hashing
- Siamese neural network (via wor2vec embeddings and LSTM)
- Word mover's distance (via word2vec)
Run one of the functions from the metrics folder, providing it with necessary datasets.