Project work for the Natural Language Processing and Text Mining course, 2024.
This project aims to investigate the similarity between two phrases using internet search results only.
WebJaccard similarity is a measure used in natural language processing (NLP) to quantify the similarity between two terms or phrases from their web search results. It is based on the Jaccard similarity coefficient, a statistical measure of the overlap between two sets, and uses a web search engine to estimate the overlap between the sets of web pages returned for the two queries.
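As a rough illustration, WebJaccard is commonly computed from page counts: the number of hits for each phrase alone and for both phrases together. The sketch below assumes that formulation; the function name `web_jaccard`, its parameters, and the `threshold` cutoff are hypothetical and may differ from what `main.py` actually does.

```python
def web_jaccard(count_p: int, count_q: int, count_pq: int, threshold: int = 0) -> float:
    """Jaccard coefficient computed from search-engine page counts.

    count_p  -- hits for phrase P alone
    count_q  -- hits for phrase Q alone
    count_pq -- hits for the conjunctive query "P AND Q"
    threshold -- co-occurrence counts at or below this are treated as noise (assumption)
    """
    if count_pq <= threshold:
        return 0.0
    # |P union Q| = |P| + |Q| - |P intersect Q|
    denominator = count_p + count_q - count_pq
    if denominator <= 0:
        # Page counts are estimates, so they can be mutually inconsistent.
        return 0.0
    # Clamp to 1.0 for the same reason.
    return min(1.0, count_pq / denominator)


if __name__ == "__main__":
    print(web_jaccard(100, 50, 25))
```

Because search engines report approximate hit counts, the guards against a non-positive denominator and values above 1 are a defensive choice rather than part of the definition.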
API_KEYS=key1,key2,key3 (List of API keys for the Google Custom Search API)
CX=your_google_cx (The identifier of the Programmable Search Engine)
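A minimal sketch of how these two settings might be read at runtime, assuming they are exposed as environment variables (for example via a `.env` file loaded before startup). The helper names `parse_api_keys` and `load_search_config` are hypothetical and not taken from the project.

```python
from __future__ import annotations

import os
from typing import Mapping


def parse_api_keys(raw: str) -> list[str]:
    """Split a comma-separated API_KEYS value, dropping blanks and whitespace."""
    return [key.strip() for key in raw.split(",") if key.strip()]


def load_search_config(env: Mapping[str, str] | None = None) -> tuple[list[str], str]:
    """Return (api_keys, cx) from the given mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    keys = parse_api_keys(env.get("API_KEYS", ""))
    cx = env.get("CX", "").strip()
    if not keys:
        raise ValueError("API_KEYS is missing or empty")
    if not cx:
        raise ValueError("CX is missing or empty")
    return keys, cx
```

Supplying several keys lets a caller switch to the next one when a key runs out of quota; whether the project rotates keys this way is an assumption.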
conda env create --file environment.yml
conda activate nlp-project
conda env update --file environment.yml
Git Bash:
python -m venv .venv
source .venv/Scripts/activate
pip install -r requirements.txt
PowerShell (Administrator):
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python main.py
The datasets/ directory contains the three standard human-judgment datasets that the project uses as the reference for phrase similarity.
The Pearson correlation coefficient is used to measure how well WebJaccard similarity scores agree with the human similarity judgments.
The following datasets contain human-judged similarity scores ranging from 0 to 10:
- MC (Miller and Charles, 1991) - 30 pairs of terms
- RG (Rubenstein and Goodenough, 1965) - 65 pairs of terms
- WordSim353 (Finkelstein et al., 2001) - 353 pairs of terms
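The Pearson-based evaluation described above can be sketched in plain Python as follows. This is an illustrative implementation only; the function name `pearson` is hypothetical, and the project may rely on a library routine instead.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    if len(xs) < 2:
        raise ValueError("at least two pairs are required")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the WebJaccard scores and `ys` the human judgments for the same word pairs, in the same order. Pearson correlation is unchanged by positive linear rescaling, so the 0 to 10 human scale does not need to be converted before correlating.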
- Ensure each similarity score is normalized to fall between 0 and 1.
- Use specific normalizations for cosine and other embedding-based similarities.
- Check if path-based measures like Wu-Palmer are already within [0, 1], and rescale if they’re not.
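One possible way to handle the normalization steps listed above is a single min-max helper applied with the known range of each measure. The ranges used here (0 to 10 for human scores, -1 to 1 for cosine) and all function names are assumptions for illustration.

```python
def min_max_rescale(score: float, low: float, high: float) -> float:
    """Map a score from [low, high] onto [0, 1], clipping out-of-range values."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(score, low), high)
    return (clipped - low) / (high - low)


def normalize_human(score: float) -> float:
    """Human judgments, assumed to lie on a 0 to 10 scale."""
    return min_max_rescale(score, 0.0, 10.0)


def normalize_cosine(score: float) -> float:
    """Cosine similarity, which lies in [-1, 1]."""
    return min_max_rescale(score, -1.0, 1.0)


def normalize_path_based(score: float) -> float:
    """Measures expected to lie in [0, 1] already; clipping guards against stray values."""
    return min_max_rescale(score, 0.0, 1.0)
```

Clipping rather than raising on out-of-range input is a design choice: it keeps a batch run going when a measure returns a slightly out-of-range value, at the cost of hiding the anomaly.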
https://github.com/alexanderpanchenko/sim-eval
https://console.cloud.google.com/apis/api/customsearch.googleapis.com
https://programmablesearchengine.google.com/controlpanel/create