WineTz is a powerful tool that leverages sentiment analysis techniques to provide insights into the emotional tone of wine reviews.
Whether you're a wine enthusiast, a sommelier, or a data-driven marketer, our engine helps you navigate the rich landscape of wine opinions.
All requirements are listed in `requirements.txt`:
- Python 3.11.1
- nltk 3.8.1
- pandas 2.1.4
- Requests 2.31.0
- tqdm 4.66.1
- transformers 4.34.1
- Whoosh 2.7.4
- langdetect 1.0.9
```
pip3 install <module>
```
This section gives a general overview of WineTz and the main tools implemented.
### crawler/crawler.py
When WineTz starts its tasks, it prints the number of matches obtained through requests to the vivino.com API, then shows a progress bar tracking the review retrieval.
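The crawler's internals are documented in its own README; the sketch below only illustrates the paginated-download-with-progress-bar pattern, using `tqdm` from the requirements list. The `fetch_page` stub and page count are made up, and no real network request is made:

```python
from tqdm import tqdm

# Stub standing in for a vivino.com API call (no network here);
# the real endpoint and response shape are not documented in this README.
def fetch_page(page: int) -> list[str]:
    return [f"review-{page}-{i}" for i in range(3)]

total_pages = 4  # hypothetical match count reported at startup
reviews = []
for page in tqdm(range(total_pages), desc="Downloading reviews"):
    reviews.extend(fetch_page(page))

print(len(reviews))  # 12
```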
WineTz creates an output folder `/out` and, inside it, a directory for each exported dataset.
Inside the dataset directory, WineTz exports three .csv files:

- `wine.csv` contains information about wines;
- `style.csv` provides information on wine styles;
- `reviews.csv` contains the reviews of each wine.
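The exact column layout of the exported files is not documented here; as an illustration, assuming hypothetical `id`/`style_id`/`wine_id` key columns, the three files could be joined like this:

```python
import csv
import io

# Hypothetical sample data; the real column names may differ.
wines_csv = "id,name,style_id\n1,Barolo DOCG,10\n"
styles_csv = "id,style\n10,Italian Barolo\n"
reviews_csv = "wine_id,rating,text\n1,4.5,Ottimo vino\n"

wines = {r["id"]: r for r in csv.DictReader(io.StringIO(wines_csv))}
styles = {r["id"]: r for r in csv.DictReader(io.StringIO(styles_csv))}
reviews = list(csv.DictReader(io.StringIO(reviews_csv)))

# Attach the style name and the review list to each wine record.
for wine in wines.values():
    wine["style"] = styles[wine["style_id"]]["style"]
    wine["reviews"] = [r for r in reviews if r["wine_id"] == wine["id"]]

print(wines["1"]["style"])         # Italian Barolo
print(len(wines["1"]["reviews"]))  # 1
```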
A fourth file, `parameters.json`, is created automatically. It contains the parameters used for scraping; by copying it into `crawler/input/`, you can run a new scrape with the same search parameters.
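The keys inside `parameters.json` are not listed in this README; the sketch below uses made-up parameters and temporary paths purely to illustrate the save-and-copy round trip:

```python
import json
import shutil
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
dataset_dir = root / "out" / "dataset_example"  # hypothetical dataset folder
input_dir = root / "crawler" / "input"
dataset_dir.mkdir(parents=True)
input_dir.mkdir(parents=True)

# Hypothetical search parameters saved by the crawler.
params = {"country": "it", "min_rating": 3.5}
(dataset_dir / "parameters.json").write_text(json.dumps(params))

# Copy the file so the next crawl reuses the same parameters.
shutil.copy(dataset_dir / "parameters.json", input_dir / "parameters.json")

reloaded = json.loads((input_dir / "parameters.json").read_text())
print(reloaded["country"])  # it
```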
Detailed information about the web crawler is available in the README.md inside the `crawler/` directory.
### indexer/indexer.py
After the data has been downloaded, WineTz provides a suite to index it and compute sentiment analysis measures.
It is possible to download the analysis models and change the relative paths in the `sentimentAnalysis.py` module. An index folder will be created inside the main project directory.
### searcher/searcher.py
WineTz has a graphical interface for composing queries and performing searches.
Several functions are implemented within the GUI, including search filters, complex natural-language queries, and search expansion using a thesaurus and proofing tools.
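The actual thesaurus-based expansion lives in the searcher modules; the toy synonym table below only sketches the idea:

```python
# Minimal query-expansion sketch with a hand-made synonym table
# (the real searcher uses a thesaurus and proofing tools).
SYNONYMS = {
    "fruttato": ["fruity", "frutta"],
    "tannico": ["tannic"],
}

def expand_query(query: str) -> list[str]:
    """Return the original terms plus any known synonyms."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("vino fruttato"))
# ['vino', 'fruttato', 'fruity', 'frutta']
```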
## Web Scraping from vivino.com {#crawling}

- The crawler, located in the `/crawler` directory, is responsible for downloading data from vivino.com. The details of its functioning are defined in the README.md file within the corresponding directory.
- After defining the search parameters, the crawler downloads the data into the `out/dataset` folder with corresponding timestamps.
## Data Representation with dataset.json

- The `/data` folder contains a script named `dataLoad.py`, which works with two directories: `/data/input` and `/data/dataset`.
- `indexer.py` operates on a file named `dataset.json`, which has a specific structure (example in `jsonStructure.json`).
- The `dataset.json` file is created by running the `dataLoad.py` script:

  ```
  python3 dataLoad.py
  ```

- The script asks for the path of the input folder containing the .csv files:

  ```
  Type path directory of .csv from crawler or other>
  ```

  Pressing Enter without typing a path loads the .csv files automatically from the `/data/input` folder. For convenience, copy the contents of the crawler's data directory into `/data/input`.
- Two files will be created inside `/data/dataset/`: `dataset.json` and `dataset.csv`.
- `dataLoad.py` also creates a folder at `/data/dataset/archive` and writes two copies of the dataset: one inside the archive and one inside `/data/dataset/`. `dataLoad.py` is used to obtain an organized, clean file without redundancies.
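`dataLoad.py` itself is not reproduced here; the following stdlib sketch (with hypothetical columns) illustrates the duplicate-removal idea behind the clean export:

```python
import csv
import io
import json

# Hypothetical review rows with one exact duplicate.
raw = "wine_id,text\n1,Ottimo\n1,Ottimo\n2,Buono\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Remove exact duplicates while preserving the original order.
seen, clean = set(), []
for row in rows:
    key = tuple(row.items())
    if key not in seen:
        seen.add(key)
        clean.append(row)

# Export the cleaned rows as JSON, mirroring the dataset.json idea.
dataset_json = json.dumps(clean, ensure_ascii=False)
print(len(clean))  # 2
```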
## Indexing Data with Sentiment Analysis

- The `/indexer/indexer.py` script creates the index starting from the `dataset.json` file:

  ```
  python3 /indexer/indexer.py
  ```

- The script accepts parameters via the `argparse` library. The `-q` flag indexes data via q-grams:

  ```
  python3 /indexer/indexer.py -q
  ```

- The `-o` flag runs `indexer.py` with offline sentiment analysis models:

  ```
  python3 /indexer/indexer.py -o
  ```
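The README does not detail the q-gram scheme used by `-q`; a common formulation, with `#` padding as an assumed word-boundary convention, looks like:

```python
def qgrams(term: str, q: int = 3) -> list[str]:
    """Split a term into overlapping character q-grams.

    Padding with '#' marks the word boundaries, a common convention
    that lets prefixes and suffixes form full-size grams.
    """
    padded = "#" * (q - 1) + term + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

print(qgrams("vino"))
# ['##v', '#vi', 'vin', 'ino', 'no#', 'o##']
```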
## Navigate Indexed Data via the Graphical Interface

- `/searcher/searcher.py` loads a GUI to navigate the indexed collection:

  ```
  python3 /searcher/searcher.py
  ```

- `/searcher/searcher.py` is based on the code implemented in two other files, `/searcher/searcherIO.py` and `/searcher/textTools.py`.
- During startup, the GUI automatically tries to load the index at the `/../index` path, where `indexer.py` creates the dataset index by default.
- Via the GUI it is possible to set parameters and values, including different types of wines, price ranges, search fields, etc.
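How the GUI translates the flagged parameters into filters is not shown here; a plain-Python sketch of that kind of filtering over hypothetical wine records:

```python
# Hypothetical records and filter options, mirroring the GUI parameters.
wines = [
    {"name": "Barolo", "type": "red", "price": 35.0},
    {"name": "Prosecco", "type": "sparkling", "price": 12.0},
]

def apply_filters(items, wine_type=None, min_price=None, max_price=None):
    """Keep only items matching the selected type and price range."""
    result = []
    for w in items:
        if wine_type is not None and w["type"] != wine_type:
            continue
        if min_price is not None and w["price"] < min_price:
            continue
        if max_price is not None and w["price"] > max_price:
            continue
        result.append(w)
    return result

print([w["name"] for w in apply_filters(wines, wine_type="red")])  # ['Barolo']
```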
## Conduct Search Engine Benchmarks

- `/benchmark/benchmarks.py` is useful for proposing different queries, specifying the search parameters, and assigning a relevance value to each query. The values entered are then used to obtain DCG and NDCG measurements:

  ```
  python3 /benchmark/benchmarks.py
  ```

- Some queries are already defined. To change them, be careful when entering the parameters.
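One common formulation of the DCG and NDCG measures mentioned above (graded relevance divided by a log2 rank discount; the script's exact variant may differ) is:

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), 1-based rank i."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances: list[float]) -> float:
    """DCG normalized by the DCG of the ideal (descending-sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A perfectly ordered result list has NDCG = 1.0.
print(ndcg([3, 2, 1]))  # 1.0
```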
Note:
Currently, the textual procedures are implemented, tested, and working exclusively for the Italian language.
In the modules, the code is prepared for easy loading of models for other languages.
Technologies and libraries:
- Python 3.11.1
Sentiment Analysis Models:
- IT classifier: MilaNLProc/feel-it-italian-emotion [https://huggingface.co/MilaNLProc/feel-it-italian-emotion]
- EN classifier: cardiffnlp/twitter-roberta-base-sentiment-latest [https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest]
- @piltxi excellent singer after some wine and amateur developer
