This project investigates the key drivers of perceived wine quality using consumer-generated data from Vivino. A machine learning model was developed to predict red wine ratings, complemented by Tableau analysis to uncover patterns in pricing, geography, and perceived value. The project delivers data-driven insights into how consumers evaluate wine and highlights opportunities for pricing and product positioning.
├── data/
│   ├── clean/
│   │   ├── red_ml_final_valuescore.csv
│   │   ├── red_ml_final.csv
│   │   └── red_ml_model_input.csv
│   └── raw/
│       ├── Red.csv
│       └── Varieties.csv
│
├── figures/
│   ├── correlation matrix.png
│   ├── Model Performance by Variable.png
│   ├── ModelComparison.png
│   ├── other models prediction.png
│   ├── Pearson.png
│   ├── Python_Best-Value-Countries.png
│   ├── Python_PricevsRating-Log-Scale.png
│   ├── Python_PricevsRating.png
│   ├── Python_Top-Drivers-High-Wine-Ratings.png
│   ├── scatterplot.png
│   ├── Tableau-Dashboard_Market & Behavior.png
│   ├── Tableau-Dashboard_Value Analysis.png
│   ├── Tableau-Sheet_Best Value Wine Countries.png
│   ├── Tableau-Sheet_Global Distribution of Red Wines.png
│   ├── Tableau-Sheet_Price vs Rating.png
│   ├── Tableau-Story_Exploring Price, Popularity, and Value.png
│   └── ...
│
├── notebooks/
│   ├── data_cleaning_beatriz_update.ipynb
│   ├── data_modelling_victoria.ipynb
│   ├── data-model_exploring.ipynb
│   └── insights-storytelling_rachel.ipynb
│
├── slides/
│
├── src/
│   ├── cleaning-functions_beatriz.py
│   ├── insights-functions_rachel.py
│   └── model-functions_victoria.py
│
├── .gitattributes
├── .gitignore
├── .python-version
├── config.yaml
├── pyproject.toml
├── README.md
├── requirements.txt
└── uv.lock
Can an algorithm predict how good a wine tastes? That was the core question behind this project.
Using the Vivino Red Wine dataset from Kaggle, we built a machine learning model to predict red wine ratings based on features like price, country, grape variety, and wine age. We limited our scope to red wines only to keep the business question focused: what actually drives a high Vivino rating?
Beyond the model, we explored patterns in the data using Tableau, looking at how country, winery, and variety relate to perceived quality and value.
This project explores what drives red wine ratings using data from Vivino, one of the world's largest wine platforms. Founded in Denmark in 2010, Vivino enables users to scan wine labels, rate wines, and access reviews from a global community. With over 60 million users and millions of ratings, it represents a large-scale, consumer-driven dataset. Unlike traditional wine evaluation systems, this analysis is based on real user behaviour rather than expert opinion.
The objective is to understand how price, geography, and popularity influence wine ratings, and to identify where consumers can find the best value.
Wine quality perception has been shaped over centuries, particularly by European producers. Countries like France played a central role in defining what "premium wine" means, with regions such as Bordeaux, Burgundy, and Champagne becoming global benchmarks, supported by concepts like terroir and controlled appellations. The 1855 Bordeaux Classification, commissioned under Napoleon III, formalised a hierarchy of prestige still referenced today.
As a result, perceptions of wine quality are not purely objective. They are influenced by historical, cultural, and market-driven factors, and platforms like Vivino inherit that context.
With the rise of consumer rating platforms, the power to evaluate wine has shifted from institutions to the public. But this introduces its own dynamics:
- Wine quality is subjective, but consumer behaviour is measurable
- Ratings influence purchasing decisions
- Ratings also shape how products are positioned in the market
- Existing perceptions can influence ratings themselves
This raises a key question: are we measuring true quality, or reinforcing historical reputation?
- What makes a red wine highly rated on Vivino?
- Does price strongly correlate with rating?
- Which regions consistently offer better value?
- Which features most influence wine quality?
- Where do mismatches happen, such as expensive but low-rated wines?
Source: Kaggle, "Wine Rating and Price"
We used only red.csv to keep the scope clear and the model focused. varieties.csv was used as a reference for grape name matching but not merged directly, as it lacked reliable join keys.
| File | Description |
|---|---|
| `red.csv` | Main dataset: name, winery, country, region, price, rating, year, number of ratings |
| `varieties.csv` | Reference list of known grape variety names |
Raw data required significant preprocessing before it could be used for modelling:
- Column standardisation: Column names were lowercased and stripped of whitespace for consistency across notebooks.
- Text normalisation: Wine names and regions were normalised using Unicode NFKD normalisation and regex to remove accents and special characters.
- Year cleaning: The `year` column was coerced to numeric, and rows with invalid or missing values were dropped. A derived `wine_age` feature was created (2026 − year).
- Log transformation: `price`, `numberofratings`, and `wine_age` were right-skewed, so they were log-transformed and scaled to better suit the model.
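The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code in `src/cleaning-functions_beatriz.py`: the function names `normalise_text` and `clean_wines`, the `log_` column prefix, and the choice of `np.log1p` rather than a plain logarithm are all hypothetical, and the `name` and `region` column names are assumed from the dataset description. Scaling is left out of this sketch.

```python
import re
import unicodedata

import numpy as np
import pandas as pd

REFERENCE_YEAR = 2026  # reference year stated in the text above


def normalise_text(value: str) -> str:
    """Remove accents via NFKD, drop special characters, and lowercase."""
    decomposed = unicodedata.normalize("NFKD", value)
    # Encoding to ASCII with "ignore" discards the combining accent marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", " ", ascii_only)
    return re.sub(r"\s+", " ", cleaned).strip().lower()


def clean_wines(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps to a raw wine table (hypothetical helper)."""
    out = df.copy()
    # 1. Column standardisation: lowercase names, strip whitespace.
    out.columns = out.columns.str.strip().str.lower()
    # 2. Text normalisation of the free-text columns.
    for col in ("name", "region"):
        out[col] = out[col].astype(str).map(normalise_text)
    # 3. Year cleaning: non-numeric years become NaN and are dropped.
    out["year"] = pd.to_numeric(out["year"], errors="coerce")
    out = out.dropna(subset=["year"]).copy()
    out["wine_age"] = REFERENCE_YEAR - out["year"]
    # 4. Log transformation of the right-skewed numeric columns.
    for col in ("price", "numberofratings", "wine_age"):
        out[f"log_{col}"] = np.log1p(out[col])
    return out
```

`np.log1p` is used here only because it stays defined when a value is zero (for example, a wine from the reference year); the notebooks may use a different log variant.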
This was one of the most challenging parts of the project. The dataset contained no explicit grape variety column, only wine names and regions, which are often vague or producer-branded. We extracted variety information through:
- Regex pattern matching against a known variety list (longest match first to avoid partial clashes)
- Region-based inference using a hand-curated mapping (e.g., `pomerol` → Bordeaux Blend, `barolo` → Nebbiolo, `rioja` → Tempranillo)
- Fallback categories for wines that couldn't be identified:
  - `Rare Varieties`: real but uncommon grapes too sparse to keep as individual classes (e.g., Tannat, Mourvèdre, Gamay)
  - `Unspecified Red`: wines with intentionally vague labels (e.g., "Rosso Toscana", "Red Wine")
  - `Unknown Variety`: no identifiable grape or region information (e.g., "Grande Cuvée", "Tradition")
The final regex pattern used for variety extraction matches whole words only, with longer variety names prioritised first to avoid partial clashes.
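A minimal sketch of this matching strategy is shown below. It is not the project's actual pattern: `KNOWN_VARIETIES` is a small illustrative list standing in for the variety reference file, the helper names are hypothetical, and only the `Unknown Variety` fallback is implemented. The region mapping reuses the three examples given above.

```python
import re

# Illustrative subset only; the project matched against its variety reference list.
KNOWN_VARIETIES = ["Cabernet", "Cabernet Sauvignon", "Merlot", "Nebbiolo", "Pinot Noir"]

# Region hints taken from the examples in the text above.
REGION_TO_VARIETY = {
    "pomerol": "Bordeaux Blend",
    "barolo": "Nebbiolo",
    "rioja": "Tempranillo",
}


def build_variety_pattern(varieties):
    """Compile a whole-word pattern with longer names listed first.

    Regex alternation takes the first alternative that matches at a given
    position, so "Cabernet Sauvignon" must come before "Cabernet".
    """
    ordered = sorted(varieties, key=len, reverse=True)
    alternatives = "|".join(re.escape(v) for v in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", flags=re.IGNORECASE)


VARIETY_PATTERN = build_variety_pattern(KNOWN_VARIETIES)
CANONICAL = {v.lower(): v for v in KNOWN_VARIETIES}


def extract_variety(name, region):
    """Return a variety from the wine name, else from the region, else a fallback."""
    match = VARIETY_PATTERN.search(name)
    if match:
        return CANONICAL[match.group(0).lower()]
    return REGION_TO_VARIETY.get(region.strip().lower(), "Unknown Variety")
```

For example, `extract_variety("Reserva Cabernet Sauvignon 2015", "Maipo Valley")` resolves to the full two-word variety rather than the shorter `Cabernet`, which is the partial clash the ordering is meant to avoid.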
A Pearson correlation and pairplot confirmed that:
- Price has the strongest correlation with rating among numerical features
- Year / wine age contributes some predictive power
- NumberOfRatings adds very little signal
- Features show low inter-correlation, so multicollinearity risk is minimal
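A check along these lines can be sketched with pandas. The function name `rating_correlations` and the `rating` column name are assumptions for illustration; the notebooks may compute and plot the correlation matrix differently.

```python
import pandas as pd


def rating_correlations(df: pd.DataFrame, target: str = "rating") -> pd.Series:
    """Pearson correlation of each numeric feature with the target,
    ordered by absolute strength (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr(method="pearson")
    return corr[target].drop(target).sort_values(
        key=lambda s: s.abs(), ascending=False
    )
```

The full matrix returned by `numeric.corr()` also shows the feature-to-feature correlations, which is where a multicollinearity check would look.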
- Goal: Predict red wine ratings based on features
- Model: KNeighborsRegressor
- k-value: 45
- Preprocessing: log transformation and MinMaxScaler for Price, Number of Ratings, Wine Age
- Target: Rating
- Features: Price, Wine Age, Number of Ratings, Winery
- R² score: 0.58 (0.5787, i.e. about 57.87% of the variance in ratings explained)
In this project, we developed a K-Nearest Neighbours (KNN) regression model to predict wine ratings based on several features derived from the dataset.
First, exploratory data analysis was conducted to understand relationships between variables and detect skewed distributions. Since some numeric variables were strongly right-skewed, log transformations were applied to price, number of ratings, and wine age to improve their distributions. MinMax scaling was then applied to the numeric variables so that all of them were on comparable scales, which is important for distance-based algorithms like KNN.
We then performed feature engineering, including the creation of a wine age variable derived from the wine's production year. The dataset was split into training and testing sets using an 80/20 split to evaluate the model's ability to generalize to unseen data.
Categorical variables such as country and winery were converted into numerical format using one-hot encoding, allowing them to be used as input features in the model.
The KNN regression model was trained on the prepared training dataset. Because KNN is a distance-based algorithm, feature scaling was necessary to prevent variables with larger ranges from dominating the distance calculations. Different numbers of neighbors (k values) were tested to find the configuration with the best predictive performance; the best k-value was 45. Of the feature combinations tested, the numeric features plus winery gave the best result.
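The setup described above could look roughly like this in scikit-learn. It is a sketch under assumptions rather than the code in `src/model-functions_victoria.py`: the column names, the helper names `build_knn_pipeline` and `compare_k_values`, the `np.log1p` transform, and the random seed are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder

NUMERIC = ["price", "numberofratings", "wine_age"]  # assumed column names
CATEGORICAL = ["winery"]


def build_knn_pipeline(n_neighbors: int = 45) -> Pipeline:
    """Log-transform and MinMax-scale numerics, one-hot encode winery, then KNN."""
    numeric_steps = make_pipeline(FunctionTransformer(np.log1p), MinMaxScaler())
    preprocess = ColumnTransformer([
        ("num", numeric_steps, NUMERIC),
        # Wineries unseen during training are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("knn", KNeighborsRegressor(n_neighbors=n_neighbors)),
    ])


def compare_k_values(df: pd.DataFrame, k_values, target: str = "rating", seed: int = 42):
    """Fit one pipeline per k on an 80/20 split and return the test R² for each k."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[NUMERIC + CATEGORICAL], df[target], test_size=0.2, random_state=seed
    )
    scores = {}
    for k in k_values:
        model = build_knn_pipeline(k).fit(X_train, y_train)
        scores[k] = model.score(X_test, y_test)  # R² for regressors
    return scores
```

Fitting the scaler and encoder inside the pipeline keeps them from seeing the test rows. Comparing k values directly on the test split, as this sketch does for brevity, tends to give an optimistic score; a separate validation split or cross-validation would be a cleaner way to choose k.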
Finally, the trained model was evaluated on the test dataset using the R² score, which measures how much of the variation in the actual wine ratings is explained by the predicted ratings.
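For readers unfamiliar with the metric, R² compares the model's squared errors with those of a baseline that always predicts the mean rating. The short function below (a hypothetical helper, written in plain Python for clarity) shows the calculation.

```python
def r2_score_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model worse than that baseline scores below 0.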
Although we recognize that other models such as Random Forest could potentially perform better for predicting ratings using high-cardinality categorical variables like winery and country, the KNN approach still provided valuable insights. In particular, the analysis highlighted the strong influence of price and number of ratings on wine ratings, helping us better understand the key factors associated with higher-rated wines.
The dataset is heavily concentrated in Europe, particularly France, Italy, and Spain. This introduces a structural bias that influences observed patterns throughout the analysis.
There is only a weak relationship between price and rating. While expensive wines tend to avoid low ratings, higher price does not guarantee significantly higher quality. Ratings remain tightly clustered across all price levels: most wines fall between 3 and 4 stars regardless of cost.
The number of ratings was used as a proxy for popularity. Countries like the United States, Spain, and Italy show the highest engagement, which may reflect stronger platform usage, broader distribution, or higher accessibility. Notably, many of the most reviewed wines are not the most expensive, suggesting that accessibility and visibility drive popularity more than price alone.
A value score (rating ÷ price) was used to identify wines offering the best quality per euro. Countries such as Chile, Portugal, and Spain rank highest in value. Filtering out countries with small sample sizes ensures more reliable comparisons, and the results demonstrate that high-quality wines are not limited to traditional premium regions.
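The value ranking can be sketched as below. The function name `best_value_countries`, the column names, and the `min_wines` threshold are assumptions for illustration; the text does not state the cutoff the project used for small sample sizes.

```python
import pandas as pd


def best_value_countries(df: pd.DataFrame, min_wines: int = 30) -> pd.DataFrame:
    """Rank countries by mean value score (rating / price).

    Countries with fewer than `min_wines` wines are excluded; the default
    of 30 is a placeholder, not the project's actual threshold.
    """
    scored = df.assign(value_score=df["rating"] / df["price"])
    summary = scored.groupby("country")["value_score"].agg(["mean", "count"])
    summary = summary[summary["count"] >= min_wines]
    return summary.sort_values("mean", ascending=False)
```

Because the score divides by price, very cheap wines dominate it; that is one reason the sample-size filter matters when comparing countries.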
KNN has limits for this problem. KNN struggles with high-cardinality categorical features after one-hot encoding: adding all categorical variables (winery, region, variety, name) actually reduced R², suggesting they introduced more noise than signal. Tree-based models like Random Forest or Gradient Boosting would likely handle this better.
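One plausible mechanism behind this, offered as an illustration rather than a diagnosis of the project's model, is that every one-hot mismatch adds a fixed amount to the Euclidean distance, which can outweigh differences in the scaled numeric features. The toy vectors and the `encode` helper below are hypothetical.

```python
import math


def encode(numeric, winery, wineries):
    """Concatenate scaled numeric features with a one-hot winery vector."""
    one_hot = [1.0 if winery == w else 0.0 for w in wineries]
    return list(numeric) + one_hot


wineries = ["winery_a", "winery_b", "winery_c"]

# Same winery, numeric features half the scaled range apart.
same_winery_far = math.dist(
    encode([0.0, 0.0, 0.0], "winery_a", wineries),
    encode([0.5, 0.5, 0.5], "winery_a", wineries),
)

# Identical numeric features, different winery.
diff_winery_identical = math.dist(
    encode([0.2, 0.2, 0.2], "winery_a", wineries),
    encode([0.2, 0.2, 0.2], "winery_b", wineries),
)
```

Here `diff_winery_identical` equals the square root of 2 and exceeds `same_winery_far`, so two numerically identical wines from different wineries count as further apart than two quite different wines from the same winery. Each additional one-hot encoded column (region, variety, name) can add another such fixed penalty.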
`variety_ml` was expensive to build. Extracting grape variety from unstructured wine names required a combination of regex, region inference, and manual rule-curation. It was one of the most time-consuming parts of the project, and it still doesn't cover every case perfectly. That said, we're proud of how much structure we managed to extract from what was essentially free text.
Rating prediction is inherently hard. Wine ratings are subjective, reviewer-influenced, and contextual. Even the best-structured model will have a ceiling on how much variance it can explain from price, region, and variety alone.
Wine ratings are influenced by multiple factors, and no single variable, including price, determines quality on its own. Ratings are tightly clustered, popularity is driven more by accessibility than price, and strong value can be found across diverse regions well beyond the traditional premium names.
Perceptions of "premium wine" are shaped by centuries of historical and cultural context. Platforms like Vivino inherit that legacy and, with it, the risk of reinforcing rather than challenging it.
What we think of as premium wine is often shaped by history, not just by the data.
Visual exploration was performed using Python and Tableau.
Python libraries used:
- `pandas`: data manipulation and feature engineering
- `numpy`: numerical operations and log transformations
- `matplotlib` & `seaborn`: histograms, scatterplots, correlation heatmaps, bar charts
- `scipy`: KS normality tests and Q-Q plots
- `sklearn`: modelling, scaling, train/test split, KNN
Tableau was used for interactive exploration of country, winery, and variety-level patterns, allowing drill-down by region and price tier. https://public.tableau.com/app/profile/rachel.vianna/viz/Vivino_Insights/Dashboard1
pandas
numpy
matplotlib
seaborn
scipy
scikit-learn
pyyaml
Prezi Presentation https://prezi.com/p/cazr8sktltmc/?present=1
Beatriz Fernandes · Rachel Vianna · Victoria Cano
Bootcamp group project, Week 7