BeautyBytes is a full-cycle data analytics project aimed at uncovering trends, customer preferences, and brand insights in the online cosmetics industry using a dataset of 15,000 makeup products. The project includes:
- Market research & sales trend analysis
- Content-based recommendation system
- Brand performance & customer feedback analysis
- Behavioral segmentation for targeted marketing
- End-to-end Power BI dashboard

- Python (Pandas, Seaborn, Scikit-learn with TfidfVectorizer)
- Power BI (for executive visuals & trends)
- Google Colab Notebook
- Cosmetics product dataset with 14 features
- Most common product categories: Serum, Mascara, Face Oil

- Ingredient popularity: Glycerin, Retinol, Vitamin C

- Highest rated skin-type focus: Combination & Oily

- Countries with the most products sold: Italy, USA

- Preferred packaging type: Jar
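
Findings of this kind usually come from a few pandas aggregations. The sketch below is illustrative only: it runs on a tiny hand-made frame (`toy`, with invented values) rather than the project dataset. `Category`, `Skin_Type`, and `Rating` match column names used elsewhere in this README; everything else is an assumption.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; all values are invented for illustration.
toy = pd.DataFrame({
    "Category": ["Serum", "Serum", "Mascara", "Face Oil", "Serum", "Mascara"],
    "Skin_Type": ["Oily", "Combination", "Dry", "Oily", "Combination", "Dry"],
    "Rating": [4.5, 4.0, 3.0, 4.2, 4.4, 3.4],
})

# Most common product categories, most frequent first.
top_categories = toy["Category"].value_counts().head(3)

# Mean rating per skin type, highest first.
rating_by_skin = (
    toy.groupby("Skin_Type")["Rating"].mean().sort_values(ascending=False)
)

print(top_categories)
print(rating_by_skin)
```

The same two calls, pointed at the real `df`, would give the category and skin-type rankings; ingredient and packaging counts follow the same `value_counts()` pattern.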

- Uses category, main ingredient, skin type & packaging
- Built using TF-IDF and cosine similarity
- Returns 5 similar products for any selected item
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df['Combined_Features'] = df['Category'] + ' ' + df['Main_Ingredient'] + ' ' + df['Skin_Type'] + ' ' + df['Packaging_Type']

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df['Combined_Features'])
cosine_sim = cosine_similarity(tfidf_matrix)

def recommend(product_name, top_n=5):
    idx = df[df['Product_Name'] == product_name].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
    recommended = df.iloc[[i[0] for i in sim_scores]]['Product_Name'].tolist()
    return recommended

print('A user who likes Ultra Face Mask might also enjoy products with similar skin-type targets and ingredients such as: ',recommend('Ultra Face Mask'))
A user who likes Ultra Face Mask might also enjoy products with similar skin-type targets and ingredients such as:
['Magic Foundation', 'Super Cc Cream', 'Ultra Foundation', 'Ultra Eye Shadow', 'Divine Face Mask']
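
One caveat with the lookup above: it skips the first sorted hit on the assumption that it is the queried product, which may not hold when several products share identical features, and it raises an error for unknown names. The sketch below shows one possible way to guard against both. It is not the notebook's code: `recommend_safe` and the toy rows are hypothetical, and only the column names are taken from this README.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue (invented rows) using the same column names as the project dataset.
toy = pd.DataFrame({
    "Product_Name": ["Ultra Face Mask", "Divine Face Mask",
                     "Magic Foundation", "Super Mascara"],
    "Category": ["Face Mask", "Face Mask", "Foundation", "Mascara"],
    "Main_Ingredient": ["Retinol", "Retinol", "Glycerin", "Vitamin C"],
    "Skin_Type": ["Oily", "Oily", "Oily", "Dry"],
    "Packaging_Type": ["Jar", "Jar", "Tube", "Tube"],
})

feature_cols = ["Category", "Main_Ingredient", "Skin_Type", "Packaging_Type"]
# Join the descriptive columns into one text field per product.
combined = toy[feature_cols].fillna("").agg(" ".join, axis=1)
sim = cosine_similarity(TfidfVectorizer().fit_transform(combined))


def recommend_safe(product_name, top_n=5):
    """Return up to top_n other product names, most similar first."""
    matches = (toy["Product_Name"] == product_name).to_numpy().nonzero()[0]
    if len(matches) == 0:
        return []  # unknown product: nothing to recommend
    pos = matches[0]  # positional row, independent of the index labels
    order = sim[pos].argsort()[::-1]  # highest similarity first
    order = [i for i in order if i != pos][:top_n]  # drop the query explicitly
    return toy["Product_Name"].iloc[order].tolist()


print(recommend_safe("Ultra Face Mask", top_n=2))
```

Filtering out the query row by position, rather than slicing off the first result, keeps the function correct even when another product has a similarity of exactly 1.0.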
- Popularity score: Rating × log(1 + Number of Reviews)
- Brands such as HourGlass, Milk Makeup, and Becca lead in satisfaction
import numpy as np

df['Popularity_Score'] = df['Rating'] * np.log1p(df['Number_of_Reviews'])
brand_stats = df.groupby('Brand')[['Rating', 'Number_of_Reviews', 'Popularity_Score']].mean().sort_values(by='Popularity_Score', ascending=False)
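
To make the score concrete, here is a small worked example on invented rows (brands "A" and "B" are placeholders, not project data). `np.log1p(x)` computes log(1 + x), so a product with zero reviews scores 0 instead of negative infinity, and review counts are dampened so that one heavily reviewed product does not swamp the rating.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Brand": ["A", "A", "B"],
    "Rating": [4.0, 5.0, 4.5],
    "Number_of_Reviews": [0, 99, 999],
})

# Same formula as above: rating weighted by log(1 + review count).
toy["Popularity_Score"] = toy["Rating"] * np.log1p(toy["Number_of_Reviews"])

# Average per brand, most popular first.
brand_stats = (
    toy.groupby("Brand")[["Rating", "Number_of_Reviews", "Popularity_Score"]]
    .mean()
    .sort_values(by="Popularity_Score", ascending=False)
)

print(toy)
print(brand_stats)
```

Going from 99 to 999 reviews is a tenfold increase in volume but only raises the log term from about 4.6 to about 6.9, which is the dampening the score relies on.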
Pages:
- Treemap: Brand dominance
- Bar plot: Category vs Rating
- Bubble chart: Price vs Rating vs Review count
- Donut: Product origin distribution

- Italy & USA lead in product volume, but Japan & France offer higher-rated items
- Retinol and Glycerin are most associated with high ratings
- Daily-use products are more affordable and receive more reviews
- Cruelty-free products show higher pricing and better customer feedback
- Sensitive-skin products are the highest rated across the board
- _BeautyBytes_...ipynb: Full notebook with EDA + ML
- Makeup-Sales-Trend-Analysis.pdf: Power BI dashboard export
- beauty_products_clean.csv: Preprocessed data
- README.md: This file

- Integrate real-time reviews via Sephora API
- Add NLP sentiment scoring on actual text reviews
- Deploy recommendation engine with Streamlit or Flask
If you're a data team, a beauty brand, or just love analytics & ecommerce, let's talk!
Email • LinkedIn • Portfolio


