Multilingual Query-Category Relevance System
Problem Statement
This system addresses the Multilingual Query-Category Relevance task for e-commerce platforms: it determines whether a user's search query is semantically relevant to a given product category hierarchy.
Features
Core Capabilities
- Multilingual Support: Handles 20+ languages including English, Spanish, French, German, Chinese, Japanese, Korean, Russian, and more
- Advanced Rule-Based System: Sophisticated heuristics with domain-specific knowledge
- Real-time Inference: Fast batch processing with progress tracking
- Confidence Scoring: Each prediction includes confidence levels
- Interactive Dashboard: Beautiful Streamlit interface with data visualization
Algorithm Features
- Keyword Matching: Direct and partial word overlap analysis
- Category-Specific Rules: Domain knowledge for electronics, clothing, sports, beauty, etc.
- Brand Recognition: Identifies and matches common brand names
- Language Detection: Automatic language identification and handling
- Query Analysis: Length-based heuristics and sentiment detection
Installation & Setup
Quick Start
- Install Dependencies
  pip install -r requirements.txt
- Run the Application
  python3 -m streamlit run query_category_relevance_app.py
- Access the Web Interface
  - Open your browser and go to: http://localhost:8501
Usage Guide
- Data Format: Your CSV file must contain these columns:
  - Query: The search query (e.g., "red running shoes")
  - L1: Top-level category (e.g., "Sports & Outdoors")
  - L2: Mid-level category (e.g., "Athletic Shoes")
  - L3: Leaf category (e.g., "Running Shoes")
- Upload & Process:
  - Review the dataset overview and language distribution
  - Click "Run Inference" to generate predictions
  - Download results with predictions and confidence scores
- Output Format: The system generates:
  - Prediction: 1 (Relevant) or 0 (Not Relevant)
  - Confidence: Numerical confidence score (0.0 to 1.0)
  - Prediction_Label: Human-readable label
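The data contract in this guide can be sketched as a small check on the input header plus a helper that builds one output row. This is only an illustration: the function names `validate_columns` and `make_output_row` are hypothetical, and the exact label strings are assumed from the "Relevant" / "Not Relevant" wording above rather than taken from the application code.

```python
import csv
import io

REQUIRED_COLUMNS = ("Query", "L1", "L2", "L3")  # input columns named in this guide


def validate_columns(csv_text: str) -> list[dict]:
    """Parse CSV text and fail early if a required column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return list(reader)


def make_output_row(row: dict, prediction: int, confidence: float) -> dict:
    """Attach the three output fields to an input row (label text is assumed)."""
    label = "Relevant" if prediction == 1 else "Not Relevant"
    return {**row, "Prediction": prediction, "Confidence": confidence,
            "Prediction_Label": label}


sample = (
    "Query,L1,L2,L3\n"
    "red running shoes,Sports & Outdoors,Athletic Shoes,Running Shoes\n"
)
rows = validate_columns(sample)
result = make_output_row(rows[0], prediction=1, confidence=0.9)
```

Validating the header before inference gives a clear error message up front instead of a failure partway through a large file.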
Model Performance
Evaluation Metric
- Primary: F1-Score on positive class (Relevant = 1)
- Formula: F1 = (2 × Precision × Recall) / (Precision + Recall)
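For reference, the formula above can be computed directly from predicted and true labels. The sketch below is a plain-Python illustration with invented toy labels; the function name `f1_positive` is an assumption and is not part of the application.

```python
def f1_positive(y_true: list[int], y_pred: list[int]) -> float:
    """F1 on the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as zero rather than divide by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = f1_positive(y_true, y_pred)  # precision 2/3, recall 2/3, so F1 = 2/3
```

Because only the positive class is scored, correctly rejected pairs (true negatives) do not raise the metric; only relevant pairs that are found, missed, or falsely flagged affect it.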
System Strengths
- High Precision: Heuristics are designed to limit false positives
- Language Adaptability: Unicode support and cross-lingual patterns
- Domain Knowledge: Category-specific rules for better accuracy
- Scalability: Efficient batch processing for large datasets
Advanced Features
Language Detection
Automatic identification of:
- Romance Languages: Spanish, French, Italian, Portuguese
- Germanic Languages: German, Dutch, English
- Slavic Languages: Russian, Polish, Czech
- Asian Languages: Chinese, Japanese, Korean
- And more: 20+ languages supported
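One simple way to approach the first stage of such detection is to look at the Unicode script of the characters in a query. The sketch below is an assumption about how this could be done with the standard library, not the detector this system uses. It only separates coarse buckets: telling Spanish from French, or Russian from Polish, would need word-level or n-gram evidence on top. The function name `detect_script` is hypothetical.

```python
import unicodedata


def detect_script(text: str) -> str:
    """Guess a coarse script/language bucket from Unicode character names."""
    prefixes = set()
    for ch in text:
        if ch.isalpha():
            # unicodedata.name returns e.g. "HIRAGANA LETTER A"; keep the first word.
            name = unicodedata.name(ch, "")
            if name:
                prefixes.add(name.split()[0])
    # Kana is checked first because Japanese text usually mixes kana with
    # CJK ideographs, which would otherwise be read as Chinese.
    if prefixes & {"HIRAGANA", "KATAKANA"}:
        return "japanese"
    if "HANGUL" in prefixes:
        return "korean"
    if "CJK" in prefixes:
        return "chinese"
    if "CYRILLIC" in prefixes:
        return "cyrillic"
    if "LATIN" in prefixes:
        return "latin"
    return "unknown"


buckets = [detect_script(q) for q in ("zapatillas rojas", "беговые кроссовки", "跑步鞋")]
```

A short Japanese query written only in kanji would fall into the "chinese" bucket here, which is one reason a script check is at best a first pass.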
Category Intelligence
Domain-specific knowledge for:
- Electronics: Phones, laptops, cameras, audio devices
- Clothing: Shirts, shoes, accessories, seasonal wear
- Sports: Equipment, fitness items, outdoor gear
- Beauty: Makeup, skincare, fragrances
- Home & Kitchen: Appliances, furniture, decor
- Automotive: Parts, accessories, maintenance items
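Domain knowledge of this kind is often stored as a keyword table per category. The following is a minimal sketch under that assumption; the table contents and the names `DOMAIN_KEYWORDS`, `domains_for_query`, and `domain_agrees` are hypothetical, since the actual rule set is not shown here.

```python
# Hypothetical keyword table; the real rule set is not shown in this README.
DOMAIN_KEYWORDS = {
    "Electronics": {"phone", "laptop", "camera", "headphones"},
    "Clothing": {"shirt", "shoes", "jacket", "dress"},
    "Beauty": {"makeup", "skincare", "perfume", "lipstick"},
}


def domains_for_query(query: str) -> set[str]:
    """Return every domain whose keyword set shares a token with the query."""
    tokens = set(query.lower().split())
    return {domain for domain, words in DOMAIN_KEYWORDS.items() if tokens & words}


def domain_agrees(query: str, l1_category: str) -> bool:
    """True when the query's inferred domains include the given L1 category."""
    return l1_category in domains_for_query(query)


matched = domains_for_query("wireless headphones")
```

A multilingual version would need per-language keyword sets (or a translation step) for each domain; the English-only table above is just the simplest case.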
Smart Matching
- Exact Word Matching: Direct overlap scoring
- Partial Matching: Substring and similarity detection
- Brand Recognition: Common brand name identification
- Negative Sentiment: Detection of exclusion terms
- Query Complexity: Length and specificity analysis
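To make the matching ideas above concrete, here is a minimal sketch that combines exact token overlap, substring-based partial matching, and a penalty for exclusion terms. The weights (1.0, 0.5), the halving penalty, the `EXCLUSION_TERMS` list, and the function names are all assumptions chosen for illustration; they are not the system's actual scoring rules.

```python
import re

EXCLUSION_TERMS = {"not", "without", "except", "no"}  # assumed, English-only examples


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-word characters (Unicode-aware in Python 3)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def match_score(query: str, category_path: str) -> float:
    """Score in [0, 1]: an exact token match counts 1.0, a substring match 0.5."""
    q_tokens = tokenize(query)
    c_tokens = set(tokenize(category_path))
    content = [t for t in q_tokens if t not in EXCLUSION_TERMS]
    if not content or not c_tokens:
        return 0.0
    total = 0.0
    for tok in content:
        if tok in c_tokens:
            total += 1.0
        elif len(tok) >= 3 and any(
            tok in c or c in tok for c in c_tokens if len(c) >= 3
        ):
            total += 0.5  # partial credit, e.g. "shoe" inside "shoes"
    score = total / len(content)
    if len(content) < len(q_tokens):
        score *= 0.5  # the query contained at least one exclusion term
    return score


path = "Sports & Outdoors Athletic Shoes Running Shoes"
exact = match_score("running shoes", path)
partial = match_score("run shoe", path)
```

Splitting on non-word characters works for space-delimited languages; Chinese and Japanese queries would need a segmenter or character n-grams before this kind of overlap is meaningful.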
Performance Optimization
Batch Processing
- Configurable batch sizes for memory optimization
- Progress tracking with visual indicators
- Efficient tensor operations
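Batching with progress reporting can be sketched as below. The helper names `batched` and `run_in_batches`, the default batch size, and the callback signature are assumptions for illustration, not the application's code.

```python
from typing import Callable, Iterator, Optional, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(
    items: Sequence,
    predict_batch: Callable[[Sequence], list],
    batch_size: int = 256,
    on_progress: Optional[Callable[[float], None]] = None,
) -> list:
    """Apply predict_batch to each slice and report the completed fraction."""
    results: list = []
    for batch in batched(items, batch_size):
        results.extend(predict_batch(batch))
        if on_progress is not None:
            on_progress(len(results) / len(items))
    return results


progress_log: list = []
doubled = run_in_batches(
    list(range(5)),
    lambda batch: [x * 2 for x in batch],  # stand-in for the real predictor
    batch_size=2,
    on_progress=progress_log.append,
)
```

In a Streamlit app, the callback could forward the fraction to a progress bar created with `st.progress`, whose `progress` method accepts a float between 0.0 and 1.0. A smaller batch size lowers peak memory at the cost of more progress updates.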
Technical Architecture
Core Components
- Data Preprocessor: Text cleaning and normalization
- Language Detector: Multilingual text analysis
- Feature Extractor: Query and category feature engineering
- Rule Engine: Advanced heuristic scoring system
- Confidence Calculator: Prediction reliability assessment (no longer shown in the front end, but still present in the code)
- Results Manager: Output formatting and export