This project simulates a real-world classification problem: given a set of images, the goal is to build a classifier that detects images containing superimposed text related to alcoholic beverages. The pipeline involves data exploration, analysis, and the construction of a robust classification system that achieves strong performance metrics.
- Explore the dataset and understand the inputs/outputs.
- Design and explain a classification pipeline architecture.
- Suggest and use appropriate metrics for performance assessment.
- Implement the pipeline using the provided data.
- Report results and share valuable findings.
- A set of images containing text, split into two folders:
AlcoholandNon-alcohol, based on whether the text mentions alcohol.
- Python 3.11
- Install dependencies:
pip install -r requirements.txt
- Place your images in a folder (e.g.,
Images/). - Run the main script to classify images:
python run.py Images
- The script will output a
results.csvfile with two columns:file_nameandprediction(eitheralcoholornonalcohol).
file_name,prediction
image1.jpg,alcohol
image2.jpg,nonalcohol
- The dataset consists of images with superimposed text, divided into
AlcoholandNon-alcoholcategories. - Example: An image with the text "Don't miss limited edition nightlife bourbon event" is labeled as
alcohol, while "energy drink" is labeled asnonalcohol. - The initial data exploration revealed several challenging issues that needed to be addressed:
- Inverted/flipped images with upside-down or mirrored text
- Poor text-background contrast where text color was too similar to background
- Text visibility issues with blurred or unclear text
- Word splitting/merging errors (e.g., "wat er" instead of "water")
- Character recognition errors with numbers appearing instead of letters
- Irrelevant text content that could confuse classification
To address image-level challenges, a comprehensive preprocessing pipeline was developed:
- Image orientation detection: For each image, both original and flipped versions were processed, with selection based on OCR confidence scores
- Color space separation: Images were converted to RGB and YUV color spaces to improve text visibility in cases of poor contrast
- Multi-variant processing: Each image was processed through multiple combinations (original/flipped × RGB/YUV color spaces) with the highest confidence result selected
Extracted text underwent several cleaning steps:
- Case normalization: All text converted to lowercase
- Character mapping: Numbers mapped to likely letter equivalents (e.g., '1' → 'i' or 'l')
- Punctuation and whitespace cleanup: Removed irrelevant characters and normalized spacing
- Noise reduction: Filtered out non-relevant text content
A hybrid classification system was implemented:
- Keyword-based classifier: Used curated lists of alcohol-related and non-alcohol-related terms with fuzzy string matching
- Neural network classifier: Fine-tuned DistilBERT model for text classification
- Hybrid logic: Combined both approaches, using keyword matching for clear cases and falling back to the neural network for ambiguous cases
The final pipeline showed significant improvements across all metrics:
- OCR quality: Substantial reduction in images returning "no_text" results
- Average confidence scores: Marked improvement in OCR confidence across all images
- Classification metrics: Strong performance on precision, recall, F1-score, and accuracy
- Robustness: Effective handling of various image quality issues and text extraction challenges
The majority of development time was invested in image processing and text extraction optimization, which proved critical for achieving good classification results.
- The pipeline is CPU-only and does not require a GPU.