This project focuses on solving the Street View Character Recognition Task, which involves recognizing house numbers from street view images. The task is framed as a character recognition problem in computer vision, leveraging a dataset derived from real-world scenarios.
The dataset is sourced from the Street View House Numbers Dataset (SVHN) and has been processed for the competition. It includes:
- Training Set: 30,000 images with RGB data, encoded labels, and bounding box information.
- Validation Set: 10,000 images with the same format as the training set.
- Test Set: 40,000 images, provided without label information.
Example Data Format:
"000001.png": {
"height": [32, 32],
"label": [2, 3],
"left": [77, 98],
"top": [29, 25],
"width": [23, 26]
}A baseline implementation baseline.ipynb is provided to guide participants, which includes:
- Dataset Preparation:
  - Download and decompress the dataset.
  - Process the training and validation sets.
  - Handle empty labels by encoding them as class 10.
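The fixed-length label encoding described above can be sketched as follows. This is an illustrative snippet, not code from the baseline notebook; the function name and constants are hypothetical:

```python
EMPTY_CLASS = 10   # digits use classes 0-9; "empty" positions get class 10
MAX_DIGITS = 4     # the baseline predicts only the first four digit positions

def encode_labels(label_list, max_digits=MAX_DIGITS, empty_class=EMPTY_CLASS):
    """Pad (or truncate) a per-image digit list to a fixed-length label vector."""
    labels = list(label_list)[:max_digits]
    labels += [empty_class] * (max_digits - len(labels))
    return labels

# The example annotation above has labels [2, 3]:
print(encode_labels([2, 3]))  # [2, 3, 10, 10]
```

Images with more than four digits are truncated to the first four under this scheme, which matches the baseline's restriction to four digit positions.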
- Model Design:
  - The task is converted into a multi-digit classification problem, where each digit position is classified as 0-9 or "empty".
  - Only the first four digit positions are considered for prediction.
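Under this formulation, the model outputs logits of shape (batch, 4, 11): four positions, each an 11-way classification. A minimal decoding sketch (illustrative, not the baseline's actual code) turns these logits into house-number strings by dropping "empty" positions:

```python
import numpy as np

NUM_POSITIONS = 4   # first four digit positions
NUM_CLASSES = 11    # digits 0-9 plus class 10 for "empty"

def decode_predictions(logits):
    """Turn per-position logits of shape (batch, 4, 11) into house-number strings."""
    preds = logits.argmax(axis=-1)  # (batch, 4)
    return ["".join(str(d) for d in row if d != 10) for row in preds]

# A hand-crafted example: positions predict 2, 3, empty, empty
logits = np.zeros((1, NUM_POSITIONS, NUM_CLASSES))
logits[0, 0, 2] = 5.0
logits[0, 1, 3] = 5.0
logits[0, 2, 10] = 5.0
logits[0, 3, 10] = 5.0
print(decode_predictions(logits))  # ['23']
```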
- Loss Function:
  - A Cross-Entropy Loss is used with label smoothing. For example:
    - Original label: [1, 0, 0]
    - Smoothed label: [0.9, 0.05, 0.05]
  - Benefits:
    - Improves model generalization.
    - Reduces overfitting and overconfidence.
    - Increases robustness to label noise.
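Uniform label smoothing replaces the one-hot target with (1 - ε) · one_hot + ε / K, where K is the number of classes. A small sketch (with ε = 0.15, which reproduces the example above for K = 3; recent PyTorch versions also expose this directly via nn.CrossEntropyLoss(label_smoothing=...)):

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.15):
    """Uniform label smoothing: (1 - eps) * one_hot + eps / K."""
    one_hot = np.asarray(one_hot, dtype=float)
    k = one_hot.shape[-1]
    return (1.0 - epsilon) * one_hot + epsilon / k

print(smooth_labels([1, 0, 0]))  # [0.9, 0.05, 0.05]
```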
- Training and Validation:
  - Training follows the standard loop: zero the gradients, compute the loss, backpropagate, and update the parameters.
  - Validation is used to evaluate performance on unseen data.
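The four training steps can be illustrated with a self-contained NumPy toy (least-squares regression with an analytic gradient); in the PyTorch baseline the same steps correspond to optimizer.zero_grad(), the loss computation, loss.backward(), and optimizer.step():

```python
import numpy as np

# Toy "model": a single weight vector fit by gradient descent,
# mirroring the four steps of a standard training loop.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr = 0.1
for step in range(200):
    grad = np.zeros_like(w)               # 1. zero the gradients
    pred = X @ w
    loss = np.mean((pred - y) ** 2)       # 2. compute the loss
    grad = 2 * X.T @ (pred - y) / len(y)  # 3. backpropagate (here: analytic gradient)
    w -= lr * grad                        # 4. update the parameters

print(np.round(w, 3))  # should approach [1.0, -2.0, 0.5]
```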
- Evaluation:
  - Submit predictions on the test set for evaluation.
Participants are encouraged to improve upon the baseline by exploring the following strategies:
- Model Enhancements:
  - Use deeper or more advanced convolutional neural networks (CNNs) or other architectures.
  - Explore object detection models (e.g., the YOLO series), treating each digit as a separate detection class.
  - Leverage pre-trained scene text detection/recognition models and fine-tune them on the dataset, or train them from random initialization.
- Data Augmentation:
  - Experiment with data augmentation techniques to improve generalization.
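In practice, torchvision's transform pipeline (e.g., RandomCrop, ColorJitter) is the usual route; as a dependency-free illustration, here is a minimal NumPy sketch of two simple augmentations (the function and its parameters are illustrative):

```python
import numpy as np

def augment(image, rng, max_shift=3, brightness=0.2):
    """Random horizontal shift and brightness jitter for an HxWxC float image in [0, 1]."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    image = np.roll(image, shift, axis=1)            # horizontal translation (wrap-around)
    scale = 1.0 + rng.uniform(-brightness, brightness)
    return np.clip(image * scale, 0.0, 1.0)

rng = np.random.default_rng(42)
img = rng.random((32, 32, 3))
aug = augment(img, rng)
print(aug.shape)  # (32, 32, 3)
```

Augmentations that change digit positions (crops, shifts) should be applied consistently with the bounding-box annotations, or restricted to magnitudes small enough not to move digits out of frame.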
- Loss Function and Hyperparameter Tuning:
  - Adjust the loss function and optimize hyperparameters to improve performance.
- Model Ensemble:
  - Combine predictions from multiple models using weighted voting or other ensemble techniques.
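One common form of weighted voting averages the per-model class probabilities with normalized weights before taking the argmax. A minimal sketch (the shapes and weights are illustrative):

```python
import numpy as np

def ensemble_probs(prob_list, weights):
    """Weighted average of per-model probability arrays, then argmax per position."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(prob_list)             # (models, batch, positions, classes)
    avg = np.tensordot(weights, stacked, axes=1)
    return avg.argmax(axis=-1)

# Two hypothetical models over 1 image, 2 positions, 2 classes
p1 = np.array([[[0.6, 0.4], [0.2, 0.8]]])    # model 1 favors classes 0, 1
p2 = np.array([[[0.3, 0.7], [0.1, 0.9]]])    # model 2 favors classes 1, 1
print(ensemble_probs([p1, p2], weights=[0.7, 0.3]))  # [[0 1]]
```

Averaging probabilities (soft voting) generally outperforms hard majority voting when the models output calibrated confidences.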
Participants must submit the following:
- Test Set Results:
- Submit results on the test set through the official platform for evaluation.
- Scoring:
- ≥ 0.86: +1 point
- ≥ 0.88: +2 points
- ≥ 0.90: +3 points
- ≥ 0.92: +4 points
- Implementation Report:
  - Provide a detailed report (up to 4 pages) covering:
    - Environment setup.
    - Model design and loss function explanation.
    - Strategies for improving test set performance (e.g., data augmentation, parameter tuning).
    - Innovations or modifications made to existing methods.
    - Challenges encountered and how they were addressed.
- Code:
- Submit the complete implementation code (excluding the dataset).
Submission Format:
- Name the submission file StudentID_Name_PJ1.zip.
- Submit to the elearning platform by April 21, 2025, 23:59.
Reference Resources:
- Tianchi Competition Platform
- YOLO Series GitHub Repository
- PaddleOCR Documentation
- OpenOCR GitHub Repository
The final score is calculated as the sum of:
- Test Set Results: Maximum 4 points.
- Implementation Report: Maximum 2 points.
- Improvements and Innovations: Maximum 3 points.
If the total score exceeds 8 points, it will be capped at 8 points.
Good luck with your project!