This project explores handwritten digit recognition using basic machine learning algorithms written from scratch in C++.
It is a learning-focused implementation, not a finished system.
The system reads raw dataset files (MNIST format), processes them, and applies a classification algorithm.
Main components:
- Data: Represents a single data sample (image + label)
- DataHub: Handles dataset loading and association
- KNN: Implements k-nearest neighbors classification
Dataset:
- MNIST handwritten digit dataset
- Binary IDX file format
- Image and label files are manually parsed
Algorithm:
- Distance-based classification
- Uses dataset loaded through DataHub
- Example usage: K = 3
- Finds the nearest samples and predicts the label
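
The distance-based classification described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual KNN class: the `Sample` struct, the `squaredDistance` and `predictKnn` names, the choice of squared Euclidean distance, and the majority-vote tie rule are all assumptions made for the example.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sample type; the project's own Data class may look different.
struct Sample {
    std::vector<std::uint8_t> pixels;  // flattened image
    std::uint8_t label;                // digit 0-9
};

// Squared Euclidean distance. The square root is skipped because it does not
// change which neighbors are closest. Assumes both vectors have equal length.
double squaredDistance(const std::vector<std::uint8_t>& a,
                       const std::vector<std::uint8_t>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum += d * d;
    }
    return sum;
}

// Predicts a label by majority vote among the k nearest training samples.
// Ties go to the smallest digit. Assumes labels are in the range 0-9.
int predictKnn(const std::vector<Sample>& train,
               const std::vector<std::uint8_t>& query, std::size_t k) {
    std::vector<std::pair<double, int>> dist;  // (distance, label)
    dist.reserve(train.size());
    for (const Sample& s : train) {
        dist.emplace_back(squaredDistance(s.pixels, query), s.label);
    }
    k = std::min(k, dist.size());
    // Move the k smallest distances to the front without sorting everything.
    std::partial_sort(dist.begin(),
                      dist.begin() + static_cast<std::ptrdiff_t>(k),
                      dist.end());

    std::array<int, 10> votes{};
    for (std::size_t i = 0; i < k; ++i) {
        ++votes[static_cast<std::size_t>(dist[i].second)];
    }
    return static_cast<int>(
        std::max_element(votes.begin(), votes.end()) - votes.begin());
}
```

With K = 3, a call such as `predictKnn(train, query, 3)` would compare the query against every training sample, so the cost grows linearly with the size of the training set.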
Project structure:
/Data
- Data representation and parsing
/Data Hub
- Dataset loading and management
/KNN Algorithm
- KNN implementation
/archive
- Dataset files (MNIST)
/main.cpp
- Entry point
Workflow:
- Load dataset paths (train + test)
- Parse IDX files into memory
- Associate images with labels
- Run KNN on test data
- Predict labels based on nearest neighbors
Notes and known issues:
- Written for learning purposes
- Hardcoded dataset paths (needs refactor)
- No optimization for large datasets
- Error handling is minimal
- Some unstable behavior during dataset parsing
- No validation for corrupted files
- Performance drops with large input size
Planned improvements:
- Remove hardcoded paths (make configurable)
- Improve parsing robustness
- Add dataset normalization
- Optimize KNN (distance calculation, memory usage)
- Add accuracy evaluation (confusion matrix, metrics)
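
As one possible starting point for the accuracy evaluation item, the sketch below builds a 10x10 confusion matrix and derives overall accuracy from it. Nothing here exists in the project yet: `ConfusionMatrix`, `buildConfusionMatrix`, and `accuracy` are hypothetical names, and the layout `matrix[actual][predicted]` is an assumption chosen for the example.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rows are the true digit, columns are the predicted digit.
using ConfusionMatrix = std::array<std::array<std::size_t, 10>, 10>;

// Counts how often each true digit was predicted as each digit.
// Entries outside 0-9 are ignored; extra elements in the longer vector are skipped.
ConfusionMatrix buildConfusionMatrix(const std::vector<std::uint8_t>& actual,
                                     const std::vector<std::uint8_t>& predicted) {
    ConfusionMatrix m{};  // value-initialized to all zeros
    const std::size_t n = std::min(actual.size(), predicted.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (actual[i] < 10 && predicted[i] < 10) {
            ++m[actual[i]][predicted[i]];
        }
    }
    return m;
}

// Overall accuracy: diagonal entries (correct predictions) over all entries.
double accuracy(const ConfusionMatrix& m) {
    std::size_t correct = 0;
    std::size_t total = 0;
    for (std::size_t r = 0; r < 10; ++r) {
        for (std::size_t c = 0; c < 10; ++c) {
            total += m[r][c];
            if (r == c) {
                correct += m[r][c];
            }
        }
    }
    return total == 0 ? 0.0
                      : static_cast<double>(correct) / static_cast<double>(total);
}
```

Off-diagonal cells show which digits get confused with each other, which overall accuracy alone does not reveal.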
Build:
mkdir build
cd build
cmake ..
make
Run:
Make sure the dataset paths are set correctly inside main.cpp.
The goal is to understand:
- How raw data is handled in ML systems
- How simple algorithms like KNN actually work internally
- Memory and performance constraints in low-level implementations