This is a personal learning project that combines image classification model training with a database system to automatically classify and retrieve information about 120 common vector species (rats, mosquitoes, cockroaches, and flies). The sections below document the work completed and the lessons learned at each stage of the project.
Demo:
demo.mp4
(Note: The high-resolution demo image is available at the bottom of this document.)
```
.
├── database/               # Database module
│   ├── data_csv/           # Raw CSV data and related scripts
│   ├── csv_to_db.py        # Script to import CSVs into the database
│   ├── queries.sql         # Sample SQL query statements
│   └── schema.sql          # Database schema definition
├── dataset/                # Dataset processing
│   ├── classes.txt         # Class label names
│   ├── download_image.py   # Script to download images
│   ├── download_link.py    # Script to extract download links
│   ├── extract_label.py    # Script to extract labels
│   └── split.py            # Script to split train/val sets
├── predict/                # Inference and lookup module
│   ├── look_up.py          # Query species info from database
│   ├── predict.py          # Main model inference script
│   └── workwork.py         # Combined "classify + lookup" script
├── train_model.ipynb       # Model training notebook using timm
├── train_torch.ipynb       # Custom model training notebook using PyTorch
├── README.md
├── LICENSE
└── .gitignore
```
Project repository: GitHub
- Learning Outcomes
- Database Construction
- Model Training
- Classification–Query Workflow
Throughout this project, I gained substantial technical skills and knowledge, and achieved meaningful growth across multiple areas.
When I first began, I consulted AI and online encyclopedias to understand what “vector organisms” are. I then clarified that the task falls under computer vision, specifically image classification within the fields of machine learning and deep learning. I searched and reviewed literature using keywords like "vector", "pest", "machine learning", and "image classification", and compiled a simple review of technical approaches in this domain.
As the project progressed, I continued exploring relevant literature to deepen my understanding. To better grasp the development of deep learning architectures, I studied classic papers including AlexNet, Transformer, and Swin Transformer. For data lineage, I referred to IBM’s technical webpages and papers on integrating lineage with machine learning, exploring how such techniques could help assess the trustworthiness of training data. I also studied survey and technical papers on image tampering detection to learn about detection techniques and their underlying algorithms.
On the implementation side, I learned the fundamentals of model training using PyTorch. I initially attempted to replicate the winning solution of the iNaturalist Competition 2021 but later switched to fine-tuning a smaller pretrained model due to limited hardware. To address potential prediction errors and fulfill users’ need for background information on species, I designed a SQLite-based database system to store species information and linked it to the classification module—creating a full “classify–query” pipeline.
In doing so, I became proficient in using SQL and Python in practical tasks. I also improved my ability to solve real-world problems by writing scripts for data collection, preprocessing, and system integration.
Additionally, I realized how crucial data quality is for model performance. I explored concepts like "data lineage" and "image credibility" to evaluate the reliability of training data, and compiled notes on these topics.
I also discovered personal research interests along the way. For instance, while reading Mora, Camilo et al.’s "How many species are there on Earth and in the ocean?", I became deeply interested in their regression-based methodology of “inferring the unknown from the known.” Meanwhile, studying SQL and deep learning gave me a more grounded understanding of my coursework—be it the computational foundations of SQL query optimization, or the calculus, statistics, and linear algebra behind deep learning models.
In particular, I came to appreciate how model interpretability is not only a technical challenge, but also a promising research direction. It could significantly enhance transparency, help explain model decisions, and even uncover new knowledge.
In summary, I developed the following core capabilities through this project:
- Literature review and research skills: Independently searching, reading, and summarizing technical papers
- Interdisciplinary learning: Gaining insights into computer vision, image forensics, data lineage, and model interpretability
- Technical proficiency: Practical experience with SQL, PyTorch, data processing, model training, database design, and system integration
*Related code is located in the `/database` directory.*
The database is built using SQLite.
The schema consists of four core tables and several bridge tables. The ER diagram is shown below:
Each table stores the following information:
- `species`: id, scientific name, Chinese name, common name, distinguishing features
- `taxonomies`: id, name, Chinese name, rank (phylum, class, order, family, genus)
- `diseases`: id, name, symptoms
- `locations`: id, name, category (province, country, region)
Advantages of using multiple tables:
- Improved consistency: Reusable entities (e.g., taxonomy, diseases, geographic locations) reduce typos and redundancy
- Faster queries: E.g., to find "species distributed in a given area," the query avoids expensive full-text search
- Better structure: Easier to store nested or hierarchical attributes like taxonomy levels and disease symptoms
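As an illustration of the second point, here is a minimal sketch of the "species distributed in a given area" query, assuming a bridge table named `species_locations(species_id, location_id)` and column names such as `scientific_name`; the actual names in `schema.sql` may differ.

```python
import sqlite3

# Join species -> bridge table -> locations instead of doing a full-text search.
conn = sqlite3.connect("pests.db")
rows = conn.execute(
    """
    SELECT s.scientific_name, s.common_name
    FROM species AS s
    JOIN species_locations AS sl ON sl.species_id = s.id
    JOIN locations AS l ON l.id = sl.location_id
    WHERE l.name = ?
    """,
    ("Guangdong",),  # example location value
).fetchall()
print(rows)
conn.close()
```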
To initialize the database:
The schema is already defined in `schema.sql`. Launch the SQLite shell and load it:

```
sqlite3 pests.db
.read schema.sql
```

To populate data:
Manually entering records via SQL is time-consuming for larger datasets, so the script `csv_to_db.py` automates the process (a minimal import sketch follows this list):

- Fill in the corresponding CSV files in the `data_csv/` folder
- Configure the path, table name, and mode (`replace` or `append`) at the top of `csv_to_db.py`
- Run the script to import the data into the database
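The sketch below shows the general shape of such an import, assuming pandas is available; the path, table, and mode variables mirror the settings described above, but the real `csv_to_db.py` may be implemented differently.

```python
import sqlite3
import pandas as pd

CSV_PATH = "database/data_csv/species.csv"  # hypothetical example path
TABLE = "species"
MODE = "append"  # or "replace"

# Read the CSV and write it into the corresponding table of pests.db.
conn = sqlite3.connect("pests.db")
df = pd.read_csv(CSV_PATH)
df.to_sql(TABLE, conn, if_exists=MODE, index=False)
conn.commit()
conn.close()
```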
This section introduces the entire workflow from data collection to model training, including some steps for evaluating data quality and reliability.
*Related code is located in the `/dataset` directory.*
The training dataset consists of two parts:
- 20 species selected from the open-source iNaturalist 2021 Dataset
- Images of additional species collected from the iNaturalist website (Research Grade)
Open-source dataset:
The iNaturalist Dataset was chosen due to the following advantages:
- Large scale: Over 10,000 species and 2.7M+ labeled images at the species level
- High quality: Labeled by citizen scientists and verified by experts; adheres to COCO-format standards (Van Horn et al.)
- Realistic settings: Images captured in natural conditions with diverse backgrounds — suitable for real-world application scenarios
- Easy access: Publicly available on GitHub and supported by PyTorch built-ins
Alternative datasets considered but not chosen:
- ImageNet: Although classic, it contains many irrelevant categories (e.g., vehicles) and lacks strict biological taxonomy
- IP102: Focuses on agricultural pests but suffers from inconsistent labels (misspellings, mixed taxonomic ranks) and limited coverage of the 4 vector types targeted in this project
From the iNaturalist 2021 Dataset, 20 relevant species were selected:
- 5 mosquitoes
- 3 rodents
- 4 flies
- 8 cockroaches
Each species has 50 images in the train_mini subset.
Online image sources:
After extracting partial data from open datasets, the remaining species were supplemented using animal information platforms. iNaturalist was selected again for consistency and convenience:
- Same source as iNaturalist Dataset ensures label and style consistency
- Inherits all other advantages mentioned above
- Convenient API and GBIF support for retrieval
Other platforms considered but not used:
- Encyclopedia of Life (EOL): Offers multimodal content (photos, audio, video), but has limited image data — not ideal for image classification alone
- Chinese Animal Thematic Database: Hosted by the Institute of Zoology, Chinese Academy of Sciences; comprehensive domestic records but outdated and difficult to access
Image download options:
For iNaturalist Dataset 2021:
- Download from official GitHub repo
- Use PyTorch's built-in dataset loader (see the sketch below)
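A minimal sketch of the second option, using torchvision's built-in loader (available since torchvision 0.12); note that even the `train_mini` archive is tens of gigabytes, so check the torchvision docs before downloading.

```python
from torchvision.datasets import INaturalist

# Download and open the 2021 "train_mini" split under ./data.
train_mini = INaturalist(root="data", version="2021_train_mini", download=True)
image, label = train_mini[0]  # PIL image and integer class index
print(len(train_mini), label)
```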
For iNaturalist Website data:
- Use the iNaturalist API (a rough download sketch follows this list):
  - a. List scientific names in `classes.txt` (e.g., Culex pipiens)
  - b. Configure download settings in `download_image.py`
  - c. Run `download_image.py`
- Use a GBIF archive:
  - a. Filter Research Grade observations on GBIF and download the Darwin Core Archive
  - b. Set paths and limits in `download_link.py`
  - c. Run `download_link.py` to extract and download image URLs from the archive
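For the API route, a rough sketch of collecting Research Grade photo URLs through the public iNaturalist v1 observations endpoint is shown below; this is not the project's `download_image.py`, and the response fields and URL rewriting may need adjusting.

```python
import requests

API = "https://api.inaturalist.org/v1/observations"

def fetch_photo_urls(scientific_name: str, n: int = 30) -> list[str]:
    """Return photo URLs for Research Grade observations of one species."""
    params = {
        "taxon_name": scientific_name,
        "quality_grade": "research",
        "photos": "true",
        "per_page": n,
    }
    results = requests.get(API, params=params, timeout=30).json().get("results", [])
    urls = []
    for obs in results:
        for photo in obs.get("photos", []):
            # Thumbnail URLs usually contain "square"; request a larger size.
            urls.append(photo["url"].replace("square", "medium"))
    return urls

if __name__ == "__main__":
    for url in fetch_photo_urls("Culex pipiens"):
        print(url)
```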
Data processing:
- Cleaning: Manually remove blurry or irrelevant images (e.g., rat tracks, larvae, skulls)
- Splitting: Use `split.py` to divide the images into training and validation sets (a minimal split sketch follows this list)
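A minimal sketch of a per-class train/val split, assuming a raw layout of `images/<class_name>/*.jpg`; the paths and split ratio used by the actual `split.py` may differ.

```python
import random
import shutil
from pathlib import Path

SRC, DST, VAL_RATIO = Path("images"), Path("dataset"), 0.2
random.seed(42)

for class_dir in SRC.iterdir():
    if not class_dir.is_dir():
        continue
    files = sorted(class_dir.glob("*.jpg"))
    random.shuffle(files)
    n_val = int(len(files) * VAL_RATIO)
    for i, f in enumerate(files):
        # The first n_val shuffled images go to val/, the rest to train/.
        split = "val" if i < n_val else "train"
        out = DST / split / class_dir.name
        out.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, out / f.name)
```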
This project uses a fine-tuned ConvNeXt-Tiny model pretrained on ImageNet. This model was chosen for its small size and balanced performance among lightweight architectures. The primary goal here is experimental — to validate functionality rather than optimize performance.
Two versions of training code are used:
- `train_model.ipynb`: Written with PyTorch and the `timm` library; developed with assistance from ChatGPT (GPT-4 o3)
- `train_torch.ipynb`: A minimal version built using only PyTorch, written independently for learning purposes

`train_model.ipynb`
This notebook offers better training speed and results. It incorporates techniques such as the following (a minimal sketch appears after the list):
- Freezing the backbone for a few epochs to train a new linear head
- Using CutMix and MixUp
- Label smoothing
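The sketch below is a simplified illustration of these tricks, not the full notebook: the parameter-name prefix `"head"` follows timm's ConvNeXt implementation and may need adjusting, and the alpha values match the 7th attempt in the table below.

```python
import timm
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy

NUM_CLASSES = 20
model = timm.create_model("convnext_tiny", pretrained=True, num_classes=NUM_CLASSES)

# 1) Freeze the backbone for the first few epochs: only the new head trains.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("head")

# 2) MixUp + CutMix with label smoothing via timm's Mixup helper.
mixup_fn = Mixup(mixup_alpha=0.1, cutmix_alpha=0.5,
                 label_smoothing=0.1, num_classes=NUM_CLASSES)
criterion = SoftTargetCrossEntropy()

# Inside the training loop (images, targets come from a DataLoader):
#   images, targets = mixup_fn(images, targets)
#   loss = criterion(model(images), targets)

# 3) After the freeze epochs, unfreeze everything for full fine-tuning:
#   for param in model.parameters():
#       param.requires_grad = True
```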
Below are training results on the small-scale dataset (20 species from iNaturalist 2021):
| Attempt | Train Accuracy | Val Accuracy | Key Adjustments |
|---|---|---|---|
| 1st | 99.9% | 67% | Default settings, initial training |
| 2nd | 97.4% | 70.5% | Learning rate changed from 3e-4 to 1e-4 |
| 3rd | 98.02% | 73.23% | Cleaned dataset (removed blurry, irrelevant images) |
| 4th | 96.25% | 67.68% | Freeze epochs increased from 1 to 3; overfitting occurred around epoch 6 |
| 5th | 88.44% | 73.23% | Freeze epochs to 5; added label smoothing, CutMix, and MixUp |
| 6th | 79.69% | 69.19% | Epochs increased from 12 to 15; plateaued with eventual decline |
| 7th | 83.44% | 70.2% | Epochs reset to 12; MixUp alpha 0.2 → 0.1; CutMix alpha 1.0 → 0.5 |
*Related code is located in the `/predict` directory.*
To enable end-to-end inference from image input to species information output, I implemented APIs for database querying and image classification, as well as an integrated script that combines both.
Relevant code is in `look_up.py`. Main functions/APIs include (a hypothetical usage sketch follows the list):

- `load_database()`: Loads the database and returns a cursor
- `look_up()`: Queries the database for species info based on the classification result
- `format_db_output()`: Formats the returned info as a printable string
- `main()`: A usage example
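```python
# A hypothetical usage sketch of the look_up.py API; the real function
# signatures (arguments, return types) may differ.
from look_up import load_database, look_up, format_db_output

cursor = load_database()                 # assumed to open pests.db and return a cursor
info = look_up(cursor, "Culex pipiens")  # species name produced by the classifier
print(format_db_output(info))
```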
Relevant code is in `predict.py`. Main functions/APIs include (a hypothetical usage sketch follows the list):

- `load_model()`: Loads the trained model
- `get_transforms()`: Returns the preprocessing transforms
- `predict_one()`: Predicts the image label and returns the top-k results
- `main()`: Command-line entry point
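```python
# A hypothetical usage sketch of the predict.py API; argument names and the
# return format are assumptions rather than the script's exact interface.
from predict import load_model, get_transforms, predict_one

model = load_model()        # assumed to load the fine-tuned ConvNeXt-Tiny weights
tfms = get_transforms()
topk = predict_one(model, tfms, "example.jpg")  # e.g. [(label, probability), ...]
print(topk)
```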
Using the wrapped APIs, the script workwork.py allows you to input an image path and receive both the classification result and associated species information.
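A rough sketch of how the two APIs might be chained; the function signatures are assumptions carried over from the sketches above, not the actual structure of `workwork.py`.

```python
import sys

from look_up import load_database, look_up, format_db_output
from predict import load_model, get_transforms, predict_one

def classify_and_look_up(image_path: str) -> str:
    # Classify the image, then look up the predicted species in the database.
    label, _prob = predict_one(load_model(), get_transforms(), image_path)[0]
    return format_db_output(look_up(load_database(), label))

if __name__ == "__main__":
    print(classify_and_look_up(sys.argv[1]))
```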
Usage:
```
python workwork.py <image_path>
```

Example output:
This project is intended for personal learning and exploration. It is licensed under the MIT License.
You are welcome to reference, reuse, suggest improvements, or share ideas!
