DATASET

The dataset was constructed from a variety of public data sources detailed below. Each image was manually reviewed and a total of 12,288 images per class were randomly selected: 10,240 train, 1,024 test, and 1,024 validate. Images are licensed by their originating institutions under some form of Creative Commons license (CC0, CC-BY, CC-BY-NC, or CC-BY-SA).

Category	AK	ASU	BHL	BR	C	CAS	CHNDM	COLO	E	F	FMNH	GH	K	KY	L	LY	MA	MCZ	MICH	MO	MPU	MZH	Met	NCU	NHMD	NHMO	NMR	NY	O	P	RSA	SDNHM	TEX	TRH	TTU	TU	Tw	UA	UHIM	UMMZ	US	YPM	YU	Total
Animal specimens	858	807	0	0	0	858	0	0	0	0	857	0	0	857	0	0	0	848	0	0	0	857	0	0	417	857	857	0	0	0	0	505	0	0	388	857	0	857	91	676	0	840	1	12,288
Biocultural specimens	1	0	0	0	1,272	0	714	0	0	3,026	0	0	1,925	0	47	1	0	0	0	141	0	0	5,157	0	0	0	0	3	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	12,288
Corrupted images	116	0	0	419	113	240	0	104	387	943	0	362	0	0	310	702	70	0	593	151	215	0	0	246	0	0	0	2,992	498	1,633	483	0	55	138	0	0	0	0	0	0	1,071	0	447	12,288
Fragmentary pressed specimens	44	0	0	1,132	0	294	0	50	209	1,632	0	242	0	0	1,418	1,268	57	0	81	11	152	0	0	299	0	0	0	1,502	99	1,909	222	0	84	156	0	0	0	0	0	0	1,398	0	29	12,288
Illustrations color	1	0	11,153	218	0	1	0	1	16	48	0	1	28	0	3	6	0	0	0	39	10	0	0	0	0	0	0	711	0	41	10	0	0	0	0	0	0	0	0	0	1	0	0	12,288
Illustrations gray	3	0	7,553	3,278	0	5	0	1	177	299	0	4	0	0	67	4	41	0	1	26	91	0	0	0	0	0	0	57	2	599	71	0	3	0	0	0	0	0	0	0	4	0	2	12,288
Live plants	1,630	0	0	909	0	9	0	13	1,638	530	0	3	3	0	10	41	11	0	7	1,631	7	0	0	257	0	0	0	1,630	10	1,630	53	0	120	507	0	0	0	0	0	0	1,631	0	8	12,288
Micrographs electron	0	0	0	0	0	3	0	0	6	0	0	0	0	0	0	0	1	0	0	1	1	0	0	0	0	0	0	3,864	0	48	0	0	1	0	0	0	2,301	0	0	0	6,061	0	1	12,288
Micrographs reflected light	118	0	0	582	70	15	0	72	178	2,078	0	315	0	0	20	566	13	0	642	98	253	0	0	25	0	0	0	2,371	768	2,914	4	0	0	148	0	0	71	0	0	0	121	0	846	12,288
Micrographs transmission light	5	0	0	0	0	0	0	0	1	0	0	0	4,577	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	0	69	0	0	0	0	0	0	7,609	0	0	0	25	0	0	12,288
Microscope slides	0	0	0	0	0	0	0	0	0	0	0	0	1,354	0	10,934	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	12,288
Mixed pressed specimens	12	0	0	292	0	864	0	40	1,259	300	0	1,235	0	0	154	803	16	0	158	3	88	0	0	39	0	0	0	2,177	553	1,444	575	0	14	30	0	0	0	0	0	0	1,156	0	1,076	12,288
Occluded specimens	8	0	0	178	300	38	0	34	143	275	0	114	0	0	983	1,874	21	0	1,676	1	28	0	0	612	0	0	0	2,384	545	1,783	171	0	13	401	0	0	0	0	0	0	630	0	76	12,288
Ordinary pressed specimens	116	0	0	418	114	240	0	104	387	943	0	362	0	0	310	702	70	0	593	151	215	0	0	246	0	0	0	2,992	498	1,625	483	0	55	138	0	0	0	0	0	0	1,079	0	447	12,288
Specimen reproductions	2	0	0	622	2	1	0	10	651	9,824	0	8	0	0	87	34	4	0	213	4	23	0	0	4	0	0	0	116	3	50	454	0	167	2	0	0	0	0	0	0	6	0	1	12,288
Text focused	59	0	0	751	0	143	0	223	208	19	0	13	2	0	69	67	276	0	198	536	629	0	0	443	0	0	0	2,201	457	1,388	94	0	26	41	0	0	1,372	0	0	0	2,993	0	80	12,288
Unpressed specimens	341	0	0	159	157	63	0	118	27	1,733	0	15	763	0	203	4	1	0	99	67	9	0	0	29	0	0	0	4,838	1	102	1,409	0	203	26	0	0	1,810	0	0	0	103	0	8	12,288
Total	3,314	807	18,706	8,958	2,028	2,774	714	770	5,287	21,650	857	2,674	8,652	857	14,615	6,072	581	848	4,261	2,860	1,721	857	5,157	2,200	417	857	857	27,840	3,434	15,235	4,029	505	742	1,587	388	857	13,163	857	91	676	16,279	840	3,022	208,896

IMAGES

TensorFlow Record (.tfr) files containing 96² pixel images (JPEG format) are available for both the main dataset [1.3G] and the additional GBIF specimen survey test data [54M].

Category	Index
Animal	0
Biocultural	1
Corrupted	2
Fragmentary	3
Color illustration	4
Grayscale illustration	5
Live plant	6
Electron Micrograph	7
Reflected Light Micrograph	8
Transmission Light Micrograph	9
Microscope slide	10
Mixed	11
Occluded	12
Ordinary	13
Reproduction	14
Text	15
Unpressed	16

CODE

Code was developed and tested using TensorFlow 2.13.0 official Docker images.

The ConvNeXt-T can be created by saying:

MODEL="ConvNeXt-T.keras"
SEED=89178525
SIZE=224
./convnext.py -a 21000 -b 3 -e 4 -f gelu -i $SIZE -n 3:3:9:3 -o $MODEL -r $SEED -x 96 ### 43,762,344 parameters

Distillation of the fully–trained 3D-OFDB-21k DeiT published by Nakamura et al. (2023) onto the ConvNeXt-T is accomplished by saying:

DIR="ConvNeXt"
GPU=0
mkdir -p $DIR
./distillOptimizeImagesR.py -a 21000 -e 32 -g $GPU -i $SIZE -m 3DOFDB21kViTB16-224-TF -o $DIR -r $SEED -S $SIZE -s $MODEL -t 3D-OFDB-21k-224-train-microcosm.tfr -v 3D-OFDB-21k-224-test.tfr -l manual/

Transfer learning for the pretrained ConvNeXt-T with the Herbariograph dataset is performed by saying:

LAST=$(ls -ltr $DIR/*/best-model.keras | awk -F"/" "{print \$2}" | tail -1)
LAYERS=("output_gap" "convNeXt3_downSample_layerNormalization" "convNeXt2_downSample_layerNormalization" "convNeXt1_downSample_layerNormalization")
MODEL="tune-model.keras"
REFERENCE=$DIR"-classifier.keras"
SIZE=96
./convnext.py -a 17 -b 3 -e 4 -f gelu -i $SIZE -n 3:3:9:3 -o $REFERENCE -r $SEED -x 96 ### 27,626,417 parameters
./convnextClassifier.py -e $DIR"/"$LAST"/best-model.keras" -m $REFERENCE -o $DIR"/"$LAST"/"$MODEL
for LAYER in "${LAYERS[@]}"; do
   MODEL=$(if [ "${LAYERS[0]}" == $LAYER ]; then echo $MODEL; else echo "best-model.keras"; fi)
   ./trainImagesC.py -a 17 -b 256 -d 0.00005 -e 4096 -f ce+clr+aw -g $GPU -i $SIZE -l 0.00005 -m $DIR"/"$LAST"/"$MODEL -o $DIR -r $SEED -s $LAYER -t herbariograph-96-train.tfr -v herbariograph-96-validation.tfr
   LAST=$(ls -ltr $DIR/*/best-model.keras | awk -F"/" "{print \$2}" | tail -1)
done
./trainImagesC4096.py -a 17 -b 256 -d 0.00005 -e 4096 -f ce+clr+aw -g $GPU -i $SIZE -l 0.00005 -m $DIR"/"$LAST"/"$MODEL -o $DIR -Q -r $SEED -s rescale -t herbariograph-96-train.tfr -v herbariograph-96-validation.tfr
LAST=$(ls -ltr $DIR/*/best-model.keras | awk -F"/" "{print \$2}" | tail -1 | perl -pe "s/-best/-intermediate/")
./imageSoup.py -a 17 -b 256 -d herbariograph-96-train.tfr -g $GPU -m $DIR"/"$LAST -o $DIR -T 60 -v herbariograph-96-validation.tfr

The final model can be evaluated by saying:

LAST=$(ls -ltr $DIR/*/soup-model.keras | awk -F"/" "{print \$2}" | tail -1)
./testImages.py -a 17 -b 256 -g $GPU -i $SIZE -m $DIR"/"$LAST"/soup-model.keras" -t herbariograph-96-test.tfr
# Test loss: 0.8199
# Test accuracy: 96.11%
# Test AUCPR: 98.19%
# Test macro F1: 96.11%

USING THE TRAINED MODELS

Both a distilled version (ConvNeXt-N [4M]) of the best performing model and the published model (ConvNeXt-T [106M]) are available. Unknown images can be categorized by saying:

### download models (only needed the first time)
wget https://github.com/dpl10/herbariograph/raw/refs/heads/main/ConvNeXt-distilled.keras 
### infer
UNKNOWN="image-directory"
OUTPUT="file.tsv"
./predictImages.py -b 1024 -g $GPU -d $UNKNOWN -m ConvNeXt-N -o $OUTPUT -p 4

CITATION

If you use the Herbariograph dataset or models in your work, please cite us.

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
manual		manual
.gitignore		.gitignore
ConvNeXt-distilled.keras		ConvNeXt-distilled.keras
LICENSE		LICENSE
LayerScale.py		LayerScale.py
README.md		README.md
convnext.py		convnext.py
convnextClassifier.py		convnextClassifier.py
corruptions.py		corruptions.py
dhashr.py		dhashr.py
distillOptimizeImagesR.py		distillOptimizeImagesR.py
downloadResizeImage.py		downloadResizeImage.py
imageSoup.py		imageSoup.py
lm2crop.py		lm2crop.py
predictImages.py		predictImages.py
testImages.py		testImages.py
trainImagesC.py		trainImagesC.py
trainImagesC4096.py		trainImagesC4096.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DATASET

IMAGES

CODE

USING THE TRAINED MODELS

CITATION

About

Uh oh!

Releases 1

Packages

Languages

License

dpl10/herbariograph

Folders and files

Latest commit

History

Repository files navigation

DATASET

IMAGES

CODE

USING THE TRAINED MODELS

CITATION

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages