Official Pytorch implementation of MonoLoT: Self-Supervised Monocular Depth Estimation in Low-Texture Scenes for Automatic Robotic Endoscopy.
Qi HE, Guang Feng, Sophia Bano, Danail Stoyanov, Siyang Zuo
[IEEE Xplore] [YouTube] [GitHub]
Our research has introduced an innovative approach that addresses the challenges associated with self-supervised monocular depth estimation in digestive endoscopy. We have addressed two critical aspects: the limitations of self-supervised depth estimation in low-texture scenes and the application of depth estimation in visual servoing for digestive endoscopy. Our investigation has revealed that the struggles of self-supervised depth estimation in low-texture scenes stem from inaccurate photometric reconstruction losses. To overcome this, we have introduced the point-matching loss, which refines the reprojected points. Furthermore, during the training process, data augmentation is achieved through batch image shuffle loss, significantly improving the accuracy and generalisation capability of the depth model. The combined contributions of the point matching loss and batch image shuffle loss have boosted the baseline accuracy by a minimum of 5% on both the C3VD and SimCol datasets, surpassing the generalisability of ground truth depth-supervised baselines when applied to upper-GI datasets. Moreover, the successful implementation of our robotic platform for automatic intervention in digestive endoscopy demonstrates the practical and impactful application of monocular depth estimation technology.
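The point-matching loss itself is implemented in the training code of this repository; the snippet below is only a minimal sketch of the idea for orientation. The function name, tensor shapes, the use of pre-computed pixel correspondences, and the L1 penalty are all illustrative assumptions, not the repository's implementation.

```python
import torch

def point_matching_loss(pts_src, pts_tgt, depth_src, K, T_src_to_tgt):
    """Illustrative sketch of a point-matching loss (not the repo's exact code).

    pts_src, pts_tgt : (N, 2) matched pixel coordinates in the source/target frames
    depth_src        : (N,)   predicted depth at the source pixels
    K                : (3, 3) camera intrinsics
    T_src_to_tgt     : (4, 4) relative camera pose (source -> target)
    """
    n = pts_src.shape[0]
    ones = torch.ones(n, 1)
    # Back-project source pixels to 3-D camera coordinates using the predicted depth.
    pix_h = torch.cat([pts_src, ones], dim=1)                        # (N, 3)
    cam_pts = (torch.inverse(K) @ pix_h.T) * depth_src.unsqueeze(0)  # (3, N)
    cam_pts_h = torch.cat([cam_pts, ones.T], dim=0)                  # (4, N)
    # Transform into the target frame and project with the intrinsics.
    tgt_cam = (T_src_to_tgt @ cam_pts_h)[:3]                         # (3, N)
    proj = K @ tgt_cam
    proj = (proj[:2] / proj[2:3].clamp(min=1e-6)).T                  # (N, 2)
    # Penalise the distance between reprojected points and their matched target points.
    return torch.abs(proj - pts_tgt).mean()
```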
We tested our code on a server running Ubuntu 18.04.6 with CUDA 11.1 and GCC 7.5.0.
- Clone the project:
$ git clone https://github.com/howardchina/MonoLoT.git
$ cd MonoLoT
- Install the environment:
$ conda create --name monolot --file requirements.txt
$ conda activate monolot
First, create a data/ folder inside the project directory:
$ mkdir data
The data structure will be organised as follows:
$ tree data
data
├── c3vd_v2
│   ├── imgs -> <c3vd_v2_img_dir>
│   ├── matcher_results
│   ├── test.npy
│   ├── train.npy
│   └── val.npy
└── simcol_complete
    ├── imgs -> <simcol_img_dir>
    ├── matcher_results
    ├── test_352x352.npy
    ├── train_352x352.npy
    └── val_352x352.npy
...
Second, some image preprocessing is necessary, such as undistortion, static-frame filtering, and data splitting.
Taking c3vd_v2 for instance, run the following notebooks:
- undistort frames: playground\heqi\C3VD\data_preprocessing.ipynb (an illustrative sketch follows below)
- (optional) filter static frames and create the data split yourself: playground\heqi\C3VD\gen_split.ipynb
- (optional) generate matcher_results yourself: playground\heqi\C3VD\gen_corres.ipynb
(optional) Similar image preprocessing should be applied to simcol_complete as well; check playground\heqi\Simcol_complete.
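The data_preprocessing.ipynb notebook handles the undistortion for C3VD; as a rough illustration only, undistorting a single frame with OpenCV might look like the following. The intrinsics, distortion coefficients, and file paths here are placeholders, and the notebook may use a different camera model than this simple pinhole-plus-distortion example.

```python
import cv2
import numpy as np

# Placeholder intrinsics and distortion coefficients -- substitute the calibration
# shipped with the dataset; these values are NOT the real C3VD calibration.
K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 540.0],
              [0.0, 0.0, 1.0]])
dist = np.array([0.1, -0.05, 0.0, 0.0, 0.0])

img = cv2.imread("data/c3vd_v2/imgs/cecum_t1_a_under_review/0000_color.png")
undistorted = cv2.undistort(img, K, dist)
cv2.imwrite("0000_color_undistorted.png", undistorted)
```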
We provide two ways to obtain the matching results saved in the matcher_results folders:
- (recommended) download matcher_results for c3vd_v2 and simcol_complete from here
- (not recommended) generate matcher_results yourself using the notebooks mentioned above, such as playground\heqi\C3VD\gen_corres.ipynb; this process takes about 2-4 hours (a generic matching sketch is shown below).
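The gen_corres.ipynb notebook produces the correspondences that the point-matching loss consumes. Purely for intuition, a generic OpenCV feature-matching sketch is shown below; ORB with brute-force Hamming matching is an arbitrary illustrative choice and may differ from the matcher actually used in the notebook, and the frame filenames are hypothetical.

```python
import cv2

# Hypothetical pair of neighbouring frames from one sequence.
img0 = cv2.imread("0000_color.png", cv2.IMREAD_GRAYSCALE)
img1 = cv2.imread("0001_color.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
kp0, des0 = orb.detectAndCompute(img0, None)
kp1, des1 = orb.detectAndCompute(img1, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des0, des1), key=lambda m: m.distance)

# Matched pixel coordinates, e.g. to be stored under a matcher_results/ folder.
pts0 = [kp0[m.queryIdx].pt for m in matches]
pts1 = [kp1[m.trainIdx].pt for m in matches]
```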
Soft-link (->) the well-prepared image folders into this workspace.
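For example, assuming the raw image folders have already been downloaded and unpacked to <c3vd_v2_img_dir> and <simcol_img_dir>, the links can be created with:
$ ln -s <c3vd_v2_img_dir> data/c3vd_v2/imgs
$ ln -s <simcol_img_dir> data/simcol_complete/imgs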
The image folder of c3vd_v2 (<c3vd_v2_img_dir>) will be organised as follows:
$ cd data/c3vd_v2
$ tree imgs
imgs
├── cecum_t1_a_under_review
│   ├── 0000_color.png
│   ├── 0000_depth.tiff
│   ├── 0001_color.png
│   ├── 0001_depth.tiff
│   ├── 0002_color.png
│   ├── 0002_depth.tiff
│   ├── 0003_color.png
│   ├── 0003_depth.tiff
│   ├── 0004_color.png
│   ├── 0004_depth.tiff
│   └── ...
├── cecum_t1_b_under_review
│   └── ...
├── cecum_t2_a_under_review
├── cecum_t2_b_under_review
├── cecum_t2_c_under_review
├── cecum_t3_a_under_review
├── cecum_t4_a_under_review
├── cecum_t4_b_under_review
├── desc_t4_a_under_review
├── sigmoid_t1_a_under_review
├── sigmoid_t2_a_under_review
├── sigmoid_t3_a_under_review
├── sigmoid_t3_b_under_review
├── trans_t1_a_under_review
├── trans_t1_b_under_review
├── trans_t2_a_under_review
├── trans_t2_b_under_review
├── trans_t2_c_under_review
├── trans_t3_a_under_review
├── trans_t3_b_under_review
├── trans_t4_a_under_review
└── trans_t4_b_under_review
The image folder of simcol_complete (<simcol_img_dir>) will be organised as follows:
$ cd ..
$ cd simcol_complete
$ tree imgs
imgs
├── SyntheticColon_I
│   ├── Test_labels
│   │   ├── Frames_S10
│   │   │   ├── Depth_0000.png
│   │   │   ├── Depth_0001.png
│   │   │   ├── Depth_0002.png
│   │   │   ├── Depth_0003.png
│   │   │   ├── Depth_0004.png
│   │   │   ├── Depth_0005.png
│   │   │   ...
│   │   │   ├── Depth_1200.png
│   │   │   ├── FrameBuffer_0000.png
│   │   │   ├── FrameBuffer_0001.png
│   │   │   ├── FrameBuffer_0002.png
│   │   │   ├── FrameBuffer_0003.png
│   │   │   ├── FrameBuffer_0004.png
│   │   │   ├── FrameBuffer_0005.png
│   │   │   ...
│   │   │   └── FrameBuffer_1200.png
│   │   ├── Frames_S15
│   │   └── Frames_S5
│   ├── Train
│   │   ├── Frames_S1
│   │   ├── Frames_S11
│   │   ├── Frames_S12
│   │   ├── Frames_S13
│   │   ├── Frames_S2
│   │   ├── Frames_S3
│   │   ├── Frames_S6
│   │   ├── Frames_S7
│   │   └── Frames_S8
│   └── Val
│       ├── Frames_S14
│       ├── Frames_S4
│       └── Frames_S9
├── SyntheticColon_II
│   ├── Test_labels
│   │   ├── Frames_B10
│   │   ├── Frames_B15
│   │   └── Frames_B5
│   ├── Train
│   │   ├── Frames_B1
│   │   ├── Frames_B11
│   │   ├── Frames_B12
│   │   ├── Frames_B13
│   │   ├── Frames_B2
│   │   ├── Frames_B3
│   │   ├── Frames_B6
│   │   ├── Frames_B7
│   │   └── Frames_B8
│   └── Val
│       ├── Frames_B14
│       ├── Frames_B4
│       └── Frames_B9
└── SyntheticColon_III
    ├── Test_labels
    │   ├── Frames_O1
    │   ├── Frames_O2
    │   └── Frames_O3
    └── Train
For both training and inference, use the C3VD and SimCol datasets prepared as described above.
For inference only:
- The UpperGI dataset is available in Nutstore.
- The EndoSLAM dataset is available in Mendeley.
- The EndoMapper dataset is available in Synapse.
For training, we provide the following config files in the experiments folder:
Table IV: train and test on C3VD (configs in the experiments\c3vd_v2\ folder; D = ground-truth depth supervision, M = self-supervised monocular training)

| Method | Train | Config |
| --- | --- | --- |
| Monodepth2 (baseline) | D | supervised_c3vd_v2_monodepth2.yml |
| Lite-mono (baseline) | D | supervised_c3vd_v2_litemono.yml |
| MonoViT (baseline) | D | supervised_c3vd_v2_monovit.yml |
| Monodepth2 (baseline) | M | baseline_c3vd_v2_monodepth2.yml |
| Lite-mono (baseline) | M | baseline_c3vd_v2_litemono.yml |
| MonoViT (baseline) | M | baseline_c3vd_v2_monovit.yml |
| Monodepth2 + ours | M | RCC_matching_cropalign_c3vd_v2_monodepth2.yml |
| Lite-mono + ours | M | RC_matching_c3vd_v2_litemono.yml |
| MonoViT + ours | M | RC_matching_c3vd_v2_monovit.yml |
Table V: train and test on SimCol (configs in the experiments\simcol_complete\ folder; D = ground-truth depth supervision, M = self-supervised monocular training)

| Method | Train | Config |
| --- | --- | --- |
| Monodepth2 (baseline) | D | supervised_simcol_complete_monodepth2.yml |
| Lite-mono (baseline) | D | supervised_simcol_complete_litemono.yml |
| MonoViT (baseline) | D | supervised_simcol_complete_monovit.yml |
| Monodepth2 (baseline) | M | baseline_simcol_complete_monodepth2.yml |
| Lite-mono (baseline) | M | baseline_simcol_complete_litemono.yml |
| MonoViT (baseline) | M | baseline_simcol_complete_monovit.yml |
| Monodepth2 + ours | M | RCC_cropalign_matching_simcol_complete_monodepth2.yml |
| Lite-mono + ours | M | RCC_cropalign_matching_simcol_complete_litemono.yml |
| MonoViT + ours | M | RC_matching_simcol_complete_monovit.yml |
Table VI: ablation study on C3VD (configs in the experiments\ablation_c3vd_v2\ folder; the RC, RCC + crop align, and Matching columns follow the components named in each config file)

| Model | RC | RCC + crop align | Matching | Config |
| --- | --- | --- | --- | --- |
| Monodepth2 (baseline) | | | | baseline_c3vd_v2_monodepth2.yml |
| | ✓ | | | RC_baseline_c3vd_v2_monodepth2.yml |
| | | ✓ | | RCC_cropalign_c3vd_v2_monodepth2.yml |
| | | | ✓ | matching_c3vd_v2_monodepth2.yml |
| | ✓ | | ✓ | RC_matching_c3vd_v2_monodepth2.yml |
| | | ✓ | ✓ | RCC_matching_cropalign_c3vd_v2_monodepth2.yml |
| Lite-mono (baseline) | | | | baseline_c3vd_v2_litemono.yml |
| | ✓ | | | RC_baseline_c3vd_v2_litemono.yml |
| | | ✓ | | RCC_cropalign_c3vd_v2_litemono.yml |
| | | | ✓ | matching_c3vd_v2_litemono.yml |
| | ✓ | | ✓ | RC_matching_c3vd_v2_litemono.yml |
| | | ✓ | ✓ | RCC_cropalign_matching_c3vd_v2_litemono.yml |
| MonoViT (baseline) | | | | baseline_c3vd_v2_monovit.yml |
| | ✓ | | | RC_baseline_c3vd_v2_monovit.yml |
| | | ✓ | | RCC_cropalign_c3vd_v2_monovit.yml |
| | | | ✓ | matching_c3vd_v2_monovit.yml |
| | ✓ | | ✓ | RC_matching_c3vd_v2_monovit.yml |
| | | ✓ | ✓ | RCC_cropalign_matching_c3vd_v2_monovit.yml |
Table VII: ablation study on SimCol (configs in the experiments\ablation_simcol_complete\ folder; the RC, RCC + crop align, and Matching columns follow the components named in each config file)

| Model | RC | RCC + crop align | Matching | Config |
| --- | --- | --- | --- | --- |
| Monodepth2 (baseline) | | | | baseline_simcol_complete_monodepth2.yml |
| | ✓ | | | RC_simcol_complete_monodepth2.yml |
| | | ✓ | | RCC_cropalign_simcol_complete_monodepth2.yml |
| | | | ✓ | matching_simcol_complete_monodepth2.yml |
| | ✓ | | ✓ | RC_matching_simcol_complete_monodepth2.yml |
| | | ✓ | ✓ | RCC_cropalign_matching_simcol_complete_monodepth2.yml |
| Lite-mono (baseline) | | | | baseline_simcol_complete_litemono.yml |
| | ✓ | | | RC_simcol_complete_litemono.yml |
| | | ✓ | | RCC_cropalign_simcol_complete_litemono.yml |
| | | | ✓ | matching_simcol_complete_litemono.yml |
| | ✓ | | ✓ | RC_matching_simcol_complete_litemono.yml |
| | | ✓ | ✓ | RCC_cropalign_matching_simcol_complete_litemono.yml |
| MonoViT (baseline) | | | | baseline_simcol_complete_monovit.yml |
| | ✓ | | | RC_baseline_simcol_complete_monovit.yml |
| | | ✓ | | RCC_cropalign_simcol_complete_monovit.yml |
| | | | ✓ | matching_simcol_complete_monovit.yml |
| | ✓ | | ✓ | RC_matching_simcol_complete_monovit.yml |
| | | ✓ | ✓ | RCC_cropalign_matching_simcol_complete_monovit.yml |
Run them with:
CUDA_VISIBLE_DEVICES=0 python run.py --config experiments/<file_name>.yml --cfg_params '{"model_name": "<exp_name>"}' --seed 1243 --gpu 0 --num_workers 4
The code will automatically run training.
- Training runs are recorded in the results folder.
- Log files are saved to results/<exp_name>/logs and can be monitored with TensorBoard (see the command below).
- Checkpoints are saved to results/<exp_name>/models.
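For example, assuming TensorBoard is installed in the environment, the training curves of a run can be watched with:
$ tensorboard --logdir results/<exp_name>/logs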
Evaluation:
- Please refer to playground\heqi\eval_c3vd.ipynb and playground\heqi\eval_simcol_complete_align_with_paper.ipynb (a generic sketch of the standard metrics follows below).
- Existing checkpoints are available at Nutstore.
- If you would like to further visualise these models, our visualisation code is also provided for reference at Nutstore.
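The evaluation notebooks contain the exact protocol; for orientation only, the standard monocular depth error metrics (abs rel, sq rel, RMSE, and the δ < 1.25 accuracy) are typically computed along the following lines, with median scaling for self-supervised predictions. Treat this as a generic sketch rather than the notebooks' implementation.

```python
import numpy as np

def depth_metrics(gt, pred, median_scale=True):
    """Generic monocular depth metrics (not necessarily identical to the notebooks)."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    if median_scale:
        # Self-supervised models predict depth only up to scale; align by median ratio.
        pred = pred * np.median(gt) / np.median(pred)
    thresh = np.maximum(gt / pred, pred / gt)
    a1 = (thresh < 1.25).mean()
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse, "a1": a1}
```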
- Qi HE: howard@tju.edu.cn
If you find our work helpful, please consider citing:
@ARTICLE{10587075,
author={He, Qi and Feng, Guang and Bano, Sophia and Stoyanov, Danail and Zuo, Siyang},
journal={IEEE Journal of Biomedical and Health Informatics},
title={MonoLoT: Self-Supervised Monocular Depth Estimation in Low-Texture Scenes for Automatic Robotic Endoscopy},
year={2024},
pages={1-14},
keywords={Estimation;Endoscopes;Training;Data models;Robots;Feature extraction;Image reconstruction;Monocular depth estimation;automatic intervention;digestive endoscopy},
doi={10.1109/JBHI.2024.3423791}}

This work is licensed under CC BY-NC-SA 4.0.
- We thank the authors of C3VD for their excellent work.
- We thank the authors of SimCol3D for their excellent work.
- We thank the authors of EndoSLAM for their excellent work.
- We thank the authors of EndoMapper for their excellent work.

