This repository presents a flexible and modular framework for semantic segmentation using various transformer-based, convolutional and hybrid backbones. It supports multiple attention mechanisms, decoder designs, loss functions, and real-world datasets (urban and off-road), enabling thorough experimentation and benchmarking.
Introduction and Goals · Project Structure · Architecture Details · Datasets · Installation and Usage · Model Attributes Overview · Results · Contribution · License
Semantic segmentation is a core task for autonomous vehicle perception systems. In real-world applications, the ability to understand and classify each region of an image is crucial for safe and autonomous navigation, especially in unstructured environments.
In off-road scenarios, perception systems must deal with:
🔹 Irregular terrains and unstructured surfaces (mud, grass, rubble).
🔹 Ambiguous class boundaries and visual noise.
🔹 High intra-class variability and illumination changes.
🔹 Severe class imbalance in available datasets (e.g., RELLIS-3D, RUGD).
Transformers have emerged as powerful alternatives to traditional CNNs, offering better global context modeling and higher segmentation accuracy. However, their behavior in off-road segmentation tasks is still underexplored.
To validate the effectiveness of the proposed framework and architectures in real-world off-road scenarios, we conducted extensive experiments using the RELLIS-3D dataset. Quantitative and qualitative results, including per-class performance metrics and visual segmentation outputs, are presented in the Results section below.
🔹 Evaluate transformer-based segmentation backbones under off-road conditions.
🔹 Build a modular and configurable framework for segmentation research.
🔹 Allow easy switching between backbones, decoders, losses, and attention types.
🔹 Provide a reproducible baseline using the RELLIS-3D dataset.
🔹 Facilitate benchmarking and experimentation for off-road autonomous navigation.
🔹 Report and analyze experimental results obtained using the RELLIS-3D dataset, providing both quantitative benchmarks and qualitative visual comparisons.
The repository is organized as follows:
SEGMENT_NET/
│
├── cfg/ # Configuration files (.ini) for training/testing
├── env/ # Environment setup scripts
├── experiments/ # Configuration files used for the article's experiments
├── images/ # Architecture diagrams, mosaic images, and confusion matrix files
├── logs/ # Training logs
├── params/ # Model parameters, checkpoints, or derived configs
├── utils/ # Helper functions and support scripts
│
├── class_weights.py # Class balancing script
├── run.py # Main pipeline script
├── run_all.sh # Shell script to enable automatic training sequences
├── visualize_modelANDconfusion_matrices.ipynb # Notebook for plotting confusion matrices, visualizing inference, and saving outputs to files
├── .gitignore
├── LICENSE
└── README.md
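The `class_weights.py` script handles class balancing; its actual implementation may differ, but a minimal sketch of median-frequency balancing, a common choice for heavily imbalanced segmentation datasets such as RELLIS-3D, looks like:

```python
import numpy as np

def median_frequency_weights(label_maps, num_classes):
    """Median-frequency balancing: weight_c = median_freq / freq_c.

    `label_maps` is an iterable of integer label arrays (H x W).
    Classes absent from the data receive weight 0.
    """
    pixel_counts = np.zeros(num_classes, dtype=np.int64)
    image_counts = np.zeros(num_classes, dtype=np.int64)  # total pixels of images containing class c
    for mask in label_maps:
        for c in np.unique(mask):
            pixel_counts[c] += np.sum(mask == c)
            image_counts[c] += mask.size
    freq = np.divide(pixel_counts, image_counts,
                     out=np.zeros(num_classes, dtype=np.float64),
                     where=image_counts > 0)
    median_freq = np.median(freq[freq > 0])
    # Rare classes (low freq) get proportionally larger weights.
    return np.divide(median_freq, freq,
                     out=np.zeros(num_classes, dtype=np.float64),
                     where=freq > 0)
```

The resulting weight vector can be passed, for example, to a weighted cross-entropy loss so that rare classes contribute more to the gradient.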
| Component | Available options |
|---|---|
| CNN backbones | `resnet18`, `mobilenetv3`, `efficientnetb0`, `deeplabv3_mobilenetv3`, `convnextv2` |
| Transformer backbones | `deit3_small`, `sam2_hiera`, `pitxs`, `segformerb0`, `levit`, `tinyvit`, `fastvit` |
| Hybrid backbones | `mobilevit`, `efficientformer`, `maxxvitv2`, `edgenext` |
| Feature fusion | `sum`, `concat`, `weighted_sum`, `max_pool` |
| Attention | `none`, `spatial`, `query`, `class_channel`, `se_channel`, `se_conv` |
| Upsampling | `interp`, `depthwise_nn`, `transformer` |
| Loss functions | `dice`, `focal_dice`, `cross_entropy`, `focal_cross_entropy`, `lovasz_softmax`, `boundary_dice`, `hausdorff_dt_dice` |
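The configurable options above are combined through the `.ini` files in `cfg/`. The exact section and key names are defined by those files; the fragment below is only an illustrative sketch:

```ini
[model]
type = segformerb0        ; backbone
attention = se_channel    ; attention mechanism
fusion = concat           ; feature-fusion strategy

[training]
mode = train
dataset = rellis3d
loss = focal_dice
batch_size = 32
```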
This segmentation framework offers built-in support for several standard public datasets, covering both urban and off-road scenarios, enabling fair comparison and flexible experimentation. Configuration files (e.g. in cfg/) can simply reference any of these datasets to run training, validation, or testing.
| Dataset | Environment | Annotations | Key Features |
|---|---|---|---|
| A2D2 (Audi Autonomous Driving Dataset) | Diverse urban + highway (Germany) | RGB + LiDAR + 2D/3D semantic masks | ~41k segmented images (41 labels); includes both semantic and 3D box annotations. Multi-sensor platform with 5 LiDARs and 6 cameras |
| RELLIS‑3D | Off‑road tracks and terrain | RGB images + LiDAR scans with per‑pixel labels | 6,235 labeled frames (from ~13k synchronized LiDAR+camera); 19 semantic classes including grass, sky, rubble, and vehicle. Strong focus on class imbalance and irregular terrain |
| RUGD (Robot Unstructured Ground Driving) | Natural off‑road (trails, parks, creeks, villages) | RGB images with semantic masks | 7,436 images and 24 classes (e.g., tree, fence, vehicle, puddle, gravel, concrete). Split: ~4,779 train / 1,964 test / 733 val |
| GOOSE (German Outdoor and Offroad Dataset) | Unstructured outdoor robotics environments | RGB + NIR images and annotated point clouds | 10 000+ paired image and LiDAR frames; supports fine-grained class ontology over unstructured terrain. Includes open-source tools and evaluation challenges |
| BDD100K (Berkeley DeepDrive) | Diverse urban drives (USA) | 1280×720 RGB with pixel-level segmentation | 10 semantic/instance segmentation classes (e.g., car, pedestrian, truck); ~10K annotated images; diverse weather, lighting, and traffic scenes |
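Several of these datasets annotate masks with non-contiguous raw label IDs. A common preprocessing step (the framework's own loaders may differ) is a lookup-table remap to contiguous train IDs; the mapping below is a placeholder, not the real RELLIS-3D ontology:

```python
import numpy as np

# Hypothetical mapping from raw annotation IDs to contiguous train IDs.
# 255 is used as the ignore index for unlabeled pixels.
RAW_TO_TRAIN = {0: 255, 1: 0, 3: 1, 4: 2, 5: 3}

def remap_mask(mask, raw_to_train, ignore_index=255):
    """Vectorized remap of an (H, W) integer mask via a lookup table."""
    lut = np.full(int(mask.max()) + 1, ignore_index, dtype=np.int64)
    for raw_id, train_id in raw_to_train.items():
        if raw_id < lut.size:
            lut[raw_id] = train_id
    return lut[mask]  # fancy indexing applies the remap per pixel
```

A lookup table is preferable to a Python loop over pixels: the remap becomes a single vectorized indexing operation regardless of image size.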
This project requires a Python 3.8 environment with PyTorch 2.4.1 and CUDA 12.1. All core dependencies and libraries are installed via a single automated setup script. The main packages used include:
- `torch==2.4.1`, `torchvision==0.19.1`, `torchaudio==2.4.1`
- `albumentations`, `transformers`, `monai`, `timm`
- OpenCV, Matplotlib, Scikit-learn, Plotly, Dash
- PyTorch Geometric (with CUDA wheels)
Ensure you have Anaconda/Miniconda installed and a compatible NVIDIA GPU with CUDA 12.1 support for full functionality.
To automatically install the required packages and create the environment:
```bash
bash env/create_env.sh
```

This will:
✅ Create a conda environment named pytorch-env
✅ Install Python 3.8.20, PyTorch 2.4.1, TorchVision, TorchAudio (CUDA 12.1)
✅ Install required libraries via pip (Albumentations, MONAI, Transformers, etc.)
✅ Install PyTorch Geometric with precompiled CUDA wheels
Then activate the environment:
```bash
conda activate pytorch-env
```

1. **Choose a config file.** Use one of the `.ini` files in the `cfg/` folder (e.g., `rellis3d_dev.ini`).
2. **Edit the configuration.** Update the following parameters inside the file:
   - `type` (backbone type, e.g., `segformerb0`)
   - `mode` (`train`, `resume`, or `test`)
   - `dataset` (e.g., `rellis3d`, `rugd`, `bdd100k`)
   - Optionally: loss function, learning rate, attention, batch size, etc.
**Training:**

```bash
python run.py --cfg cfg/rellis3d_dev.ini
```

Ensure `mode = train` in the `.ini` file.

**Resuming:**

```bash
python run.py --cfg cfg/rellis3d_dev.ini
```

Set `mode = resume` to continue from the last checkpoint.

**Testing:**

```bash
python run.py --cfg cfg/rellis3d_dev.ini
```

Set `mode = test` in the configuration file.
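Internally, `run.py` presumably reads the `.ini` file with something like Python's `configparser`; the sketch below illustrates the pattern, but the real argument handling and section/key names may differ:

```python
import argparse
import configparser

def load_cfg(path):
    """Read an .ini experiment file into a ConfigParser object."""
    cfg = configparser.ConfigParser()
    with open(path) as f:  # fail loudly if the file is missing
        cfg.read_file(f)
    return cfg

def main(argv=None):
    parser = argparse.ArgumentParser(description="Segmentation pipeline (sketch)")
    parser.add_argument("--cfg", required=True, help="Path to .ini config file")
    args = parser.parse_args(argv)
    cfg = load_cfg(args.cfg)
    # Section and key names are assumptions; the real files may differ.
    mode = cfg.get("training", "mode", fallback="train")
    return mode  # dispatch to train / resume / test from here
```

Using `read_file` on an opened handle, rather than `ConfigParser.read`, raises an immediate error for a mistyped path instead of silently returning an empty config.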
| Model | Batch size | # Params (M) | Inference time (ms) | Feature vector length |
|---|---|---|---|---|
| MobileViT | 32 | 19.22 | 0.0245 | 6 |
| MaxViT | 16 | 116.62 | 0.0031 | 5 |
| EfficientFormer | 16 | 5.26 | 0.0132 | 4 |
| TinyViT | 32 | 12.41 | 0.0007 | 5 |
| SegFormer | 32 | 4.53 | 0.0029 | 4 |
| PiT | 64 | 12.30 | 0.0015 | 4 |
| SAM 2 | 32 | 28.37 | 0.0280 | 5 |
| FastViT | 32 | 11.84 | 0.0007 | 5 |
| EdgeNeXt | 32 | 38.99 | 0.0008 | 5 |
| Models | sky | grass | tree | bush | concrete | mud | person | puddle | rubble | barrier | log | fence | vehicle | object | pole | water | asphalt | building | mIoU | Acc (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MobileViT | 0.961 [best] | 0.862 | 0.695 | 0.675 | 0.753 | 0.300 | 0.771 [best] | 0.595 | 0.297 | 0.214 | 0.000 | 0.297 [best] | 0.355 [best] | 0.245 [best] | 0.106 [best] | 0.000 [worst] | 0.000 [worst] | 0.104 [best] | 0.402 | 89.68 |
| MaxViT | 0.961 | 0.879 [best] | 0.736 | 0.727 | 0.764 | 0.343 [best] | 0.639 | 0.664 | 0.498 [best] | 0.204 | 0.000 | 0.205 | 0.244 | 0.121 | 0.106 | 0.230 [best] | 0.006 | 0.032 | 0.409 [best] | 91.02 [best] |
| EfficientFormer | 0.954 | 0.854 | 0.708 | 0.674 | 0.745 | 0.319 | 0.561 | 0.560 | 0.088 | 0.183 | 0.000 | 0.185 | 0.179 | 0.003 | 0.061 | 0.000 | 0.004 | 0.044 | 0.340 | 89.42 |
| TinyViT | 0.953 | 0.871 | 0.706 | 0.698 | 0.739 | 0.333 | 0.423 [worst] | 0.710 [best] | 0.345 | 0.150 [worst] | 0.000 | 0.005 [worst] | 0.112 | 0.000 [worst] | 0.064 | 0.000 | 0.000 | 0.007 [worst] | 0.340 | 90.08 |
| SegFormer | 0.961 | 0.846 | 0.748 [best] | 0.639 | 0.753 | 0.295 | 0.626 | 0.633 | 0.210 | 0.164 | 0.000 | 0.173 | 0.239 | 0.002 | 0.083 | 0.000 | 0.019 | 0.042 | 0.357 | 89.38 |
| PiT | 0.952 [worst] | 0.864 | 0.697 | 0.678 | 0.700 [worst] | 0.291 [worst] | 0.505 | 0.625 | 0.136 | 0.181 | 0.000 | 0.144 | 0.104 [worst] | 0.018 | 0.038 | 0.000 | 0.003 | 0.029 | 0.331 | 89.40 |
| SAM 2 | 0.960 | 0.719 [worst] | 0.666 [worst] | 0.545 [worst] | 0.783 [best] | 0.302 | 0.540 | 0.443 [worst] | 0.319 | 0.207 | 0.000 | 0.098 | 0.128 | 0.000 | 0.000 [worst] | 0.000 | 0.067 [best] | 0.026 | 0.322 [worst] | 83.90 [worst] |
| FastViT | 0.956 | 0.873 | 0.723 | 0.709 | 0.716 | 0.322 | 0.502 | 0.646 | 0.027 [worst] | 0.191 | 0.000 | 0.127 | 0.128 | 0.038 | 0.000 | 0.000 | 0.000 | 0.012 | 0.332 | 90.38 |
| EdgeNeXt | 0.955 | 0.878 | 0.743 | 0.732 [best] | 0.762 | 0.311 | 0.582 | 0.708 | 0.373 | 0.229 [best] | 0.000 | 0.203 | 0.212 | 0.127 | 0.074 | 0.000 | 0.002 | 0.016 | 0.384 | 91.01 |
The mosaic below presents representative RGB inputs, ground-truth masks, and the predictions of the three top-performing models: MobileViT, MaxViT, and EdgeNeXt. Placing these outputs side by side allows a direct comparison of how each architecture handles fine-grained details, preserves small objects, and avoids common segmentation pitfalls.
This visual comparison reveals typical failure modes such as boundary misalignment, omission of small or thin structures, and texture-related artifacts, while also illustrating each model's strengths. Mosaic-style layouts are widely used in the semantic segmentation literature as an effective way to summarize qualitative performance across multiple classes and scene types.
Below are the row-normalized confusion matrices for the three best-performing architectures (MobileViT, MaxViT, and EdgeNeXt) and the worst-performing model (SAM 2). The values are normalized per true class (row), meaning that each row sums to 1. This allows direct interpretation of recall rates: darker diagonal cells indicate higher correct predictions for that class, while off-diagonal cells show confusion with other classes.
Arranging them in a 2×2 mosaic enables an immediate visual comparison of how the strongest and weakest models differ in their ability to recognize specific classes under the same test set conditions.
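The row normalization described above, along with the per-class IoU values reported in the results table, can be reproduced from a raw confusion matrix in a few lines (a numpy sketch; the notebook's own code may differ):

```python
import numpy as np

def normalize_rows(cm):
    """Row-normalize a confusion matrix so each true-class row sums to 1.

    The diagonal of the result is the per-class recall.
    """
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums,
                     out=np.zeros(cm.shape, dtype=np.float64),
                     where=row_sums > 0)

def per_class_iou(cm):
    """IoU per class: TP / (TP + FP + FN), read off the confusion matrix."""
    tp = np.diag(cm).astype(np.float64)
    denom = cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm)
    return np.divide(tp, denom, out=np.zeros_like(tp), where=denom > 0)
```

The mIoU column of the results table is then simply the mean of `per_class_iou(cm)` over all classes.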
*Figure: 2×2 mosaic of row-normalized confusion matrices for MobileViT, MaxViT, EdgeNeXt, and SAM 2.*
Contributions are welcome! If you have suggestions, feature requests, or improvements, feel free to:
- Open an Issue
- Submit a Pull Request
This project is licensed under the MIT License. See the LICENSE file for details.




