A comprehensive demonstration of deep learning for computer vision using PyTorch, Docker, and modern MLOps practices. This project trains a Convolutional Neural Network (CNN) on the CIFAR-10 dataset with production-ready training practices.
- Overview
- Features
- Prerequisites
- Quick Start
- Project Structure
- Usage
- Training Details
- Monitoring with TensorBoard
- Docker Hub Deployment
- Model Serving with TorchServe
- Results
- Troubleshooting
- Learning Resources
This project demonstrates:
- Deep Learning: CNN architecture for image classification
- PyTorch Best Practices: Mixed precision training, gradient clipping, learning rate scheduling
- Containerization: Docker for reproducible environments
- Experiment Tracking: TensorBoard for visualization
- MLOps: Automated testing, checkpointing, and model deployment
The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes:
- ✈️ Airplane
- 🚗 Automobile
- 🐦 Bird
- 🐱 Cat
- 🦌 Deer
- 🐕 Dog
- 🐸 Frog
- 🐴 Horse
- 🚢 Ship
- 🚚 Truck
Split: 50,000 training images + 10,000 test images
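As a quick illustration, the dataset can be pulled down with torchvision (the project's model/loader.py handles this for you; the snippet below is a minimal sketch, not the project's exact code):

```python
import torchvision
from torchvision import transforms

# torchvision downloads CIFAR-10 into ./data and exposes the standard split
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transforms.ToTensor())
test_set = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=transforms.ToTensor())

print(len(train_set), len(test_set))  # 50000 10000
```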
- ✅ Mixed Precision Training (AMP) for faster computation
- ✅ OneCycleLR Scheduler for optimal learning rate scheduling
- ✅ Gradient Clipping to prevent exploding gradients
- ✅ Label Smoothing for better generalization
- ✅ Early Stopping with patience to prevent overfitting
- ✅ Comprehensive Checkpointing with state recovery
- ✅ Data Augmentation (RandomCrop, RandomHorizontalFlip, Normalization)
- 🐳 Docker Support for reproducible environments
- 📊 TensorBoard Integration for real-time monitoring
- 💾 Automatic Model Saving (best model + last checkpoint)
- 🔄 Docker Compose for easy orchestration
- 📦 UV Package Manager for fast dependency management
- 🚀 TorchServe Integration for production model serving
- 🔌 REST API for real-time inference
- Docker (version 20.10+)
- Docker Compose (version 2.0+)
- Optional: Docker Hub account (for publishing)
No Python installation required! Everything runs in Docker containers.
- Clone the repository

  git clone https://github.com/semilleroCV/demo-docker.git
  cd demo-docker

- Train the model

  docker compose up train

- Test the model

  docker compose up test
# Build the image
docker build -t cifar10-training .
# Run training
docker run --rm -v $(pwd)/checkpoints:/app/checkpoints \
-v $(pwd)/runs:/app/runs \
-v $(pwd)/data:/app/data \
cifar10-training python train.py
# Run testing
docker run --rm -v $(pwd)/checkpoints:/app/checkpoints \
-v $(pwd)/data:/app/data \
cifar10-training python test.py
cv_demo/
├── 📄 train.py # Training script with best practices
├── 📄 test.py # Evaluation script
├── 📄 compare_improvements.py # Compare training approaches
├── 📂 model/ # Model architecture and utilities
│ ├── __init__.py
│ ├── net.py # CNN architecture definition
│ └── loader.py # Data loading and preprocessing
├── 📂 checkpoints/ # Saved model checkpoints
│ ├── best_model.pth # Best model (highest val accuracy)
│ ├── last_model.pth # Latest checkpoint
│ └── cifar_net.pth # Legacy format (compatibility)
├── 📂 data/ # CIFAR-10 dataset (auto-downloaded)
├── 📂 runs/ # TensorBoard logs
├── 📦 model-store/ # TorchServe model archives (.mar files)
├── 📂 torchserve-config/ # TorchServe configuration
│ └── config.properties # Server settings
├── 🐳 Dockerfile # Container definition
├── 🐳 compose.yml # Docker Compose configuration
├── 📄 pyproject.toml # Python dependencies
├── 📄 cifar10_handler.py # TorchServe custom handler
├── 📜 create_model_archive.sh # Model packaging script
├── 📜 test_torchserve.py # TorchServe testing utility
└── 📜 publish_docker.sh # Docker Hub publishing script
The training script implements state-of-the-art practices:
docker compose up train

Training Configuration (see train.py):
- Batch Size: 128
- Epochs: 100 (with early stopping)
- Optimizer: AdamW (lr=3e-4, weight_decay=0.01)
- Scheduler: OneCycleLR with cosine annealing
- Loss: CrossEntropyLoss with label smoothing (0.1)
- Regularization: Dropout (0.3) + L2 regularization
Features:
- Automatic mixed precision training (AMP)
- Gradient clipping (max_norm=1.0)
- Comprehensive checkpointing with full state recovery
- Early stopping (patience=15 epochs)
- Real-time logging to TensorBoard
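Putting those pieces together, a single training epoch looks roughly like the sketch below (a minimal illustration assuming the model, data loader, optimizer, scheduler, and GradScaler from the configuration above; the exact code lives in train.py):

```python
import torch
from torch import nn

def train_one_epoch(model, loader, optimizer, scheduler, scaler, device):
    """One epoch with AMP, gradient clipping, and per-batch LR stepping."""
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    model.train()
    running_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        # Mixed precision forward pass (a no-op on CPU)
        with torch.cuda.amp.autocast(enabled=(device == "cuda")):
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()
        # Unscale first so clipping sees the true gradient norm
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()  # OneCycleLR is stepped once per batch
        running_loss += loss.item() * images.size(0)
    return running_loss / len(loader.dataset)
```

Early stopping and checkpointing wrap this loop: whenever validation accuracy improves, the model is saved to checkpoints/best_model.pth, and training stops after 15 epochs without improvement.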
Evaluate the trained model on the test set:
docker compose up test

Output includes:
- Overall accuracy
- Per-class accuracy breakdown
- Sample predictions visualization
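The per-class breakdown can be computed with a loop like the following (a sketch assuming a standard test DataLoader; test.py may differ in details):

```python
import torch

CLASSES = ("airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck")

@torch.no_grad()
def per_class_accuracy(model, loader, device):
    """Accuracy per CIFAR-10 class, as percentages."""
    correct = {c: 0 for c in CLASSES}
    total = {c: 0 for c in CLASSES}
    model.eval()
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        for label, pred in zip(labels.tolist(), preds.tolist()):
            total[CLASSES[label]] += 1
            correct[CLASSES[label]] += int(label == pred)
    return {c: 100.0 * correct[c] / max(total[c], 1) for c in CLASSES}
```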
If you prefer to run locally:
# Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync
# Run training
uv run python train.py
# Run testing
uv run python test.py

Custom CNN with modern design:
Conv2d(3, 32, 3x3) → BatchNorm → ReLU → MaxPool(2x2)
Conv2d(32, 64, 3x3) → BatchNorm → ReLU → MaxPool(2x2)
Conv2d(64, 128, 3x3) → BatchNorm → ReLU → MaxPool(2x2)
Flatten
Linear(2048 → 512) → BatchNorm → ReLU → Dropout(0.3)
Linear(512 → 256) → BatchNorm → ReLU → Dropout(0.3)
Linear(256 → 10)
Total Parameters: ~2.7M trainable parameters
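In PyTorch terms, the stack above corresponds roughly to the following module (an illustrative sketch; the actual definition is in model/net.py):

```python
import torch.nn as nn

class CIFAR10Net(nn.Module):
    """Illustrative sketch of the layer sequence above."""
    def __init__(self, num_classes: int = 10, dropout: float = 0.3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32),
            nn.ReLU(inplace=True), nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64),
            nn.ReLU(inplace=True), nn.MaxPool2d(2),   # 16x16 -> 8x8
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128),
            nn.ReLU(inplace=True), nn.MaxPool2d(2),   # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 512), nn.BatchNorm1d(512),
            nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.BatchNorm1d(256),
            nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```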
Training transforms:
- Random Crop (32x32 with padding=4)
- Random Horizontal Flip (p=0.5)
- Normalization (ImageNet stats)
Test transforms:
- Normalization only
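With torchvision these pipelines look like the sketch below (the exact values live in model/loader.py; the ImageNet statistics shown are the commonly used ones):

```python
import torchvision.transforms as transforms

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# Training: augmentation + normalization
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Test: normalization only
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```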
OneCycleLR strategy:
- Warmup (30% of training): lr increases from 3e-6 to 3e-3
- Decay (70% of training): lr decreases to 3e-10 with cosine annealing
This provides:
- Fast initial convergence
- Escape from local minima
- Fine-tuning in later epochs
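Mapped onto PyTorch's OneCycleLR, the schedule above corresponds roughly to the following (a sketch assuming the AdamW optimizer, train_loader, and num_epochs are already defined; div_factor and final_div_factor are inferred from the endpoints listed above):

```python
import torch

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-3,
    total_steps=len(train_loader) * num_epochs,
    pct_start=0.3,            # 30% warmup
    anneal_strategy="cos",    # cosine annealing during the decay phase
    div_factor=1e3,           # initial lr = 3e-3 / 1e3 = 3e-6
    final_div_factor=1e4,     # final lr  = 3e-6 / 1e4 = 3e-10
)
```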
View training progress in real-time using the integrated TensorBoard service:
# Start TensorBoard (runs in background)
docker compose --profile monitoring up -d tensorboard
# Now run training (TensorBoard will update in real-time)
docker compose up train
# Stop TensorBoard when done
docker compose --profile monitoring down

# Start TensorBoard manually
docker run --rm -p 6006:6006 \
-v $(pwd)/runs:/app/runs \
tensorflow/tensorflow:latest \
tensorboard --logdir=/app/runs --host=0.0.0.0

Open http://localhost:6006 in your browser to see:
- Scalars: Training/validation loss and accuracy curves
- Graphs: Model architecture visualization
- Distributions: Weight and gradient histograms
- Time Series: Learning rate schedule
- Images: Sample predictions (if configured)
Pro Tip: Start TensorBoard before training to see metrics update in real-time!
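These metrics come from standard SummaryWriter calls in the training loop, roughly like this sketch (tag names and variables are illustrative, not the exact ones in train.py):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/cifar10")  # mounted as /app/runs in Docker
for epoch in range(num_epochs):
    # train_loss, val_loss, val_acc come from the training/validation steps
    writer.add_scalar("Loss/train", train_loss, epoch)
    writer.add_scalar("Loss/val", val_loss, epoch)
    writer.add_scalar("Accuracy/val", val_acc, epoch)
    writer.add_scalar("LearningRate", optimizer.param_groups[0]["lr"], epoch)
writer.close()
```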
Publish your trained model to Docker Hub:
# Make script executable
chmod +x publish_docker.sh
# Run publishing script
./publish_docker.sh [version]
# Example: publish as version v1.0
./publish_docker.sh v1.0
# Or publish as latest
./publish_docker.sh

The script will:
- ✅ Check Docker authentication
- ✅ Build the image
- ✅ Tag appropriately (converts username to lowercase)
- ✅ Optional: Test locally
- ✅ Push to Docker Hub
Pull published image:
docker pull <your-username>/cifar10-training:latest

Deploy your trained model as a production REST API using TorchServe!
After training, package your model for TorchServe:
# Make script executable
chmod +x create_model_archive.sh
# Create .mar file (model archive)
./create_model_archive.sh

This creates model-store/cifar10_classifier.mar - a packaged model ready for deployment.
# Start TorchServe server
docker compose --profile serving up -d torchserve
# Check if server is running
curl http://localhost:8080/ping

# Register the model
curl -X POST "http://localhost:8081/models?url=cifar10_classifier.mar"
# Verify registration
curl http://localhost:8081/models

Using Python script:
# Install requests library
pip install requests
# Make prediction
python test_torchserve.py --image path/to/image.jpg
# Check model status
python test_torchserve.py --status
# List all models
python test_torchserve.py --list

Using curl:
# Predict from image file
curl -X POST http://localhost:8080/predictions/cifar10_classifier \
-T path/to/image.jpg
# Get prediction with detailed response
curl -X POST http://localhost:8080/predictions/cifar10_classifier \
-T path/to/image.jpg | jq

| Endpoint | Port | Purpose |
|---|---|---|
| /predictions/{model_name} | 8080 | Inference API |
| /models | 8081 | Model management |
| /metrics | 8082 | Prometheus metrics |
| /ping | 8080 | Health check |
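From Python, a prediction is a single POST of the raw image bytes to the inference endpoint (a minimal sketch; test_torchserve.py adds argument parsing and the status/list helpers, and the image path is a placeholder):

```python
import requests

# Send the raw image bytes to the inference API (port 8080)
with open("path/to/image.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8080/predictions/cifar10_classifier",
        data=f.read(),
    )
response.raise_for_status()
print(response.json())
```

A successful request returns the predicted class plus the top confidences as JSON: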
[
{
"predicted_class": "airplane",
"confidence": 0.8523,
"predictions": [
{"class": "airplane", "confidence": 0.8523},
{"class": "ship", "confidence": 0.0892},
{"class": "bird", "confidence": 0.0341}
]
}
]

# Scale up workers for better performance
curl -X PUT "http://localhost:8081/models/cifar10_classifier?min_worker=2&max_worker=4"
# Check worker status
curl http://localhost:8081/models/cifar10_classifier

# Stop the serving stack when you are done
docker compose --profile serving down

With the implemented best practices:
- Validation Accuracy: 75-82%
- Training Time: ~30-60 minutes (CPU) / ~10-15 minutes (GPU)
- Convergence: Typically in 40-60 epochs
Epoch 45/100
------------------------------------------------------------
Loss: 0.324 | Acc: 88.54% | GradNorm: 0.847
Validation - Loss: 0.512 | Accuracy: 81.23%
✓ Best model saved with accuracy: 81.23%
Per-Class Accuracy:
============================================================
airplane : 84.2% (842/1000)
automobile: 88.1% (881/1000)
bird : 73.5% (735/1000)
cat : 68.9% (689/1000)
deer : 77.8% (778/1000)
dog : 75.3% (753/1000)
frog : 87.6% (876/1000)
horse : 86.4% (864/1000)
ship : 89.2% (892/1000)
truck : 87.3% (873/1000)
============================================================
1. Out of Memory (OOM)
# Reduce batch size in train.py
BATCH_SIZE = 64  # Instead of 128

2. Docker Build Fails
# Clear Docker cache and rebuild
docker compose build --no-cache train

3. Permission Denied on publish_docker.sh
chmod +x publish_docker.sh

4. TensorBoard Not Loading
# Ensure runs directory exists and has data
ls -la runs/

5. Model Not Found During Testing
# Verify checkpoint exists
ls -la checkpoints/
# Train first if no checkpoints exist
docker compose up train

- Bag of Tricks for Image Classification
- Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
- When Does Label Smoothing Help?
This is an educational project. Suggestions and improvements are welcome!
MIT License - Feel free to use for educational purposes.
Computer Vision Seminar - University Demo Project
Happy Learning! 🎓
For questions or issues, please open a GitHub issue or contact the course instructor.