This repo builds Docker images and provisioning scripts for Vast.ai GPU templates. Every image inherits shared infrastructure (Supervisor, Instance Portal, Caddy, Jupyter, workspace sync) from vastai/base-image, letting you focus on application-specific setup.
There are three ways to create a template, listed in order of preference:
| Class | When to use | Stability |
|---|---|---|
| Derivative image | Default choice for all new templates | High — dependencies baked in, version-controlled, reproducible |
| External image | Adapting a large, trusted upstream (vLLM, SGLang, Ollama) | High — multi-stage build wrapping upstream with Vast infrastructure |
| Provisioning-only | First drafts and proof-of-concept | Low — runtime installs can break over time. Convert to derivative once validated |
Repository layout:

```
base-image/
├── ROOT/                          # Filesystem overlay (copied into base image as-is)
│   ├── etc/
│   │   └── vast_boot.d/           # Numbered boot scripts (run in order at startup)
│   └── opt/
│       └── supervisor-scripts/
│           └── utils/             # Shared utilities for supervisor scripts
├── derivatives/                   # Derivative images
│   └── pytorch/                   # vastai/pytorch (FROM vastai/base-image)
│       ├── derivatives/
│       │   └── comfyui/           # vastai/comfyui (FROM vastai/pytorch)
│       │       ├── Dockerfile
│       │       └── ROOT/          # App-specific overlay
│       └── provisioning_scripts/  # Runtime-install scripts for pytorch-based templates
├── external/                      # External images (adapted upstream)
│   ├── ollama/
│   ├── sglang/
│   └── vllm/
├── provisioning_scripts/          # Runtime-install scripts for base-image templates
├── portal-aio/                    # Instance Portal build artifacts
├── tools/                         # Build utilities
└── .github/
    ├── AGENTS.md                  # CI/CD conventions (authoritative reference)
    └── workflows/                 # One build-*.yml per image
```
Image hierarchy:

```
vastai/base-image
├── vastai/pytorch
│   ├── vastai/comfyui
│   ├── vastai/ostris
│   └── ...
├── vastai/linux-desktop
├── vastai/llama-cpp
└── vastai/tensorflow
```

External images (vLLM, SGLang, Ollama) start from their upstream base and graft Vast infrastructure on top.
Every image inherits:
- Supervisor — process manager for all services
- Instance Portal — web UI with tabbed access to services
- Caddy — reverse proxy / TLS termination
- Jupyter — notebook server
- Python venv at
/venv/main/ uv— fast Python package installer- Workspace sync —
/opt/workspace-internal/syncs to$WORKSPACE(default/workspace) on volumes - Boot sequence — numbered scripts in
ROOT/etc/vast_boot.d/
Key environment variables:

| Variable | Purpose |
|---|---|
| `WORKSPACE` | User workspace directory (default `/workspace`) |
| `DATA_DIRECTORY` | Alias for `WORKSPACE` |
| `PORTAL_CONFIG` | Defines Instance Portal tabs (see Portal Integration) |
| `PROVISIONING_SCRIPT` | URL of a provisioning script to run at boot |
| `SERVERLESS` | Set to `"true"` to skip portal-dependent services |
| `PROC_NAME` | Set by supervisor — the `[program:name]` value |
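To see these variables in action, the sketch below runs an image locally with the same variables a template would set. This is a hypothetical smoke test, not a command from this repo — the image tag, script URL, and port handling are all illustrative:

```bash
# Hypothetical local smoke test (not from this repo): pass the same
# variables a Vast.ai template would set. Tag and URL are placeholders.
docker run --rm -it \
  -e WORKSPACE=/workspace \
  -e PORTAL_CONFIG="localhost:1111:11111:/:Instance Portal" \
  -e PROVISIONING_SCRIPT="https://raw.githubusercontent.com/org/repo/main/my-app.sh" \
  vastai/base-image:latest
```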
Scripts in `ROOT/etc/vast_boot.d/` run in numeric order:

```
05-configure-cuda.sh             # CUDA setup
10-prep-env.sh                   # Environment preparation
15-hotfix.sh                     # Runtime hotfixes
25-first-boot.sh                 # First boot tasks
35-sync-home-dirs.sh             # Home directory sync
36-sync-workspace.sh             # Workspace sync (workspace-internal → $WORKSPACE)
37-sync-environment.sh           # Environment sync
45-user-write-bashrc.sh          # User shell config
46-user-propagate-ssh-keys.sh    # SSH key propagation
47-user-git-safe-dirs.sh         # Git safe-directory entries
48-venv-backup.sh                # Venv state backup
55-tls-cert-gen.sh               # TLS certificate generation
65-supervisor-launch.sh          # Start supervisor
75-provisioning-script.sh        # Run PROVISIONING_SCRIPT (if set)
95-supervisor-wait.sh            # Wait for supervisor readiness
```
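The dispatcher that executes these scripts is not shown in this excerpt; a minimal sketch of the expected behavior, assuming plain lexical ordering and fail-fast semantics, might look like this:

```bash
# Sketch only — the actual dispatcher lives in the base image and may
# differ (logging, continue-on-error, etc.). Glob expansion sorts
# lexically, so the numeric prefixes give the order shown above.
for script in /etc/vast_boot.d/*.sh; do
    echo "boot: running ${script}"
    bash "${script}" || { echo "boot: ${script} failed" >&2; exit 1; }
done
```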
During boot, the file `/.provisioning` exists. Supervisor scripts should wait for it to be removed before starting their application:

```bash
while [ -f "/.provisioning" ]; do
    echo "$PROC_NAME startup paused until instance provisioning has completed"
    sleep 5
done
```

Derivative images are the preferred approach for all new templates. They use `FROM vastai/pytorch:<tag>` (or `FROM vastai/base-image:<tag>`) and bake all dependencies into the image at build time.
Reference: `derivatives/pytorch/derivatives/comfyui/Dockerfile`
```dockerfile
ARG PYTORCH_BASE=vastai/pytorch:2.9.1-cu128-cuda-12.9-mini-py312
FROM ${PYTORCH_BASE}

# Maintainer details
LABEL org.opencontainers.image.source="https://github.com/vastai/"
LABEL org.opencontainers.image.description="<App> image suitable for Vast.ai."
LABEL maintainer="Vast.ai Inc <contact@vast.ai>"

# Copy Supervisor configuration and startup scripts
COPY ./ROOT /

RUN \
    set -euo pipefail && \
    . /venv/main/bin/activate && \
    # Record pre-install PyTorch version
    torch_version_pre="$(python -c 'import torch; print(torch.__version__)')" && \
    # Install application dependencies
    cd /opt/workspace-internal/ && \
    git clone https://github.com/org/app && \
    cd app && \
    uv pip install --no-cache-dir -r requirements.txt && \
    # Verify PyTorch version unchanged
    torch_version_post="$(python -c 'import torch; print(torch.__version__)')" && \
    [[ "$torch_version_pre" == "$torch_version_post" ]] || \
        { echo "PyTorch version mismatch (wanted ${torch_version_pre} but got ${torch_version_post})"; exit 1; }

# Defend against environment clashes when syncing to volume
RUN \
    set -euo pipefail && \
    env-hash > /.env_hash
```

Dockerfile checklist:

- Always `set -euo pipefail` in RUN commands
- Activate the venv with `. /venv/main/bin/activate`
- Clone into `/opt/workspace-internal/` (auto-synced to `$WORKSPACE` at boot)
- Use `uv pip install` — never plain `pip`
- Verify the PyTorch version before and after dependency installs
- End with `env-hash > /.env_hash` to detect environment drift
- Include the three `LABEL` lines
- `COPY ./ROOT /` to install supervisor scripts and configs
Expected layout:

```
derivatives/pytorch/derivatives/my-app/
├── Dockerfile
└── ROOT/
    ├── etc/
    │   └── supervisor/
    │       └── conf.d/
    │           └── my-app.conf
    └── opt/
        └── supervisor-scripts/
            └── my-app.sh
```
External images are reserved for large, established upstream projects where rebuilding from scratch is impractical (e.g., vLLM, SGLang, Ollama).

Reference: `external/vllm/Dockerfile`
```dockerfile
ARG VLLM_BASE=vllm/vllm-openai:v0.13.0
ARG VAST_BASE=vastai/base-image:stock-ubuntu24.04-py312

FROM ${VAST_BASE} AS base_image_source
FROM ${VLLM_BASE} AS vllm_build

# Maintainer details
LABEL org.opencontainers.image.source="https://github.com/vastai/"
LABEL org.opencontainers.image.description="vLLM image suitable for Vast.ai."
LABEL maintainer="Vast.ai Inc <contact@vast.ai>"

### Convert non-Vast image to more closely resemble images derived from vastai/base-image ###
SHELL ["/bin/bash", "-c"]
ENV DATA_DIRECTORY=/workspace
ENV WORKSPACE=/workspace
ENV PIP_BREAK_SYSTEM_PACKAGES=1
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV PATH=/opt/instance-tools/bin:/opt/sys-venv/shim:${PATH}

WORKDIR /
COPY --from=base_image_source /ROOT /
COPY --from=base_image_source /portal-aio /opt/portal-aio
COPY --from=base_image_source /opt/portal-aio/caddy_manager/caddy /opt/portal-aio/caddy_manager/caddy
COPY --from=base_image_source tools/convert-non-vast-image.sh /tmp/convert-non-vast-image.sh

ARG TARGETARCH
RUN \
    set -euo pipefail && \
    chmod +x /tmp/convert-non-vast-image.sh && \
    /tmp/convert-non-vast-image.sh && \
    rm /tmp/convert-non-vast-image.sh

### Begin app-specific configuration ###
# Copy Supervisor configuration and startup scripts
COPY ./ROOT /

ENTRYPOINT ["/opt/instance-tools/bin/entrypoint.sh"]
CMD []
```

Pattern summary:

- Multi-stage build: pull both the upstream image and `vastai/base-image`
- Run `convert-non-vast-image.sh` to graft Vast infrastructure onto the upstream image
- Copy `portal-aio` and `caddy` from the Vast base
- `COPY ./ROOT /` for app-specific supervisor scripts
- Set `ENTRYPOINT` to Vast's entrypoint
Expected layout:

```
external/my-app/
├── Dockerfile
└── ROOT/
    ├── etc/
    │   ├── supervisor/
    │   │   └── conf.d/
    │   │       └── my-app.conf
    │   └── vast_boot.d/
    │       └── 05-my-app-env.sh    # Set PORTAL_CONFIG and other env vars
    └── opt/
        └── supervisor-scripts/
            └── my-app.sh
```
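A plausible shape for that `05-my-app-env.sh` boot script is sketched below. Writing to `/etc/environment` is an assumption about how values persist for later boot steps (the `environment.sh` utility sources that file in each supervisor script); the exact mechanism may differ in the real images:

```bash
#!/bin/bash
# Hypothetical ROOT/etc/vast_boot.d/05-my-app-env.sh — sets PORTAL_CONFIG
# before supervisor launches. Persisting via /etc/environment is an
# assumption; environment.sh sources that file in supervisor scripts.
if ! grep -q '^PORTAL_CONFIG=' /etc/environment; then
    echo 'PORTAL_CONFIG="localhost:1111:11111:/:Instance Portal|localhost:7860:17860:/:My App"' >> /etc/environment
fi
```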
Provisioning-only templates are for first drafts and rapid prototyping. The script runs at boot time on a stock `vastai/pytorch` (or `vastai/base-image`) instance via the `PROVISIONING_SCRIPT` environment variable.

Warning: provisioning-only templates can break over time (upstream changes, network issues at boot, dependency conflicts). Convert to a derivative image once validated.
Reference: `derivatives/pytorch/provisioning_scripts/qwen3-tts.sh`
```bash
#!/bin/bash
set -euo pipefail

# 1. System dependencies (if needed)
apt-get install --no-install-recommends -y sox

# 2. Activate the shared venv
. /venv/main/bin/activate

# 3. Clone the application
cd "${WORKSPACE}"
[[ ! -d My-App ]] && git clone https://github.com/org/My-App
cd My-App

# 4. Install Python dependencies
uv pip install -r requirements.txt

# 5. Create the supervisor startup script
cat > /opt/supervisor-scripts/my-app.sh << 'EOL'
#!/bin/bash
utils=/opt/supervisor-scripts/utils
. "${utils}/logging.sh"
. "${utils}/cleanup_generic.sh"
. "${utils}/environment.sh"
. "${utils}/exit_portal.sh" "My App"
. /venv/main/bin/activate

echo "Starting My App"
cd "${WORKSPACE}/My-App"
python app.py
EOL
chmod +x /opt/supervisor-scripts/my-app.sh

# 6. Create the supervisor config
cat > /etc/supervisor/conf.d/my-app.conf << 'EOL'
[program:my-app]
environment=PROC_NAME="%(program_name)s"
command=/opt/supervisor-scripts/my-app.sh
autostart=true
autorestart=true
exitcodes=0
startsecs=0
stopasgroup=true
killasgroup=true
stopsignal=TERM
stopwaitsecs=10
stdout_logfile=/dev/stdout
redirect_stderr=true
stdout_events_enabled=true
stdout_logfile_maxbytes=0
stdout_logfile_backups=0
EOL

# 7. Register with supervisor
supervisorctl reread
supervisorctl update
```

Set the `PROVISIONING_SCRIPT` environment variable on the Vast.ai template to the raw URL of the script (e.g., a GitHub raw link). The boot sequence downloads and executes it at step 75.
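To iterate on a script before wiring it into a template, you can also run it by hand on a live instance. A minimal sketch, mimicking what boot step 75 does (the URL is a placeholder):

```bash
# Manual test on a running vastai/pytorch instance: fetch and execute
# the provisioning script directly. URL is a placeholder.
curl -fsSL "https://raw.githubusercontent.com/org/repo/main/my-app.sh" -o /tmp/provision.sh
bash /tmp/provision.sh
```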
Every application runs as a supervised process. You need two files: a startup script and a config.
Startup script template:

```bash
#!/bin/bash
utils=/opt/supervisor-scripts/utils
. "${utils}/logging.sh"
. "${utils}/cleanup_generic.sh"
. "${utils}/environment.sh"
. "${utils}/exit_portal.sh" "App Name"
. /venv/main/bin/activate

while [ -f "/.provisioning" ]; do
    echo "$PROC_NAME startup paused until instance provisioning has completed"
    sleep 5
done

echo "Starting App Name"
cd "${WORKSPACE}/my-app"
python app.py
```

Each sourced utility does one thing:
| Utility | What it does |
|---|---|
| `logging.sh` | Redirects stdout/stderr to `/var/log/portal/${PROC_NAME}.log` (tee'd) |
| `cleanup_generic.sh` | Sets a trap to kill all subprocesses on EXIT/INT/TERM |
| `environment.sh` | Sources `/etc/environment` and `${WORKSPACE}/.env` |
| `exit_portal.sh "<name>"` | Waits for `/etc/portal.yaml`; exits if the app isn't listed (user toggled it off) |
| `exit_serverless.sh` | Exits if `$SERVERLESS` is `"true"` (skips non-essential services in serverless mode) |
Source order matters. Always source in this order: `logging.sh`, `cleanup_generic.sh`, `environment.sh`, then `exit_portal.sh` or `exit_serverless.sh`.
Supervisor config template:

```ini
[program:my-app]
environment=PROC_NAME="%(program_name)s"
command=/opt/supervisor-scripts/my-app.sh
autostart=true
autorestart=true
exitcodes=0
startsecs=0
stopasgroup=true
killasgroup=true
stopsignal=TERM
stopwaitsecs=10
stdout_logfile=/dev/stdout
redirect_stderr=true
stdout_events_enabled=true
stdout_logfile_maxbytes=0
stdout_logfile_backups=0
```

The `environment=PROC_NAME="%(program_name)s"` line is required — it sets the `PROC_NAME` variable used by the utilities.
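When iterating on a service inside a running instance, the standard `supervisorctl` commands cover most debugging needs:

```bash
supervisorctl status            # list programs and their current states
supervisorctl tail -f my-app    # follow the program's stdout stream
supervisorctl restart my-app    # restart after editing the startup script
supervisorctl reread && supervisorctl update   # pick up new/changed .conf files
```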
The Instance Portal provides tabbed browser access to services. It is configured via the `PORTAL_CONFIG` environment variable:

```
PORTAL_CONFIG="localhost:internal:external:path:Label|localhost:internal:external:path:Label|..."
```

Fields are colon-separated; entries are pipe-separated:
| Field | Example | Description |
|---|---|---|
| Host | `localhost` | Always `localhost` |
| Internal port | `7860` | Port the app listens on inside the container |
| External port | `17860` | Port exposed to the user (convention: internal + 10000) |
| Path | `/` | URL path for the tab link |
| Label | `My App` | Tab label shown in the portal |
PORTAL_CONFIG="localhost:1111:11111:/:Instance Portal|localhost:7860:17860:/:My App|localhost:8080:18080:/:Jupyter"The exit_portal.sh utility checks whether the app's label appears in /etc/portal.yaml. If a user removes the entry, the supervisor script exits gracefully — this is how users toggle apps on and off.
The `exit_portal.sh` utility checks whether the app's label appears in `/etc/portal.yaml`. If a user removes the entry, the supervisor script exits gracefully — this is how users toggle apps on and off.

```bash
# From exit_portal.sh — searches for the app name in portal config
if ! grep -qiE "^[^#].*${search_term}" /etc/portal.yaml; then
    echo "Skipping ${PROC_NAME} startup (not in /etc/portal.yaml)"
    sleep 6
    exit 0
fi
```

Python environment conventions:

- Single shared venv at `/venv/main/` — all applications share it
- Always use `uv pip install` — never plain `pip`
- PyTorch installs must target a concrete backend: `uv pip install torch torchvision torchaudio --torch-backend cu128`
- Never use `--torch-backend=auto` — it can select the wrong backend and produce non-reproducible builds
- In derivative Dockerfiles, verify the PyTorch version before and after installing dependencies to catch accidental overwrites (a quick check is sketched below)
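A quick sanity check after a PyTorch install; the version strings in the comments are examples, not guaranteed values:

```bash
# Verify the installed torch build matches the intended backend.
. /venv/main/bin/activate
python - <<'PY'
import torch
print(torch.__version__)          # e.g. 2.9.1+cu128
print(torch.version.cuda)         # e.g. 12.8
print(torch.cuda.is_available())  # True only with a visible GPU
PY
```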
Full CI/CD conventions are in `.github/AGENTS.md`. Key points:

- Every image has a workflow at `.github/workflows/build-<name>.yml`
- 4-job pipeline: `preflight` → `build` → `collect-tags` → `notify`
- The `build` job always uses a matrix strategy (even for single-variant images)
- Tags with commit hashes include an ISO 8601 date (`v1-a1b2c3d-2026-02-02-cuda-12.9`); version tags do not
Every workflow follows the same structure:

- File named `.github/workflows/build-<name>.yml`
- Schedule cron: `'0 0,12 * * *'`
- `workflow_dispatch` inputs: VERSION/REF, DOCKERHUB_REPO, MULTI_ARCH, CUSTOM_IMAGE_TAG
- `env` block: DEFAULT_DOCKERHUB_REPO, DEFAULT_MULTI_ARCH, RELEASE_AGE_THRESHOLD
- 4 jobs: `preflight` → `build` → `collect-tags` → `notify`
- `build` uses `strategy.matrix` and derives `MATRIX_ID` via `md5sum | cut -c1-8` (see the sketch after this list)
- `collect-tags` and `notify` jobs are copied verbatim from an existing workflow
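The `MATRIX_ID` derivation amounts to hashing the matrix values into a short, stable identifier. The exact input string is workflow-specific, so the following is a sketch:

```bash
# Sketch: derive an 8-character ID from the matrix values. The exact
# string fed to md5sum varies per workflow; this input is an example.
matrix_values="pytorch=2.9.1 cuda=12.9 py=312"
MATRIX_ID="$(echo -n "${matrix_values}" | md5sum | cut -c1-8)"
echo "MATRIX_ID=${MATRIX_ID}"
```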
See `.github/AGENTS.md` for the full specification, including code templates.
To add a new derivative image:

- Create `derivatives/pytorch/derivatives/<name>/Dockerfile`
- Create `derivatives/pytorch/derivatives/<name>/ROOT/opt/supervisor-scripts/<name>.sh`
- Create `derivatives/pytorch/derivatives/<name>/ROOT/etc/supervisor/conf.d/<name>.conf`
- Follow the Dockerfile pattern — `FROM`, labels, `COPY ./ROOT /`, venv activate, install, PyTorch verify, `env-hash`
- Follow the supervisor script template and config template
- Add a CI workflow at `.github/workflows/build-<name>.yml` (see CI/CD)
- Test locally with `docker build` (see the sketch after this list)
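A minimal local build, assuming the app directory is the build context so `COPY ./ROOT /` resolves; the tag and build-arg value are illustrative:

```bash
# Build with the app directory as context so ./ROOT resolves.
# PYTORCH_BASE mirrors the ARG in the Dockerfile pattern above;
# the tag and value here are illustrative.
docker build \
  --build-arg PYTORCH_BASE=vastai/pytorch:2.9.1-cu128-cuda-12.9-mini-py312 \
  -t my-dockerhub-user/my-app:dev \
  derivatives/pytorch/derivatives/my-app/
```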
To add a new external image:

- Create `external/<name>/Dockerfile`
- Create `external/<name>/ROOT/opt/supervisor-scripts/<name>.sh`
- Create `external/<name>/ROOT/etc/supervisor/conf.d/<name>.conf`
- Create `external/<name>/ROOT/etc/vast_boot.d/05-<name>-env.sh` to set `PORTAL_CONFIG`
- Follow the external Dockerfile pattern — multi-stage build, `convert-non-vast-image.sh`, `COPY ./ROOT /`
- Add a CI workflow at `.github/workflows/build-<name>.yml`
To add a new provisioning-only template:

- Create `derivatives/pytorch/provisioning_scripts/<name>.sh` (or `provisioning_scripts/<name>.sh` for base-image)
- Follow the complete provisioning pattern — deps, venv, clone, install, supervisor script, supervisor config, `supervisorctl reread && update`
- Test by setting `PROVISIONING_SCRIPT` to the raw URL on a Vast.ai instance
- Plan to convert to a derivative image once the template is validated