Git–GCS Artifacts is a lightweight command-line system for managing large files and directories in Google Cloud Storage (GCS) while keeping Git repositories small, fast, and reproducible.
Instead of committing large binaries into Git history, the tools upload artifacts to GCS and store small pointer files with the .gcs suffix in the repository. Those pointers are versioned in Git and contain the metadata needed to restore the original artifacts later.
This is intended for workflows involving datasets, model checkpoints, simulation outputs, intermediate results, and other large binary assets that do not belong in normal Git history.
The toolchain provides five commands:
- git_gcs_artifacts: The backend engine. This is the command the wrappers call internally.
- gcsinit: Initializes GCS support in the current repository.
- gcspush: Uploads a file or directory to GCS and creates a .gcs pointer file next to it.
- gcspull: Restores files or directories from .gcs pointer files.
- gcsstatus: Shows current pointer files and repository artifact status.
The workflow is intentionally simple:
- You run gcspush on a file or directory.
- The artifact is uploaded to GCS.
- A .gcs pointer file is written next to the original path.
- .gitignore is updated so the large artifact itself is not tracked.
- The pointer file is committed and pushed to Git by default.
- On another machine or clone, gcspull restores the artifact from the pointer.
This keeps the repository lightweight while preserving exact artifact references.
- Works in any Git repository
- Automatic per-repo bucket creation
- Pointer files stored next to original artifact paths
- Supports both files and directories
- Recursive push mode with a size threshold
- Default behavior is upload + commit + push
- Recursive mode uses a single batch commit and a single push
- Directory restore includes per-file SHA256 verification
- No repo-specific scripts are required after installation
By default, the bucket name is derived from the Git remote:
git-<org>-<repo>
Examples:
git-zeroknowledgediscovery-zebra-open
git-myorg-myrepo
If the Git remote cannot be parsed, the fallback bucket is:
git-local-<repo>
The bucket is created automatically if it does not already exist.
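As a rough illustration of the naming rule, the sketch below derives a bucket name from a remote URL. The function name bucket_name_from_remote, the handling of SSH and HTTPS remotes, and the lowercasing step are assumptions made for this example; the tool's own parsing may differ.

```shell
# Hypothetical helper (not part of the toolchain): build git-<org>-<repo>
# from a remote URL, or git-local-<repo> when the URL cannot be parsed.
# $1 = remote URL, $2 = repository name to use for the fallback.
bucket_name_from_remote() {
  url=$1
  fallback_repo=$2
  # Drop trailing slashes and a trailing ".git".
  path=$(printf '%s\n' "$url" | sed -e 's#/*$##' -e 's#\.git$##')
  case $path in
    */*)
      repo=${path##*/}      # last path component
      rest=${path%/*}       # everything before it
      org=${rest##*[/:]}    # component after the last "/" or ":"
      name="git-$org-$repo"
      ;;
    *)
      name="git-local-$fallback_repo"
      ;;
  esac
  # Assumption: lowercase the result, since GCS bucket names are lowercase.
  printf '%s\n' "$name" | tr '[:upper:]' '[:lower:]'
}

# Possible usage inside a repository:
# bucket_name_from_remote "$(git remote get-url origin 2>/dev/null)" "$(basename "$PWD")"
```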
Pointer files are created in the same path as the target artifact, not in the current working directory.
Examples:
gcspush xx/yy/zz.csv
creates:
xx/yy/zz.csv.gcs
and
gcspush models/run_01
creates:
models/run_01.gcs
This rule also holds in recursive mode.
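The placement rule amounts to appending .gcs to the artifact path, and restoring reverses it. A minimal sketch, with pointer_path and target_path as hypothetical helper names used only for illustration:

```shell
# Hypothetical helpers illustrating the path rule; not part of the toolchain.

# Artifact path -> pointer path (a trailing slash on a directory is dropped).
pointer_path() {
  printf '%s.gcs\n' "${1%/}"
}

# Pointer path -> artifact path (the .gcs suffix is removed).
target_path() {
  printf '%s\n' "${1%.gcs}"
}
```

Neither helper looks at the current working directory, which mirrors the rule that the pointer follows the target path.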
Clone the tools repository and run the installer:
git clone <your-tools-repo>
cd <your-tools-repo>
bash install_gcs_git_tools.sh
The installer places the commands in:
~/.local/bin
and installs:
git_gcs_artifacts
gcsinit
gcspush
gcspull
gcsstatus
Ensure ~/.local/bin is on your PATH.
For Bash:
export PATH="$HOME/.local/bin:$PATH"
Initialize the current repository:
gcsinit
Push a file:
gcspush data/file.parquet
Push a directory:
gcspush models/run_01
Restore a single file:
gcspull data/file.parquet.gcs
Restore all tracked artifacts in the repository:
gcspull --all
By default, gcspush does three things beyond upload:
- stages the pointer file and .gitignore
- creates a Git commit
- pushes the commit to the current Git remote
So this:
gcspush data/file.parquet
means:
- upload artifact
- create data/file.parquet.gcs
- update .gitignore
- git add
- git commit
- git push
This default applies to both normal and recursive push mode.
If needed, this can be disabled with flags described below.
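The Git side of this default, and the effect of the --no-commit and --no-push flags, can be sketched as a dry run that only prints commands. The function plan_git_steps and the commit message are hypothetical, and the sketch assumes that skipping the commit also skips the push; the real tool may word or order things differently.

```shell
# Hypothetical dry run: print the Git commands that would follow an upload.
# $1 = pointer path, $2 = yes/no (commit?), $3 = yes/no (push?)
plan_git_steps() {
  pointer=$1
  do_commit=$2
  do_push=$3
  # --no-commit: nothing is staged or committed (assumed to imply no push).
  [ "$do_commit" = yes ] || return 0
  printf 'git add -- %s .gitignore\n' "$pointer"
  # The commit message below is an illustrative placeholder.
  printf 'git commit -m "Add GCS pointer for %s"\n' "${pointer%.gcs}"
  # --no-push: the commit stays local.
  [ "$do_push" = yes ] || return 0
  printf 'git push\n'
}

# plan_git_steps data/file.parquet.gcs yes yes   # default behavior
# plan_git_steps data/file.parquet.gcs yes no    # as with --no-push
# plan_git_steps data/file.parquet.gcs no no     # as with --no-commit --no-push
```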
gcsinit initializes artifact storage support for the current repository.
This command:
- verifies that you are inside a Git repository
- determines the default bucket name
- creates the GCS bucket if needed
- ensures a .gitignore file exists
- appends a section header for managed artifacts if needed
gcsinit [options]
--bucket BUCKET
Use an explicit bucket name instead of the default derived name.
--location LOCATION
Set the GCS bucket location. Default is US.
--project PROJECT
Set the GCP project explicitly when creating the bucket.
-h, --help
Show help.
gcsinit
gcsinit --bucket git-my-explicit-bucket
gcsinit --location US-CENTRAL1
gcsinit --project my-gcp-project
gcspush uploads a file or directory to GCS and creates a .gcs pointer next to it.
There are two operating modes:
- normal mode
- recursive mode
Normal mode pushes one file or one directory.
gcspush PATH
gcspush PATH --no-commit
gcspush PATH --no-push
gcspush PATH --no-commit --no-push
--no-commit
Upload the artifact and create the pointer, but do not commit changes.
--no-push
Commit locally but do not git push.
-h, --help
Show help.
Any backend arguments that your setup supports can be forwarded after the path.
gcspush data/train.parquet
gcspush data/train.parquet --no-push
gcspush models/run_01 --no-commit
gcspush outputs/final_model --no-commit --no-push
For a file:
gcspush data/train.parquet
the tool will:
- upload data/train.parquet to GCS
- compute SHA256
- create data/train.parquet.gcs
- add /data/train.parquet to .gitignore
- add !/data/train.parquet.gcs to .gitignore
- stage the pointer and .gitignore
- commit
- push
For a directory:
gcspush models/run_01
the tool will:
- upload the full directory recursively
- generate a manifest describing the files
- upload the manifest to GCS
- create models/run_01.gcs
- ignore the local directory in .gitignore
- commit and push the pointer by default
Recursive mode scans a directory tree and pushes only files meeting a size threshold.
This is useful when you want to process a project tree and externalize only large files while leaving small files in Git.
gcspush -r DIRECTORY
gcspush -r DIRECTORY -t 25
gcspush -r DIRECTORY --threshold-mb 25
gcspush -r DIRECTORY --no-push
gcspush -r DIRECTORY --no-commit
gcspush -r DIRECTORY --no-commit --no-push
-r, --recursive
Enable recursive mode.
-t MB, --threshold-mb MB
Only push files whose size is greater than or equal to the threshold in megabytes. Default is 10.
--no-commit
Do not create the final batch commit.
--no-push
Create the batch commit locally but do not push.
-h, --help
Show help.
Push all files >= 10 MB:
gcspush -r ./data
Push all files >= 25 MB:
gcspush -r ./results -t 25
Upload only, no commit or push:
gcspush -r ./artifacts --no-commit --no-push
Commit locally but do not push:
gcspush -r ./artifacts --no-push
Recursive mode does not make one commit per file.
Instead, it:
- finds all qualifying files
- pushes them one by one to GCS
- creates .gcs pointers next to each file
- stages all pointers together
- stages .gitignore if changed
- creates one single Git commit
- pushes once
This is deliberate. It keeps commit history clean and avoids one commit per artifact.
If recursive mode sees a file like:
xx/a/b/large.csv
it creates:
xx/a/b/large.csv.gcs
The pointer always lives next to the file it represents.
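The scan step of recursive mode can be approximated with find. In this sketch, find_large_files is a hypothetical name, one megabyte is taken to be 1048576 bytes, and skipping .git and existing .gcs pointers is an assumption; the tool's actual selection logic is not documented here.

```shell
# Hypothetical sketch: list files at or above a size threshold.
# $1 = directory to scan, $2 = threshold in MB (positive integer, default 10).
find_large_files() {
  dir=$1
  threshold_mb=${2:-10}
  # Assumption: 1 MB = 1048576 bytes.
  min_bytes=$((threshold_mb * 1048576))
  # "-size +Nc" matches files larger than N bytes, so N = min_bytes - 1
  # gives "greater than or equal to the threshold".
  find "$dir" -name .git -prune -o \
    -type f ! -name '*.gcs' -size +"$((min_bytes - 1))c" -print
}

# find_large_files ./results 25
```

Each path printed would then get its own upload and its own pointer next to it, followed by a single commit for the whole batch.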
gcspull restores artifacts from .gcs pointers.
It supports three patterns:
- restore one pointer
- restore all pointers under a directory
- restore all pointers in the repository
gcspull POINTER.gcs
gcspull DIRECTORY
gcspull --all
--all
Restore every .gcs pointer in the repository.
-h, --help
Show help.
Restore one file:
gcspull data/train.parquet.gcs
Restore all pointers under a directory:
gcspull models
Restore all pointers in the repository:
gcspull --all
For a file pointer, gcspull:
- reads the GCS URI from the pointer
- downloads the file to the path obtained by removing .gcs
- verifies SHA256 if present
For a directory pointer, gcspull:
- downloads the manifest
- downloads every file in the manifest
- reconstructs the directory tree
- verifies SHA256 for each restored file
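The verification step can be reproduced by hand when debugging a restore. The helpers sha256_of and verify_sha256 below are hypothetical names; the sketch assumes either sha256sum (GNU coreutils) or shasum (common on macOS) is installed.

```shell
# Hypothetical helpers for checking a restored file against a recorded digest.

# Print the SHA256 hex digest of a file.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 256 "$1" | cut -d ' ' -f 1
  fi
}

# Succeed only if the file's digest equals the expected value.
# $1 = file, $2 = expected hex digest (for example the sha256 field of a pointer)
verify_sha256() {
  [ "$(sha256_of "$1")" = "$2" ]
}

# verify_sha256 data/train.parquet "<digest from the pointer>" || echo "mismatch"
```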
gcsstatus shows the current repository status related to Git–GCS artifacts.
gcsstatus
gcsstatus --dir DIRECTORY
--dir DIRECTORY
Restrict the pointer listing to a subtree.
-h, --help
Show help.
gcsstatus
gcsstatus --dir data
gcsstatus --dir models
The output includes:
- the output of git status --porcelain
- a list of .gcs pointers under the chosen path
- the default bucket name inferred for the repository
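For scripting, a pointer listing similar to the one above can be produced with find. This is only an approximation: list_pointers is a hypothetical name, and the real command may order or filter its listing differently.

```shell
# Hypothetical sketch: list .gcs pointer files under a directory (default: .),
# skipping the .git directory, in sorted order.
list_pointers() {
  find "${1:-.}" -name .git -prune -o -type f -name '*.gcs' -print | sort
}

# list_pointers data
```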
git_gcs_artifacts is the backend command used by the wrappers. Most users will not call it directly, but it is useful for debugging or scripting.
git_gcs_artifacts init [options]
git_gcs_artifacts add --local PATH [options]
git_gcs_artifacts pull [--pointer POINTER.gcs | --dir DIRECTORY | --all] [options]
git_gcs_artifacts status [--dir DIRECTORY]
init: Same function as gcsinit.
add: Direct backend for gcspush.
Example:
git_gcs_artifacts add --local data/train.parquet --commit --push
pull: Direct backend for gcspull.
Examples:
git_gcs_artifacts pull --pointer data/train.parquet.gcs
git_gcs_artifacts pull --dir models
git_gcs_artifacts pull --all
status: Direct backend for gcsstatus.
Example file pointer:
version: 1
type: file
uri: gs://git-myorg-myrepo/data/train.parquet
sha256: 2d4f6f5f4c...
size_bytes: 842931234
source_repo_relpath: data/train.parquet
Example directory pointer:
version: 1
type: dir
uri: gs://git-myorg-myrepo/models/run_01
manifest_uri: gs://git-myorg-myrepo/models/run_01.__manifest__.tsv
manifest_sha256: 7305f7...
file_count: 12
total_bytes: 3240932
source_repo_relpath: models/run_01
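Because pointers are plain text, individual fields are easy to read from a script. The sketch below assumes every field sits on its own line as "key: value", as in the examples above; pointer_field is a hypothetical helper, and the tool's own parser may be stricter.

```shell
# Hypothetical helper: print the value of one field from a pointer file.
# $1 = pointer file, $2 = field name (for example uri or sha256)
pointer_field() {
  awk -v key="$2" '
    # Match lines that start with "<key>: " and print the rest of the line.
    index($0, key ": ") == 1 { print substr($0, length(key) + 3); exit }
  ' "$1"
}

# pointer_field data/train.parquet.gcs uri
```

Matching only at the start of the line keeps a request for uri from picking up manifest_uri in a directory pointer.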
When an artifact is pushed, the tool updates .gitignore so the artifact itself is not committed, but the pointer remains trackable.
For a file:
/data/train.parquet
!/data/train.parquet.gcs
For a directory:
/models/run_01
!/models/run_01.gcs
This is appended exactly once per managed path.
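One way to get the append-once behavior is to check for an exact line match before writing. This is a sketch of the idea, not the tool's implementation; ignore_artifact is a hypothetical name, and the second argument exists only so the example can be tried on a scratch file.

```shell
# Hypothetical sketch: add the ignore rule and the pointer exception for one
# artifact, skipping any line that is already present.
# $1 = repo-relative artifact path, $2 = ignore file (default: .gitignore)
ignore_artifact() {
  relpath=$1
  gitignore=${2:-.gitignore}
  touch "$gitignore"
  for line in "/$relpath" "!/$relpath.gcs"; do
    # -x: whole-line match, -F: fixed string, -q: quiet.
    grep -qxF -- "$line" "$gitignore" || printf '%s\n' "$line" >> "$gitignore"
  done
}

# ignore_artifact data/train.parquet
# ignore_artifact data/train.parquet   # second call changes nothing
```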
gcspush data/cohort.parquet
Result:
- data/cohort.parquet uploaded
- data/cohort.parquet.gcs committed and pushed
gcspush models/checkpoint_17
Result:
- entire directory uploaded
- models/checkpoint_17.gcs committed and pushed
gcspush -r ./results -t 50
Result:
- all files >= 50 MB uploaded
- one batch commit
- one Git push
gcspull --all
Requirements:
- Git
- Google Cloud SDK with gcloud storage
- authenticated GCP environment
- permission to create and write GCS buckets
- a valid Git repository for normal operation
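A quick check that the command-line prerequisites are installed might look like the following. The helper missing_tools is hypothetical and only tests whether each command is on PATH; it does not check authentication or bucket permissions.

```shell
# Hypothetical helper: print the name of every listed command that is not
# available on PATH. No output means all of them were found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# missing_tools git gcloud
```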
Typical authentication:
gcloud auth login
gcloud config set project <project-id>
This tool is deliberately simple.
It is meant to provide:
- explicit artifact locations
- reproducible restore paths
- minimal hidden state
- easy debugging
- clean Git history
It does not try to be a full data versioning framework. It is a practical bridge between Git and GCS.
Advantages:
- external object storage is explicit
- bucket per repo can be created automatically
- no LFS server dependency
- pointer files are transparent text
Tradeoff:
- less integrated with Git hosting platforms
Advantages:
- simpler mental model
- much less machinery
- easier to inspect and debug
Tradeoff:
- fewer pipeline and data-versioning features
- Bucket names must be globally unique in GCS.
- Recursive mode only pushes files meeting the threshold.
- Recursive mode uses one final commit, not one commit per file.
- If you pass --no-commit, no commit is created.
- If you pass --no-push, the commit remains local.
- Pointer placement follows the target path, not the current directory.
gcsinit
gcspush data/train.parquet
gcspush models/run_01
gcspush -r ./results -t 25
gcspull data/train.parquet.gcs
gcspull models
gcspull --all
gcsstatus
License: GPL-3.0