Git–GCS Artifacts

Git–GCS Artifacts is a lightweight command-line system for managing large files and directories in Google Cloud Storage (GCS) while keeping Git repositories small, fast, and reproducible.

Instead of committing large binaries into Git history, the tools upload artifacts to GCS and store small pointer files with the .gcs suffix in the repository. Those pointers are versioned in Git and contain the metadata needed to restore the original artifacts later.

This is intended for workflows involving datasets, model checkpoints, simulation outputs, intermediate results, and other large binary assets that do not belong in normal Git history.

What it does

The toolchain provides five commands:

git_gcs_artifacts
The backend engine. This is the command the wrappers call internally.
gcsinit
Initializes GCS support in the current repository.
gcspush
Uploads a file or directory to GCS and creates a .gcs pointer file next to it.
gcspull
Restores files or directories from .gcs pointer files.
gcsstatus
Shows current pointer files and repository artifact status.

Core model

The workflow is intentionally simple:

You run gcspush on a file or directory.
The artifact is uploaded to GCS.
A .gcs pointer file is written next to the original path.
.gitignore is updated so the large artifact itself is not tracked.
The pointer file is committed and pushed to Git by default.
On another machine or clone, gcspull restores the artifact from the pointer.

This keeps the repository lightweight while preserving exact artifact references.

Key features

Works in any Git repository
Automatic per-repo bucket creation
Pointer files stored next to original artifact paths
Supports both files and directories
Recursive push mode with a size threshold
Default behavior is upload + commit + push
Recursive mode uses a single batch commit and a single push
Directory restore includes per-file SHA256 verification
No repo-specific scripts are required after installation

Bucket naming

By default, the bucket name is derived from the Git remote:

git-<org>-<repo>

Examples:

git-zeroknowledgediscovery-zebra-open
git-myorg-myrepo

If the Git remote cannot be parsed, the fallback bucket is:

git-local-<repo>

The bucket is created automatically if it does not already exist.

Pointer behavior

Pointer files are created in the same path as the target artifact, not in the current working directory.

Examples:

gcspush xx/yy/zz.csv

creates:

xx/yy/zz.csv.gcs

and

gcspush models/run_01

creates:

models/run_01.gcs

This rule also holds in recursive mode.

Installation

Clone the tools repository and run the installer:

git clone <your-tools-repo>
cd <your-tools-repo>
bash install_gcs_git_tools.sh

The installer places the commands in:

~/.local/bin

and installs:

git_gcs_artifacts
gcsinit
gcspush
gcspull
gcsstatus

Ensure ~/.local/bin is on your PATH.

For Bash:

export PATH="$HOME/.local/bin:$PATH"

Quick start

Initialize the current repository:

gcsinit

Push a file:

gcspush data/file.parquet

Push a directory:

gcspush models/run_01

Restore a single file:

gcspull data/file.parquet.gcs

Restore all tracked artifacts in the repository:

gcspull --all

Default commit behavior

By default, gcspush does three things beyond upload:

stages the pointer file and .gitignore
creates a Git commit
pushes the commit to the current Git remote

So this:

gcspush data/file.parquet

means:

upload artifact
create data/file.parquet.gcs
update .gitignore
git add
git commit
git push

This default applies to both normal and recursive push mode.

If needed, this can be disabled with flags described below.

Full command reference

1. `gcsinit`

Initializes artifact storage support for the current repository.

This command:

verifies that you are inside a Git repository
determines the default bucket name
creates the GCS bucket if needed
ensures a .gitignore file exists
appends a section header for managed artifacts if needed

Syntax

gcsinit [options]

Options

--bucket BUCKET
Use an explicit bucket name instead of the default derived name.

--location LOCATION
Set the GCS bucket location. Default is US.

--project PROJECT
Set the GCP project explicitly when creating the bucket.

-h, --help
Show help.

Examples

gcsinit
gcsinit --bucket git-my-explicit-bucket
gcsinit --location US-CENTRAL1
gcsinit --project my-gcp-project

2. `gcspush`

Uploads a file or directory to GCS and creates a .gcs pointer next to it.

There are two operating modes:

normal mode
recursive mode

Normal mode

Normal mode pushes one file or one directory.

Syntax

gcspush PATH
gcspush PATH --no-commit
gcspush PATH --no-push
gcspush PATH --no-commit --no-push

Options

--no-commit
Upload the artifact and create the pointer, but do not commit changes.

--no-push
Commit locally but do not git push.

-h, --help
Show help.

Any backend arguments that your setup supports can be forwarded after the path.

Examples

gcspush data/train.parquet
gcspush data/train.parquet --no-push
gcspush models/run_01 --no-commit
gcspush outputs/final_model --no-commit --no-push

What happens in normal mode

For a file:

gcspush data/train.parquet

the tool will:

upload data/train.parquet to GCS
compute SHA256
create data/train.parquet.gcs
add /data/train.parquet to .gitignore
add !/data/train.parquet.gcs to .gitignore
stage the pointer and .gitignore
commit
push

For a directory:

gcspush models/run_01

the tool will:

upload the full directory recursively
generate a manifest describing the files
upload the manifest to GCS
create models/run_01.gcs
ignore the local directory in .gitignore
commit and push the pointer by default

3. `gcspush -r` / recursive mode

Recursive mode scans a directory tree and pushes only files meeting a size threshold.

This is useful when you want to process a project tree and externalize only large files while leaving small files in Git.

Syntax

gcspush -r DIRECTORY
gcspush -r DIRECTORY -t 25
gcspush -r DIRECTORY --threshold-mb 25
gcspush -r DIRECTORY --no-push
gcspush -r DIRECTORY --no-commit
gcspush -r DIRECTORY --no-commit --no-push

Options

-r, --recursive
Enable recursive mode.

-t MB, --threshold-mb MB
Only push files whose size is greater than or equal to the threshold in megabytes. Default is 10.

--no-commit
Do not create the final batch commit.

--no-push
Create the batch commit locally but do not push.

-h, --help
Show help.

Examples

Push all files >= 10 MB:

gcspush -r ./data

Push all files >= 25 MB:

gcspush -r ./results -t 25

Upload only, no commit or push:

gcspush -r ./artifacts --no-commit --no-push

Commit locally but do not push:

gcspush -r ./artifacts --no-push

Recursive commit behavior

Recursive mode does not make one commit per file.

Instead, it:

finds all qualifying files
pushes them one by one to GCS
creates .gcs pointers next to each file
stages all pointers together
stages .gitignore if changed
creates one single Git commit
pushes once

This is deliberate. It keeps commit history clean and avoids one commit per artifact.

Recursive pointer behavior

If recursive mode sees a file like:

xx/a/b/large.csv

it creates:

xx/a/b/large.csv.gcs

The pointer always lives next to the file it represents.

4. `gcspull`

Restores artifacts from .gcs pointers.

It supports three patterns:

restore one pointer
restore all pointers under a directory
restore all pointers in the repository

Syntax

gcspull POINTER.gcs
gcspull DIRECTORY
gcspull --all

Options

--all
Restore every .gcs pointer in the repository.

-h, --help
Show help.

Examples

Restore one file:

gcspull data/train.parquet.gcs

Restore all pointers under a directory:

gcspull models

Restore all pointers in the repository:

gcspull --all

Pull behavior

For a file pointer, gcspull:

reads the GCS URI from the pointer
downloads the file to the path obtained by removing .gcs
verifies SHA256 if present

For a directory pointer, gcspull:

downloads the manifest
downloads every file in the manifest
reconstructs the directory tree
verifies SHA256 for each restored file

5. `gcsstatus`

Shows the current repository status related to Git–GCS artifacts.

Syntax

gcsstatus
gcsstatus --dir DIRECTORY

Options

--dir DIRECTORY
Restrict the pointer listing to a subtree.

-h, --help
Show help.

Examples

gcsstatus
gcsstatus --dir data
gcsstatus --dir models

Output includes

git status --porcelain
a list of .gcs pointers under the chosen path
the default bucket name inferred for the repository

6. `git_gcs_artifacts` backend

This is the backend command used by the wrappers. Most users will not call it directly, but it is useful for debugging or scripting.

Syntax

git_gcs_artifacts init [options]

git_gcs_artifacts add --local PATH [options]

git_gcs_artifacts pull [--pointer POINTER.gcs | --dir DIRECTORY | --all] [options]

git_gcs_artifacts status [--dir DIRECTORY]

Backend subcommands

`git_gcs_artifacts init`

Same function as gcsinit.

`git_gcs_artifacts add`

Direct backend for gcspush.

Example:

git_gcs_artifacts add --local data/train.parquet --commit --push

`git_gcs_artifacts pull`

Direct backend for gcspull.

Examples:

git_gcs_artifacts pull --pointer data/train.parquet.gcs
git_gcs_artifacts pull --dir models
git_gcs_artifacts pull --all

`git_gcs_artifacts status`

Direct backend for gcsstatus.

Pointer file format

File pointer example

version: 1
type: file
uri: gs://git-myorg-myrepo/data/train.parquet
sha256: 2d4f6f5f4c...
size_bytes: 842931234
source_repo_relpath: data/train.parquet

Directory pointer example

version: 1
type: dir
uri: gs://git-myorg-myrepo/models/run_01
manifest_uri: gs://git-myorg-myrepo/models/run_01.__manifest__.tsv
manifest_sha256: 7305f7...
file_count: 12
total_bytes: 3240932
source_repo_relpath: models/run_01

`.gitignore` behavior

When an artifact is pushed, the tool updates .gitignore so the artifact itself is not committed, but the pointer remains trackable.

For a file:

/data/train.parquet
!/data/train.parquet.gcs

For a directory:

/models/run_01
!/models/run_01.gcs

This is appended exactly once per managed path.

Typical workflows

Single large file

gcspush data/cohort.parquet

Result:

data/cohort.parquet uploaded
data/cohort.parquet.gcs committed and pushed

Model directory

gcspush models/checkpoint_17

Result:

entire directory uploaded
models/checkpoint_17.gcs committed and pushed

Sweep a tree for large files

gcspush -r ./results -t 50

Result:

all files >= 50 MB uploaded
one batch commit
one Git push

Restore everything after clone

gcspull --all

Requirements

Git
Google Cloud SDK with gcloud storage
authenticated GCP environment
permission to create and write GCS buckets
a valid Git repository for normal operation

Typical authentication:

gcloud auth login
gcloud config set project <project-id>

Design philosophy

This tool is deliberately simple.

It is meant to provide:

explicit artifact locations
reproducible restore paths
minimal hidden state
easy debugging
clean Git history

It does not try to be a full data versioning framework. It is a practical bridge between Git and GCS.

Comparison with alternatives

Compared with Git LFS

Advantages:

external object storage is explicit
bucket per repo can be created automatically
no LFS server dependency
pointer files are transparent text

Tradeoff:

less integrated with Git hosting platforms

Compared with DVC

Advantages:

simpler mental model
much less machinery
easier to inspect and debug

Tradeoff:

fewer pipeline and data-versioning features

Known behavior and caveats

Bucket names must be globally unique in GCS.
Recursive mode only pushes files meeting the threshold.
Recursive mode uses one final commit, not one commit per file.
If you pass --no-commit, no commit is created.
If you pass --no-push, the commit remains local.
Pointer placement follows the target path, not the current directory.

Example session

gcsinit

gcspush data/train.parquet

gcspush models/run_01

gcspush -r ./results -t 25

gcspull data/train.parquet.gcs

gcspull models

gcspull --all

gcsstatus

License

GPL-3.0

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
LICENSE		LICENSE
README.md		README.md
install_gcs_git_tools		install_gcs_git_tools
test_gcs_git_tools.sh		test_gcs_git_tools.sh

Folders and files

Latest commit

History

Repository files navigation

Git–GCS Artifacts

What it does

Core model

Key features

Bucket naming

Pointer behavior

Installation

Quick start

Default commit behavior

Full command reference

1. gcsinit

Syntax

Options

Examples

2. gcspush

Normal mode

Syntax

Options

Examples

What happens in normal mode

3. gcspush -r / recursive mode

Syntax

Options

Examples

Recursive commit behavior

Recursive pointer behavior

4. gcspull

Syntax

Options

Examples

Pull behavior

5. gcsstatus

Syntax

Options

Examples

Output includes

6. git_gcs_artifacts backend

Syntax

Backend subcommands

git_gcs_artifacts init

git_gcs_artifacts add

git_gcs_artifacts pull

git_gcs_artifacts status

Pointer file format

File pointer example

Directory pointer example

.gitignore behavior

Typical workflows

Single large file

Model directory

Sweep a tree for large files

Restore everything after clone

Requirements

Design philosophy

Comparison with alternatives

Compared with Git LFS

Compared with DVC

Known behavior and caveats

Example session

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. `gcsinit`

2. `gcspush`

3. `gcspush -r` / recursive mode

4. `gcspull`

5. `gcsstatus`

6. `git_gcs_artifacts` backend

`git_gcs_artifacts init`

`git_gcs_artifacts add`

`git_gcs_artifacts pull`

`git_gcs_artifacts status`

`.gitignore` behavior

Packages