feat: Port to serverless GCS and add configurable folder depth (jules) #9

Open
solsson wants to merge 3 commits into main from feat/gcs-serverless-port

Conversation


solsson commented May 27, 2025

This commit introduces a Google Cloud Storage (GCS) serverless implementation alongside the existing MinIO setup. Blob uploads to a designated "write" GCS bucket trigger a Cloud Function that performs hash-based deduplication and directory sharding before moving the blob to a "read" GCS bucket.

Key changes and features:

1.  **GCS Implementation (`gcs/` directory):**
    *   A new Go module `repos.se/minio-deduplication/v2/gcs` contains the GCS-specific logic.
    *   A Google Cloud Function (`gcs_transfer.HandleGCSEvent`) processes blob uploads.
    *   Uses SHA256 hashing and a 2-level directory sharding strategy (e.g., `aa/bb/`).
    *   Logging is implemented using `go.uber.org/zap`.
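
The dedup-and-shard scheme above can be sketched in a few lines. `shardedPath` is a hypothetical helper for illustration, not the module's actual API:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
)

// shardedPath derives a destination object name from the blob's content:
// the SHA256 hex digest is split so its first two byte pairs become
// directory levels (e.g. "aa/bb/aabb….ext"), keeping the original file
// extension so the object stays recognizable.
func shardedPath(content []byte, ext string) string {
	sum := sha256.Sum256(content)
	h := hex.EncodeToString(sum[:])
	return path.Join(h[0:2], h[2:4], h+ext)
}

func main() {
	fmt.Println(shardedPath([]byte("hello"), ".png"))
}
```

Identical content always hashes to the same path, which is what makes the move into the "read" bucket deduplicating.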

2.  **Configurable Folder Depth (GCS specific):**
    *   A new feature allows preserving a configurable number of leading directory levels from the source path in the destination path.
    *   Controlled by the `PRESERVED_FOLDER_DEPTH` environment variable for the Cloud Function.
    *   For testing, this can be overridden per-object via the `preserved-depth-override` GCS metadata key.
    *   Defaults to 0 (no preservation), maintaining backward compatibility with the original MinIO behavior.
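
A minimal sketch of the path logic, with a hypothetical `destinationPath` helper (not the real function); the depth-2 case reproduces the `testbucket-write/xyz/foo/file.png` → `xyz/foo/01/23/0123456789.png` example from the task description:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// destinationPath keeps the first `depth` directory levels of the source
// object name and appends the two-level hash sharding under them.
// depth=0 reproduces the original MinIO behavior. In real use hashHex
// would be a full SHA256 digest; a short placeholder shards the same way.
func destinationPath(srcName, hashHex string, depth int) string {
	parts := strings.Split(srcName, "/")
	if depth > len(parts)-1 { // never preserve the file name itself
		depth = len(parts) - 1
	}
	ext := path.Ext(srcName)
	preserved := parts[:depth]
	shards := []string{hashHex[0:2], hashHex[2:4], hashHex + ext}
	return path.Join(append(preserved, shards...)...)
}

func main() {
	fmt.Println(destinationPath("xyz/foo/file.png", "0123456789", 2))
	fmt.Println(destinationPath("xyz/foo/file.png", "0123456789", 0))
}
```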

3.  **Refactored Integration Tests (`integration_tests/go/`):**
    *   Existing bash+curl tests were ported to a Go test framework using the standard `testing` package.
    *   A `StorageService` interface abstracts backend operations (MinIO, GCS).
    *   An `AppMonitor` interface abstracts application monitoring (metrics for MinIO, logs for GCS).
    *   Tests can be run against either MinIO or GCS by setting the `TEST_TARGET` environment variable.
    *   Includes `TestBasicUploadAndTransfer` ported from `basic-flow.sh`.
    *   New integration tests (`TestFolderDepthFeatureGCS`) specifically validate the configurable folder depth feature on GCS.
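
The two seams might look roughly like this; the method sets are guesses rather than the actual interfaces in `integration_tests/go/`, and `fakeStorage` exists only to show a test driving either backend through the same surface:

```go
package main

import "fmt"

// StorageService abstracts blob operations over MinIO or GCS.
type StorageService interface {
	Upload(bucket, name string, data []byte) error
	Exists(bucket, name string) (bool, error)
}

// AppMonitor abstracts "did the app process this blob?" — Prometheus
// metrics for MinIO, Cloud Logging entries for GCS.
type AppMonitor interface {
	WaitForTransfer(objectName string) error
}

// fakeStorage is an in-memory stand-in implementing StorageService.
type fakeStorage struct{ objects map[string][]byte }

func (f *fakeStorage) Upload(bucket, name string, data []byte) error {
	f.objects[bucket+"/"+name] = data
	return nil
}

func (f *fakeStorage) Exists(bucket, name string) (bool, error) {
	_, ok := f.objects[bucket+"/"+name]
	return ok, nil
}

func main() {
	var s StorageService = &fakeStorage{objects: map[string][]byte{}}
	s.Upload("testbucket-write", "file.png", []byte("data"))
	ok, _ := s.Exists("testbucket-write", "file.png")
	fmt.Println(ok)
}
```

A test would pick the concrete implementation from `TEST_TARGET` and run the same assertions against either backend.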

4.  **Unit Tests:**
    *   Unit tests for the GCS transfer logic (`gcs_transfer_test.go`) cover path generation, hashing, and the folder depth feature.
    *   The GCS code was refactored for better testability (e.g., using interfaces).

5.  **Documentation (`gcs/README.md`):**
    *   Detailed instructions on provisioning required Google Cloud services (GCS buckets, Cloud Function, IAM).
    *   Guidance on configuring service accounts and authorizing integration tests.
    *   A sample `gcloud` deployment command for the Cloud Function.

Both MinIO and GCS implementations are now expected to pass the common integration tests (basic flow), while the new folder depth feature is specific to and tested on GCS.

solsson commented May 27, 2025

This was the task description given to Jules:

I'd like a port of this project so that the same logic runs serverless on google cloud. A blob upload to a "write" bucket should trigger a move to a "read" bucket using the same hash based deduplication strategy and directory sharding.

The feature branch should be named feat/gcs-serverless-port

I want to use Go for this so we can reuse code if necessary.
For now use a subdir gcs with its own go.mod.

I want both implementations to pass the same integration tests. It's likely that the current bash+curl based tests must be ported to a test framework that can use a thin abstraction.

I want the port to have one new feature that the golang+minio impl currently does not have: A configurable folder depth that should be preserved. If I set that to 1 the first directory level of the write path should be used for the read path as well, with the sharding placed under it. For example with value 2 a blob written to testbucket-write/xyz/foo/file.png with checksum 0123456789 should be placed at testbucket-read/xyz/foo/01/23/0123456789.png. The feature should be backwards compatible (default is to preserve zero levels) and have its own unit and integration tests.

Please document how to provision the necessary features in google cloud and how to authorize tests to use these resources. I expect that you get integration tests up and running early on in the implementation process.

If there's a choice of logging library I prefer zap with replace globals and zap.L() based statements.

This commit introduces a GitHub Actions workflow to automate testing for both unit and integration tests.

The workflow (`.github/workflows/go-tests.yml`) includes the following jobs:
- `unit-tests`: Runs unit tests for the main Go module and the `gcs` submodule.
- `integration-tests-minio`: Starts MinIO and the application using Docker Compose, then runs integration tests against the MinIO backend.
- `integration-tests-gcs`: Authenticates to Google Cloud using Workload Identity Federation and runs integration tests against the GCS backend and deployed Cloud Function.

Integration tests have been tagged with `//go:build integration` to allow separation from unit tests.
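
A file carrying that constraint starts like this (package name illustrative); plain `go test ./...` then skips it, while `go test -tags=integration ./...` includes it:

```go
//go:build integration

// Compiled only when the "integration" build tag is set, so the default
// test invocation runs unit tests alone.
package integration_test
```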

The `gcs/README.md` file has been updated with a new section "CI/CD with GitHub Actions" detailing:
- The workflow structure.
- Instructions for running tests locally.
- Required GitHub Secrets for GCS integration tests (e.g., `GCP_PROJECT_ID`, `GCP_SA_EMAIL`, GCS bucket names, Cloud Function details).
- Notes on setting up Workload Identity Federation in GCP.

Updates the Go version used in the GitHub Actions workflow (`.github/workflows/go-tests.yml`) from '1.22' to '1.23'.

This change ensures that the CI environment uses the same Go version as specified in the project's `go.mod` files (which indicate Go 1.23.x), preventing compatibility issues where code using Go 1.23 features might fail to compile under an older toolchain.