persistentStorage #1318

Open
alexcos20 wants to merge 9 commits into main from feature/persistentStorage

Conversation

Member

@alexcos20 alexcos20 commented Apr 7, 2026

What it is

Persistent Storage is a simple bucket + file store intended for long-lived artifacts that Ocean Node needs to keep across requests and restarts, and to reference in compute jobs for faster access. Files cannot be downloaded (which eliminates the risk of serving unwanted content).

Key primitives:

  • Bucket: a logical container for files.
  • File: binary content stored inside a bucket.
  • Bucket registry: a local SQLite table that stores bucket metadata (owner, access lists, createdAt).
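
As a sketch of these primitives, a registry row might look like the following. The field names (owner, accessListJson, createdAt) follow the description above, but the exact shape is an assumption, and the real SQLite table is replaced here by an in-memory map for illustration:

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical shape of a bucket-registry row; field names mirror the
// description above (owner, access lists, createdAt) but are not taken
// from the actual schema.
interface BucketRecord {
  bucketId: string;       // UUIDv4, generated on create
  owner: string;          // address that created the bucket
  accessListJson: string; // serialized ACL, checked on every file operation
  createdAt: number;      // unix timestamp (ms)
}

// In-memory stand-in for the SQLite table, for illustration only.
const registry = new Map<string, BucketRecord>();

function createBucket(owner: string, accessList: string[]): BucketRecord {
  const record: BucketRecord = {
    bucketId: randomUUID(),
    owner,
    accessListJson: JSON.stringify(accessList),
    createdAt: Date.now()
  };
  registry.set(record.bucketId, record);
  return record;
}
```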

Architecture (high level)

Components

  • Handlers (protocol layer): src/components/core/handler/persistentStorage.ts

    • Implements protocol commands such as create bucket, list files, upload, delete, and get buckets.
    • Validates auth (token or signature) and applies high-level authorization checks.
  • Persistent storage backends (storage layer): src/components/persistentStorage/*

    • PersistentStorageFactory: shared functionality (SQLite bucket registry, access list checks).
    • PersistentStorageLocalFS: local filesystem backend.
    • PersistentStorageS3: stub for future S3-compatible backend.
  • HTTP routes (HTTP interface): src/components/httpRoutes/persistentStorage.ts

    • Exposes REST-ish endpoints under /api/services/persistentStorage/... that call the same handlers. (see API.md)
  • Compute integration (localfs): starting a job validates that the consumer may use each nodePersistentStorage dataset file object; the Docker C2D engine bind-mounts those files into the container at /data/persistentStorage/<bucketId>/<fileName> (read-only).
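
The bind mount described in the last bullet can be sketched as below. The container path /data/persistentStorage/<bucketId>/<fileName> and the read-only flag come from the description above; the helper name and the host storage root are assumptions:

```typescript
import * as path from "node:path";

// Illustrative shape of a persistent-storage file reference.
interface PersistentStorageFileObject {
  type: "nodePersistentStorage";
  bucketId: string;
  fileName: string;
}

// Sketch of the bind-mount string the Docker C2D engine might build
// (Docker's "host:container:mode" bind syntax). The real helper is called
// getDockerMountObject per the description; this is a simplified stand-in.
function buildBindMount(
  storageRoot: string,
  fileObject: PersistentStorageFileObject
): string {
  const hostPath = path.join(storageRoot, fileObject.bucketId, fileObject.fileName);
  const containerPath = `/data/persistentStorage/${fileObject.bucketId}/${fileObject.fileName}`;
  // ":ro" makes the mount read-only inside the container, as described above.
  return `${hostPath}:${containerPath}:ro`;
}
```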

Features

  • Create bucket

    • Creates a bucket id (UUID), persists it in SQLite with owner and accessListJson, and creates a local directory (localfs).
  • List buckets (by owner)

    • Returns buckets from the registry filtered by owner (mandatory arg).
    • Applies ACL filtering for non-owners.
  • Upload file

    • Writes a stream to the backend.
    • Enforces bucket ACL.
  • List files

    • Returns file metadata (name, size, lastModified) for a bucket.
    • Enforces bucket ACL.
  • Delete file

    • Deletes the named file from the bucket.
    • Enforces bucket ACL.
  • getFileObject

    • Returns the file's fileObject representation for C2D use.
    • Enforces bucket ACL.
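
All of the file operations above share the same bucket-level ACL check, which might look roughly like this. The field names mirror the registry description (owner, accessListJson); the real implementation also consults on-chain access-list contracts, which this sketch omits:

```typescript
// Minimal ACL sketch: the bucket owner always passes; other consumers must
// appear on the bucket's access list. Handlers translate the thrown error
// into HTTP 403.
interface Bucket {
  owner: string;
  accessListJson: string;
}

function assertConsumerAllowed(bucket: Bucket, consumer: string): void {
  if (bucket.owner.toLowerCase() === consumer.toLowerCase()) return;
  const acl: string[] = JSON.parse(bucket.accessListJson);
  if (!acl.some((a) => a.toLowerCase() === consumer.toLowerCase())) {
    throw new Error("Access denied for bucket");
  }
}
```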

Compute (startCompute + Docker engine)

  • PaidComputeStartHandler / FreeComputeStartHandler: for each algorithm/dataset entry, if fileObject is nodePersistentStorage (and persistentStorage.type === 'localfs' is enabled), the node checks assertConsumerAllowedForBucket before proceeding; misconfiguration or denial returns 400/403 as appropriate.
  • compute_engine_docker: when creating the algorithm container, bind-mounts assets whose fileObject.type === 'nodePersistentStorage' via getDockerMountObject; uploadData skips downloading those assets (the data comes from the mount).
  • Job outputs remain packaged as data/outputs/outputs.tar on the host; tests read outputs/ps-result.txt from inside that archive.
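
For illustration, a dataset entry referencing persistent storage in a startCompute payload might look like this. Only fileObject.type ('nodePersistentStorage'), bucketId/fileName, and the in-container mount path are taken from the description above; the surrounding field names and example values are assumptions:

```typescript
// Hypothetical dataset entry for a startCompute request.
const datasetEntry = {
  documentId: "did:op:example",  // illustrative DID
  serviceId: "0",
  fileObject: {
    type: "nodePersistentStorage",
    bucketId: "123e4567-e89b-42d3-a456-426614174000", // bucket's UUIDv4 (example value)
    fileName: "training-data.csv"
  }
};

// Inside the algorithm container the file appears read-only at:
const mountPath =
  `/data/persistentStorage/${datasetEntry.fileObject.bucketId}/${datasetEntry.fileObject.fileName}`;
```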

Test plan

  • nvm use
  • npm run build-tests
  • npm run mocha "./dist/test/integration/persistentStorage.test.js"
  • npm run test:computeintegration -- --grep "persistent storage"

@alexcos20
Member Author

/run-security-scan


@alexcos20 alexcos20 left a comment


AI automated code review (Gemini 3).

Overall risk: medium

Summary:
This pull request introduces a new 'Persistent Storage' feature for Ocean Node, allowing users to store and manage files directly on the node, primarily for use in C2D (Compute-to-Data) jobs. It provides both P2P commands and HTTP API endpoints for various operations like creating buckets, uploading/listing/deleting files, and retrieving file objects. The initial implementation includes a local filesystem backend, with a placeholder for future S3 integration. Crucially, it integrates robust access control mechanisms based on bucket ownership and configurable Access Lists, extending these checks into the C2D workflow to ensure only authorized consumers can use persistent storage files in their computations. The PR also includes comprehensive documentation, new configuration schemas, and extensive integration tests covering happy paths and security scenarios. The P2P protocol has been enhanced to support streaming request bodies, which is vital for file uploads.

Comments:
• [INFO][refactor] The refactoring to handle streaming request bodies (p2pStreamBody flag and Readable.from with LP frames) is a significant improvement, enabling efficient file uploads over P2P. This change is crucial for the new persistent storage feature.
• [INFO][performance] Redacting stream and rawData payloads from logs is a good practice. It prevents excessive logging of potentially large or sensitive data, which can improve log readability and performance, and reduce storage costs.
• [INFO][bug] Improving error propagation by extracting httpStatus from caught errors and using it in sendErrorAndClose is a good fix. This ensures more accurate error reporting to the client instead of a generic 500.
• [INFO][feature] The integration of nodePersistentStorage file objects into the C2D Docker engine as bind mounts is well-implemented. This enables compute jobs to seamlessly access data stored in the node's persistent storage. The error handling for missing bucketId/fileName and disabled persistent storage is also robust.
• [WARNING][security] The functions rejectPersistentStorageFileObjectOnAlgorithm and ensureConsumerAllowedForPersistentStorageLocalfsFileObject are critical for security. Rejecting persistent storage for algorithms prevents potential arbitrary code execution or data exfiltration risks, while enforcing ACLs for datasets ensures data confidentiality. These checks are well-placed in ComputeInitializeHandler and PaidComputeStartHandler/FreeComputeStartHandler.
• [INFO][refactor] The requirePersistentStorage helper function is useful for centralizing the logic to check if persistent storage is enabled and available, making handlers cleaner and preventing redundant checks.
• [WARNING][security] In PersistentStorageListFilesHandler and other handlers, the e instanceof PersistentStorageAccessDeniedError check correctly translates access denied errors into HTTP 403. However, for generic 500 errors, e.message is returned directly. While generally acceptable for internal errors, ensure e.message doesn't inadvertently leak sensitive system details in other potential error paths. It seems to be handled correctly for specific errors like 'file not found' (404).
• [INFO][refactor] The readRawBody helper function is a good way to handle raw byte uploads for POST requests, ensuring compatibility with various content types. This is essential for the file upload functionality.
• [BUG][performance] The change to gracefully handle non-cloneable objects (like streams) during structuredClone for logging is a good fix. It prevents potential crashes in validateCommandParameters when a command contains such properties, ensuring stability without sacrificing logging capabilities.
• [INFO][style] The abstract PersistentStorageFactory is a well-designed pattern for supporting multiple storage backends (e.g., localfs, S3). It clearly defines the interface and provides common functionality like the SQLite bucket registry.
• [WARNING][security] The validateBucket function, specifically enforcing UUIDv4 format for bucketId, is a good security measure. This helps prevent path traversal attacks by ensuring bucketId cannot contain malicious characters that could manipulate file paths. This is essential, especially for the localfs backend.
• [INFO][refactor] The new checkAddressOnAccessList utility function that accepts an OceanNode instance and an array of AccessList is a more flexible and robust approach for access control. It centralizes the logic for checking multiple chains and access lists, simplifying its use across various components.
• [WARNING][security] The ensureBucketExists function includes crucial path validation by checking resolvedBucketPath !== bucketsRoot && !resolvedBucketPath.startsWith(bucketsRoot + path.sep). This is an effective measure against path traversal by ensuring that operations are constrained within the designated storage root for buckets. Combined with validateBucket, this significantly reduces risk.
• [WARNING][security] The ensureFileExists function's check for fileName.includes('/') || fileName.includes('\\') is a vital safeguard against path traversal within a specific bucket. It ensures that file names cannot escape their intended directory, which is important for maintaining the integrity and security of the storage system.
• [INFO][feature] The PersistentStorageS3 class serving as a placeholder with 'not implemented yet' errors is a good pattern. It outlines the future direction and API contract for an S3 backend without delaying the release of the localfs functionality.
• [INFO][feature] The PersistentStorageConfigSchema with Zod validation is excellent. It ensures that the configuration for persistent storage is correctly structured and validated, providing robustness and clarity for node operators, especially with conditional requirements for localfs vs. s3 options.
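
Taken together, the path-safety measures flagged in these comments (validateBucket, ensureBucketExists, ensureFileExists) can be sketched as follows. The function names and bodies are illustrative reconstructions from the review text, not the actual implementations; the UUIDv4 regex is the standard pattern, not copied from the PR:

```typescript
import * as path from "node:path";

// (1) bucket ids restricted to UUIDv4 cannot carry "/", "\\", or "..",
// so they cannot manipulate file paths.
const UUID_V4 =
  /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

function isValidBucketId(bucketId: string): boolean {
  return UUID_V4.test(bucketId);
}

// (2) the resolved bucket directory must stay strictly under the storage root.
function isBucketPathContained(bucketsRoot: string, bucketId: string): boolean {
  const resolved = path.resolve(bucketsRoot, bucketId);
  return resolved !== bucketsRoot && resolved.startsWith(bucketsRoot + path.sep);
}

// (3) file names must not contain path separators, so they cannot escape
// their bucket directory.
function isSafeFileName(fileName: string): boolean {
  return !fileName.includes("/") && !fileName.includes("\\");
}
```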

@alexcos20 alexcos20 marked this pull request as ready for review April 9, 2026 07:23
@alexcos20
Member Author

/run-security-scan


@alexcos20 alexcos20 left a comment


AI automated code review (Gemini 3).

Overall risk: medium

Summary:
This pull request introduces a major new feature: Persistent Storage for Ocean Node. It provides a local filesystem backend, with a placeholder for an S3 backend, to store long-lived artifacts for C2D jobs. The changes span new P2P commands and HTTP endpoints, dedicated handlers, comprehensive access control, and integration with the C2D Docker engine via bind mounts. Extensive documentation and integration tests are also included.

Comments:
• [INFO][other] The addition of stream?: Readable | null to the base Command interface is a significant change, fundamentally altering how commands can carry data. While necessary for streaming uploads, ensure that all existing command handlers are robust to this new optional field, even if they don't use it. Also, consider if there are any unintended side effects or compatibility issues with older P2P clients that don't expect a stream.
• [WARNING][performance] The logPayload cloning and stringifying for logging can be expensive, especially for large rawData or stream payloads. While stream is already redacted, rawData might still contain large buffers. Consider if [${logPayload.rawData.length} bytes] is sufficient for all logging levels, or if there are scenarios where parts of rawData might still be inadvertently logged or stringified unnecessarily, impacting performance for high-volume streams.
• [INFO][other] Good improvement to log the correct HTTP status and message from the error object, if available. This will greatly help in debugging issues with various handlers returning specific HTTP error codes.
• [INFO][other] The toUint8ArrayChunk function handles various chunk types, which is good for flexibility. Ensure comprehensive testing, especially for edge cases where ArrayBuffer.isView might behave unexpectedly or different buffer types could be passed that are not explicitly handled.
• [INFO][other] The integration with the C2D Docker engine via bind mounts for nodePersistentStorage is well-implemented. The checks for bucketId/fileName and persistent storage availability are crucial. Ensure that the target path inside the Docker container (/data/persistentStorage/...) is documented for users creating algorithms that consume persistent storage.
• [INFO][security] Rejecting nodePersistentStorage file objects for algorithms and enforcing ensureConsumerAllowedForPersistentStorageLocalfsFileObject for datasets is a strong security and design choice. This prevents misuse and ensures proper authorization for compute jobs accessing sensitive data.
• [INFO][security] The node-level allow list check (config.persistentStorage?.accessLists) for PersistentStorageCreateBucketHandler is a good layer of defense, controlling who can create buckets at all. This is a critical distinction from bucket-specific ACLs.
• [INFO][other] The dbReadyPromise pattern is a robust way to ensure SQLite operations only proceed after the schema is created, preventing race conditions during initialization. This is a good implementation detail for reliability.
• [SECURITY][bug] The path traversal checks in ensureBucketExists using path.resolve and startsWith are critical for security. Similarly, the fileName validation preventing path separators (/, \\) is essential. Double-check any future operations that construct paths to ensure they adhere to these same strict validation principles.
• [INFO][other] The new checkAddressOnAccessList utility simplifies access control checks by iterating through chains. This is a good refactoring, making it easier to integrate access list checks across the codebase. Ensure all consumers of access list checks are updated to use this new utility where appropriate.
• [INFO][other] The PersistentStorageConfigSchema with its superRefine for validating localfs and s3 options is very well-structured. This ensures configuration correctness from the start and reduces potential runtime errors.
• [INFO][other] The new integration tests for persistent storage with compute jobs are excellent. They cover critical happy path and authorization denial scenarios, significantly increasing confidence in the feature's correctness and security.
• [INFO][other] The dedicated integration tests for the persistent storage API are comprehensive, covering creation, upload, listing, deletion, file object retrieval, and various access control cases. This strong test coverage is highly valuable for a new, security-sensitive feature.
• [INFO][documentation] The new persistentStorage.md documentation is thorough and well-structured, covering architecture, access control, and usage examples. This will be invaluable for developers and users. Please give it a final read-through for any potential ambiguities, especially in the 'Configuration' and 'Usage' sections, to ensure it's crystal clear for new users.
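
As an illustration of the conditional validation the PersistentStorageConfigSchema comments describe, the rule can be sketched in plain TypeScript. The real code uses Zod with superRefine, which is not reproduced here, and the field names below are assumptions:

```typescript
// Hypothetical config shape: localfs needs a local path, s3 needs S3 options.
interface PersistentStorageConfig {
  type: "localfs" | "s3";
  localfs?: { path: string };
  s3?: { endpoint: string; bucket: string };
}

// Mirrors the superRefine rule: report an error when the options required
// by the selected backend type are missing.
function validatePersistentStorageConfig(cfg: PersistentStorageConfig): string[] {
  const errors: string[] = [];
  if (cfg.type === "localfs" && !cfg.localfs?.path) {
    errors.push("localfs.path is required when type is 'localfs'");
  }
  if (cfg.type === "s3" && (!cfg.s3?.endpoint || !cfg.s3?.bucket)) {
    errors.push("s3 options are required when type is 's3'");
  }
  return errors;
}
```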

@alexcos20 alexcos20 requested a review from ndrpp April 9, 2026 14:53