Implement Google Drive Import Tool with Comprehensive Monitoring #4


Summary

Implement a comprehensive Google Drive import tool that allows users to authenticate with their Google account, select files/folders from their Drive, and import them into PinShare with full monitoring capabilities. The tool should support both manual one-time imports and optional continuous synchronization.

User Story

As a PinShare user, I want to import my files from Google Drive into the decentralized PinShare network, so that I can:

  • Liberate my data from centralized cloud storage
  • Share my files via P2P/IPFS without relying on Google's infrastructure
  • Maintain a decentralized backup of my important files
  • Optionally keep my PinShare instance in sync with my Drive

Current State

PinShare currently supports file uploads via:

  • File system watcher monitoring an ./upload folder
  • Manual file placement by users

Architecture Gaps for Google Drive Import:

  1. ❌ No user authentication system (OAuth or otherwise)
  2. ❌ No Google Drive API integration
  3. ❌ No background job queue for long-running operations
  4. ❌ No per-user file import tracking
  5. ❌ Limited upload status tracking (designed for quick local file processing)
  6. ❌ No continuous sync/monitoring capability

Existing Assets to Leverage:

  • ✅ Robust upload pipeline (validation → hashing → security scanning → IPFS → metadata storage)
  • ✅ Upload status tracking system (UploadStatusManager)
  • ✅ Real-time UI updates via React Query
  • ✅ Security scanning infrastructure (VirusTotal, ClamAV, P2P-Sec)
  • ✅ P2P metadata distribution via PubSub

Proposed Solution

Architecture Components

1. User Authentication System

  • OAuth 2.0 with PKCE for Google Drive API access
  • Per-user token storage (encrypted)
  • Token refresh mechanism
  • Support for multiple users on the same PinShare instance

Tech Stack:

  • google.golang.org/api/drive/v3 for Google Drive API
  • golang.org/x/oauth2 for OAuth flow
  • Secure token storage (encrypted database or keyring)
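The PKCE verifier/challenge pair at the heart of the flow needs nothing beyond the standard library. A minimal sketch following RFC 7636 (function names here are illustrative, not part of PinShare):

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// newCodeVerifier returns a high-entropy code_verifier per RFC 7636:
// 32 random bytes, base64url-encoded without padding (43 characters).
func newCodeVerifier() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// codeChallenge derives the S256 code_challenge included in the
// authorization URL; the verifier itself is sent only on token exchange.
func codeChallenge(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	verifier, _ := newCodeVerifier()
	fmt.Println(len(verifier), len(codeChallenge(verifier))) // both 43 chars
}
```

Recent versions of golang.org/x/oauth2 ship equivalent helpers (`oauth2.GenerateVerifier` and the S256 challenge option), so in practice the library versions should be preferred over hand-rolled ones.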

2. Google Drive Integration

  • List user's Drive folders and files
  • Stream file downloads directly to PinShare
  • Folder hierarchy preservation (optional)
  • Metadata mapping (Drive metadata → PinShare metadata)

Features:

  • Folder tree browser UI
  • Multi-select file/folder selection
  • Filter by file type, size, date
  • Preview file list before import

3. Background Job System

  • Job queue for import operations
  • Worker pool for concurrent processing
  • Job persistence (survive restarts)
  • Progress tracking per job
  • Retry mechanism with exponential backoff

Job States:

pending → downloading → hashing → scanning → uploading → completed
                                                      └→ failed (with retry)

4. Enhanced Monitoring Dashboard

Real-time metrics:

  • Overall import progress (X of Y files, % complete)
  • Per-file status with detailed stages
  • Error tracking with specific failure reasons
  • Bandwidth metrics (current speed, average speed, ETA)
  • Success/failure statistics

Historical tracking:

  • Import job history
  • Per-file import logs
  • Retry attempts
  • Total data imported

5. Continuous Sync Engine (Phase 3)

  • Watch Google Drive for changes (polling or webhooks)
  • Auto-import new/modified files
  • Configurable sync interval
  • Conflict resolution strategy
  • Sync pause/resume capability

Technical Implementation

Backend API Endpoints

Authentication

POST /api/google-drive/authorize
  → Initiates OAuth flow, returns authorization URL

GET /api/google-drive/callback?code={authCode}
  → OAuth redirect target (Google redirects the browser here via GET);
    exchanges the auth code for tokens and stores them encrypted

GET /api/google-drive/auth-status
  → Returns whether user is authenticated

DELETE /api/google-drive/revoke
  → Revokes access and deletes tokens

File Selection

GET /api/google-drive/folders?path={folderId}
  → Lists files/folders in specified folder (defaults to root)
  Response: { id, name, mimeType, size, modifiedTime, parents[] }

POST /api/google-drive/preview-import
  Request: { fileIds: [], folderIds: [], recursive: bool }
  Response: { files: [], totalSize, totalCount }

Import Operations

POST /api/google-drive/import
  Request: { 
    fileIds: [], 
    folderIds: [], 
    recursive: bool,
    options: { preserveHierarchy, skipDuplicates }
  }
  Response: { jobId, status, filesQueued }

GET /api/google-drive/import/{jobId}/status
  Response: { 
    jobId, 
    status: "pending|running|completed|failed|cancelled",
    progress: { 
      totalFiles, 
      completedFiles, 
      failedFiles, 
      currentFile,
      percentComplete,
      bytesTransferred,
      totalBytes,
      transferRate,
      estimatedTimeRemaining
    },
    files: [
      { 
        driveId, 
        fileName, 
        status: "pending|downloading|hashing|scanning|uploading|completed|failed",
        progress: 0-100,
        error: ""
      }
    ]
  }

POST /api/google-drive/import/{jobId}/cancel
  → Cancels running import job

POST /api/google-drive/import/{jobId}/retry-failed
  → Retries all failed files in the job

GET /api/google-drive/import/history
  → Returns list of past import jobs with summary stats

Continuous Sync (Phase 3)

POST /api/google-drive/sync/configure
  Request: { enabled, folderId, interval, options }
  
GET /api/google-drive/sync/status
  Response: { enabled, lastSync, nextSync, syncedFiles, errors }

POST /api/google-drive/sync/trigger
  → Manually triggers sync cycle

Database Schema

Users Table

CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  google_id VARCHAR(255) UNIQUE NOT NULL,
  email VARCHAR(255) NOT NULL,
  encrypted_access_token TEXT,
  encrypted_refresh_token TEXT,
  token_expiry TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

Import Jobs Table

CREATE TABLE import_jobs (
  id UUID PRIMARY KEY,
  user_id INTEGER REFERENCES users(id),
  status VARCHAR(50), -- pending, running, completed, failed, cancelled
  total_files INTEGER,
  completed_files INTEGER,
  failed_files INTEGER,
  total_bytes BIGINT,
  transferred_bytes BIGINT,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  options JSONB,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

Import Files Table

CREATE TABLE import_files (
  id SERIAL PRIMARY KEY,
  job_id UUID REFERENCES import_jobs(id),
  drive_file_id VARCHAR(255),
  file_name VARCHAR(500),
  file_size BIGINT,
  status VARCHAR(50), -- pending, downloading, hashing, scanning, uploading, completed, failed
  progress INTEGER, -- 0-100
  sha256_hash VARCHAR(64),
  ipfs_cid VARCHAR(255),
  error_message TEXT,
  retry_count INTEGER DEFAULT 0,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

Sync Configurations Table (Phase 3)

CREATE TABLE sync_configs (
  id SERIAL PRIMARY KEY,
  user_id INTEGER REFERENCES users(id),
  drive_folder_id VARCHAR(255),
  enabled BOOLEAN DEFAULT TRUE,
  sync_interval INTEGER, -- minutes
  last_sync_at TIMESTAMP,
  next_sync_at TIMESTAMP,
  options JSONB,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

Internal Architecture

New Go Packages

internal/gdrive/

  • client.go - Google Drive API client wrapper
  • oauth.go - OAuth flow management
  • downloader.go - File download from Drive
  • mapper.go - Drive metadata → PinShare metadata conversion

internal/jobs/

  • queue.go - Job queue interface
  • worker.go - Worker pool implementation
  • import_job.go - Import job definition and state machine
  • persistence.go - Job state persistence

internal/users/

  • auth.go - User authentication
  • store.go - User data storage
  • tokens.go - Encrypted token management

internal/sync/ (Phase 3)

  • engine.go - Continuous sync orchestration
  • watcher.go - Drive change detection
  • scheduler.go - Sync scheduling

Integration with Existing Systems

Upload Pipeline Integration:

// In internal/jobs/import_job.go
func (j *ImportJob) processFile(driveFile *drive.File) error {
    // 1. Download from Google Drive into a temp location
    j.updateFileStatus(driveFile.Id, "downloading", 0)
    localPath, err := j.driveClient.Download(driveFile)
    if err != nil {
        return j.failFile(driveFile.Id, fmt.Sprintf("download failed: %v", err))
    }
    defer os.Remove(localPath) // clean up the temp file on every exit path
    
    // 2. Plug into the existing upload pipeline
    j.updateFileStatus(driveFile.Id, "hashing", 30)
    sha256 := psfs.ComputeSHA256(localPath)
    
    j.updateFileStatus(driveFile.Id, "scanning", 50)
    secResult := psfs.SecurityCheck(localPath, sha256)
    if !secResult.Safe {
        return j.failFile(driveFile.Id, "security scan failed")
    }
    
    j.updateFileStatus(driveFile.Id, "uploading", 70)
    cid, err := psfs.AddFileIPFS(localPath)
    if err != nil {
        return j.failFile(driveFile.Id, fmt.Sprintf("IPFS add failed: %v", err))
    }
    
    // 3. Store metadata for P2P distribution
    j.updateFileStatus(driveFile.Id, "uploading", 90)
    metadata := store.BaseMetadata{
        FileSHA256: sha256,
        IPFSCID:    cid,
        FileName:   driveFile.Name,
        // ... map other Drive metadata
    }
    store.GlobalStore.AddFile(metadata)
    
    j.updateFileStatus(driveFile.Id, "completed", 100)
    return nil
}

UI Requirements

1. Google Drive Authorization Page

Location: /import/google-drive/authorize

Components:

  • "Connect to Google Drive" button
  • OAuth consent explanation
  • Permissions required list
  • Privacy policy link

2. Folder/File Selection Interface

Location: /import/google-drive/select

Features:

  • Tree view of Drive folders (collapsible)
  • File list view with checkboxes
  • Multi-select capability
  • File type icons
  • Size/date metadata display
  • Search/filter bar
  • "Select All" / "Deselect All" buttons
  • Preview import summary (X files, Y GB)
  • Import options:
    • Preserve folder hierarchy
    • Skip duplicates (by SHA256)
    • Include shared files
  • "Start Import" button

3. Import Status Dashboard

Location: /import/google-drive/status/{jobId}

Real-time Metrics:

╔════════════════════════════════════════════════════════╗
║  Import Progress                            [Cancel]  ║
╠════════════════════════════════════════════════════════╣
║  ████████████████░░░░░░░░░░  45% (45/100 files)      ║
║  ⬇ Downloading: document.pdf (2.5 MB/s)              ║
║  ⏱ ETA: 5 minutes                                     ║
║  📊 Status: 40 completed, 5 failed, 55 pending       ║
╚════════════════════════════════════════════════════════╝

Files:
┌─────────────────────────────────────────────────────┐
│ ✅ report.pdf         │ Completed  │ 2.3 MB │ 12:30 │
│ ⏳ presentation.pptx  │ Scanning   │ ████░░ │       │
│ ❌ large-video.mp4    │ Failed     │ Error: Too large│
│ ⏸ document.docx      │ Pending    │ 45 KB  │       │
└─────────────────────────────────────────────────────┘

[Retry Failed Files]  [View Details]

Detailed Per-File View:

  • File name with Drive icon
  • Progress bar for current file
  • Current stage (downloading/hashing/scanning/uploading)
  • Transfer speed
  • Success/error indicator
  • Retry button for failed files

4. Import History Page

Location: /import/google-drive/history

Display:

  • List of past import jobs
  • Job ID, start time, duration
  • Success/failure counts
  • Total data imported
  • "View Details" link to status page

5. Sync Configuration Page (Phase 3)

Location: /import/google-drive/sync

Settings:

  • Enable/disable continuous sync
  • Select Drive folder to sync
  • Sync interval (hourly, daily, etc.)
  • Conflict resolution strategy
  • Last sync timestamp
  • Manual "Sync Now" button

Monitoring & Observability

Metrics to Track

Job-Level Metrics:

  • gdrive_import_jobs_total{status="completed|failed|cancelled"}
  • gdrive_import_duration_seconds
  • gdrive_import_files_total{status="completed|failed"}
  • gdrive_import_bytes_total

File-Level Metrics:

  • gdrive_file_download_duration_seconds
  • gdrive_file_size_bytes{stage="downloaded|uploaded"}
  • gdrive_transfer_rate_bytes_per_second

API Metrics:

  • gdrive_api_requests_total{endpoint,status}
  • gdrive_api_errors_total{error_type}
  • gdrive_api_rate_limit_hits_total

Sync Metrics (Phase 3):

  • gdrive_sync_cycles_total{status}
  • gdrive_sync_new_files_detected
  • gdrive_sync_lag_seconds (time since last successful sync)

Logging Strategy

  • Structured JSON logs
  • Log levels: DEBUG, INFO, WARN, ERROR
  • Include job ID, user ID, file ID in all log entries
  • Detailed error logging with stack traces

Security Considerations

OAuth Security

  1. PKCE Flow - Use Proof Key for Code Exchange for additional security
  2. Token Encryption - Encrypt tokens at rest using AES-256
  3. Secure Storage - Store encrypted tokens in database or OS keyring
  4. Token Rotation - Implement automatic refresh token rotation
  5. Scope Minimization - Request only drive.readonly scope
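Token encryption at rest maps onto AES-256-GCM from the standard library; a sketch of the seal/open pair (key management, e.g. loading the key from the OS keyring, is out of scope here):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// encryptToken seals plaintext with AES-256-GCM; the random nonce is
// prepended to the ciphertext so decryptToken can recover it.
func encryptToken(key [32]byte, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key[:])
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decryptToken splits off the nonce and authenticates + decrypts the rest.
func decryptToken(key [32]byte, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key[:])
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, fmt.Errorf("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	var key [32]byte // in production, load from keyring/KMS; never hardcode
	sealed, _ := encryptToken(key, []byte("example-access-token"))
	plain, _ := decryptToken(key, sealed)
	fmt.Println(string(plain))
}
```

GCM authenticates as well as encrypts, so a tampered ciphertext or a wrong key fails loudly at decrypt time instead of yielding garbage tokens.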

API Security

  1. Rate Limiting - Respect Google Drive API quotas (per-user limits)
  2. Authentication Required - All endpoints require valid user session
  3. Input Validation - Validate all Drive file IDs, folder paths
  4. CORS - Restrict to localhost and configured domains

File Security

  1. Leverage Existing Scanning - All imported files go through security checks
  2. Size Limits - Enforce max file size limits
  3. Type Validation - Respect PinShare's allowed file types
  4. Malware Scanning - VirusTotal/ClamAV on all imports

Privacy

  1. User Data Isolation - Each user only sees their own imports
  2. Token Revocation - Support complete data deletion
  3. Audit Logging - Log all import operations

Testing Strategy

Unit Tests

  • Google Drive client mocking
  • OAuth flow state machine
  • Job queue operations
  • Metadata mapping accuracy

Integration Tests

  • End-to-end import flow with test files
  • OAuth callback handling
  • Database persistence
  • Worker pool concurrency

Load Tests

  • Import 1,000 files concurrently
  • Test with 10+ concurrent import jobs
  • Measure memory usage and performance
  • Test rate limit handling

Security Tests

  • OAuth PKCE flow validation
  • Token encryption/decryption
  • Unauthorized access attempts
  • Input validation edge cases

Implementation Phases

Phase 1: OAuth + Basic Import (MVP)

Goal: Import files manually from Google Drive

Deliverables:

  • User authentication system
  • Google Drive OAuth integration
  • Drive folder/file browser UI
  • Basic import job queue
  • Simple progress tracking
  • Integration with existing upload pipeline

Estimated Effort: 2-3 weeks

Phase 2: Enhanced Monitoring

Goal: Comprehensive import monitoring

Deliverables:

  • Per-file status tracking
  • Real-time progress updates (WebSocket)
  • Bandwidth/speed metrics
  • Error tracking and retry mechanism
  • Import history dashboard
  • Prometheus metrics

Estimated Effort: 1-2 weeks

Phase 3: Continuous Sync

Goal: Auto-sync Drive changes

Deliverables:

  • Drive change detection (polling)
  • Sync scheduler
  • Sync configuration UI
  • Conflict resolution
  • Sync pause/resume

Estimated Effort: 2-3 weeks

Phase 4: Performance & Polish

Goal: Production-ready reliability

Deliverables:

  • Performance optimization
  • Webhook support (vs polling)
  • Advanced filtering options
  • Batch operations
  • Comprehensive documentation

Estimated Effort: 1-2 weeks


Dependencies

External Services

  • Google Cloud Project - OAuth credentials, API enablement
  • Google Drive API v3 - File access
  • Database - PostgreSQL or SQLite for job/user persistence

Go Packages

require (
    google.golang.org/api v0.XXX
    golang.org/x/oauth2 v0.XXX
    github.com/lib/pq v1.XXX // PostgreSQL driver
    github.com/google/uuid v1.XXX // Job IDs
)

Configuration

# config.yaml additions
google_drive:
  oauth:
    client_id: "${GOOGLE_OAUTH_CLIENT_ID}"
    client_secret: "${GOOGLE_OAUTH_CLIENT_SECRET}"
    redirect_url: "http://localhost:9090/api/google-drive/callback"
    scopes:
      - "https://www.googleapis.com/auth/drive.readonly"
  
  import:
    max_concurrent_downloads: 5
    max_file_size_mb: 1024
    temp_download_dir: "./tmp/gdrive"
    
  rate_limiting:
    requests_per_second: 10
    burst: 20

Success Metrics

User Adoption

  • Number of users connecting Google Drive
  • Total files imported
  • Active sync configurations

Performance

  • Average import speed (files/minute, MB/s)
  • P95 latency for import operations
  • Error rate < 1%

Reliability

  • Job success rate > 99%
  • Retry success rate
  • Sync lag < 5 minutes (Phase 3)

Future Enhancements

Beyond Initial Implementation

  • Dropbox Integration - Apply same pattern to Dropbox
  • OneDrive Support - Microsoft OneDrive import
  • S3 Import - AWS S3 bucket import
  • Selective Export - PinShare → Google Drive
  • Smart Deduplication - Cross-user file deduplication
  • Bandwidth Scheduling - Import during off-peak hours
  • Multi-folder Sync - Sync multiple Drive folders simultaneously

Priority: High
Complexity: High
Impact: High - Unlocks PinShare for users with existing cloud storage
