Yoink ⚡

A high-performance, deduplicating backup tool built in Go for blazingly fast and efficient incremental backups.

What's different?

Yoink is a proof-of-concept, high-performance backup tool designed to be incredibly fast and efficient by leveraging advanced techniques to avoid backing up redundant data.

Traditional backups waste space by re-copying entire files. Yoink instead uses Content-Defined Chunking (CDC) to find duplicate data inside your files. This means your 2nd, 3rd, and 10th backups are incredibly small and fast, storing only what has truly changed.

This project is built entirely in Go to take full advantage of its powerful concurrency, so both backup and restore operations can fully utilize your CPU and disk I/O for maximum speed.

Core Technologies

  • Content-Defined Chunking (CDC): Instead of splitting files at fixed 4MB intervals, Yoink uses a rolling hash (via the restic/chunker library) to find "natural" cut-points in your data (see the sketch after this list).
    • Why? If you insert one byte at the beginning of a 10GB file:
      • Traditional: Every single subsequent chunk changes. The entire 10GB file is re-uploaded.
      • Yoink (CDC): Only the first chunk changes. The other 99.9% of the file's data remains in identical chunks that are already in the repository and are not re-uploaded.
  • Content-Addressable Storage (CAS): Every piece of data (a "chunk") and metadata (a "tree") is stored in a file named after its own SHA-256 hash. This provides "free" and automatic global deduplication. If two files share the same chunk, it's only stored once.
  • High-Concurrency: The entire backup and restore process is heavily parallelized using goroutines and errgroup to process many files and directories at the same time.
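
To make the CDC bullet concrete, here is a minimal, standalone sketch of splitting a stream into variable-sized chunks with the restic/chunker library. The sample data and the printed output are purely illustrative; Yoink's actual wiring may differ:

package main

import (
    "bytes"
    "crypto/sha256"
    "fmt"
    "io"

    "github.com/restic/chunker"
)

func main() {
    // ~24MB of repetitive sample data, purely for demonstration.
    data := bytes.Repeat([]byte("yoink!"), 4<<20)

    // A repository would generate the polynomial once and reuse it, so that
    // identical data always produces identical chunk boundaries.
    pol, err := chunker.RandomPolynomial()
    if err != nil {
        panic(err)
    }

    c := chunker.New(bytes.NewReader(data), pol)
    buf := make([]byte, chunker.MaxSize)

    for {
        chunk, err := c.Next(buf)
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }
        // In a content-addressable store, each chunk would be saved under its SHA-256 hash.
        sum := sha256.Sum256(chunk.Data)
        fmt.Printf("chunk %x... length=%d bytes\n", sum[:6], chunk.Length)
    }
}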

🚀 Features

  • Blazing Fast Performance: Fully parallel backup and restore operations that can saturate multi-core CPUs and high-speed SSDs.
  • Efficient Deduplication: Uses CDC to deduplicate data at the sub-file level.
  • Snapshot-Based: Backups are atomic, point-in-time "snapshots." You can restore your filesystem to exactly how it looked at any backup time.
  • Simple & Clean CLI: A modern CLI interface built with cobra.
  • Bit-for-Bit Correctness: Restores are verified end-to-end to be bit-for-bit identical to the original source.

πŸ› οΈ Setup & Usage

Prerequisites

  • Go (1.24 or later)
  • Git

Installation & Build

# 1. Clone the repository
git clone https://github.com/sumitst05/yoink.git
cd yoink

# 2. Build the binary
go build .

# 3. (Optional) Install the binary to your $PATH
# This will build and place the binary in your $GOPATH/bin
go install .

Usage

All backup data is stored in your user's config directory, as determined by your operating system (see the sketch after this list):

  • Linux: ~/.config/yoink/data
  • macOS: ~/Library/Application Support/yoink/data
  • Windows: C:\Users\<UserName>\AppData\Roaming\yoink\data
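
Yoink presumably derives this location from Go's standard os.UserConfigDir, which resolves to exactly the per-OS directories listed above. A minimal sketch:

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    // os.UserConfigDir resolves to ~/.config on Linux, ~/Library/Application Support
    // on macOS, and %AppData% on Windows.
    cfg, err := os.UserConfigDir()
    if err != nil {
        panic(err)
    }
    fmt.Println(filepath.Join(cfg, "yoink", "data")) // e.g. ~/.config/yoink/data on Linux
}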

To backup a directory or file:

# Usage: yoink backup [source_path]
yoink backup /home/user/Documents

To restore a snapshot:

# Usage: yoink restore [destination_path]
yoink restore /home/user/restore-location

# The tool will then prompt you to select a snapshot:
#
# Available snapshots:
# [1] 2025-11-04 14:29:13  /home/user/Documents
# [2] 2025-11-04 14:35:01  /home/user/Videos/vids
# Enter snapshot number to restore: 2

πŸ—οΈ Design and Architecture

1. Repository

All data is stored in a content-addressable "key-value" store, where the "key" is the SHA-256 hash of the "value". To avoid putting millions of files in one folder, the first 2 characters of the hash are used as a subdirectory (see the sketch after the layout below).

~/.config/yoink/data/
β”œβ”€β”€ chunks/             # Stores all unique, raw data chunks
β”‚   β”œβ”€β”€ 0a/
β”‚   β”‚   └── 0a1b2c...
β”‚   β”œβ”€β”€ f2/
β”‚   β”‚   └── f293ab...
β”‚   └── ...
β”œβ”€β”€ metadata/           # Stores all unique metadata objects (Trees & Manifests)
β”‚   β”œβ”€β”€ 4a/
β”‚   β”‚   └── 4a52cc...
β”‚   β”œβ”€β”€ d1/
β”‚   β”‚   └── d122ab...
β”‚   └── ...
β”œβ”€β”€ snapshots/          # Human-readable entry points for restores (timestamp as name)
β”‚   β”œβ”€β”€ 2025-11-04_14-29-13.json
β”‚   └── 2025-11-04_14-35-01.json
└── temp/               # For concurrency-safe atomic file writes
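
As a rough illustration of this layout, storing a blob might look like the sketch below. The code in this and the following sketches is written as fragments of one illustrative package; saveBlob is a hypothetical helper, not the repository's actual code, and a production version would write into temp/ and rename for atomicity, as the layout above suggests:

package yoink

import (
    "os"
    "path/filepath"
)

// saveBlob stores data at <repo>/<kind>/<first two hash characters>/<full hash>.
// If that path already exists, the blob is already in the repository and the
// write is skipped; this is where deduplication comes for free.
func saveBlob(repo, kind, id string, data []byte) error {
    p := filepath.Join(repo, kind, id[:2], id)
    if _, err := os.Stat(p); err == nil {
        return nil // blob already stored
    }
    if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
        return err
    }
    return os.WriteFile(p, data, 0o644)
}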

2. Core Data Structures

This is the hierarchy that enables rebuilding a backed-up directory tree (a Go sketch of these objects follows the examples below).

  • Snapshot (snapshots/*.json): The top-level entry point. It's a simple JSON file that points to the root Tree object of a specific backup.
{
  "time": "2025-11-04T14:35:01Z",
  "source_path": "/home/user/Videos/vids",
  "root_tree_id": "d122ab..."
}
  • Tree (metadata/[hash]): Represents a single directory. It's a map of filenames to their metadata, pointing to either a FileManifest (for files) or another Tree (for subdirectories).
{
  "type": "tree",
  "entries": {
    "video1.mp4": {
      "type": "file",
      "mode": 420,
      "mod_time": "...",
      "metadata_id": "4a52cc..."
    },
    "archive/": {
      "type": "tree",
      "mode": 493,
      "mod_time": "...",
      "metadata_id": "b789a0..."
    }
  }
}
  • FileManifest (metadata/[hash]): Represents a single file. It's an ordered list of the chunk hashes that, when combined, rebuild the original file.
{
  "type": "file",
  "chunks": ["f293ab...", "e571cd...", "8921ef..."]
}
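
In Go, these objects might be modeled roughly as follows. The field names and types are inferred from the JSON examples above and are assumptions, not necessarily the repository's exact definitions:

package yoink

import "time"

// Snapshot is the top-level entry point stored in snapshots/*.json.
type Snapshot struct {
    Time       time.Time `json:"time"`
    SourcePath string    `json:"source_path"`
    RootTreeID string    `json:"root_tree_id"`
}

// TreeEntry describes one child of a directory: a file pointing to a
// FileManifest, or a subdirectory pointing to another Tree.
type TreeEntry struct {
    Type       string    `json:"type"` // "file" or "tree"
    Mode       uint32    `json:"mode"`
    ModTime    time.Time `json:"mod_time"`
    MetadataID string    `json:"metadata_id"`
}

// Tree represents a single directory.
type Tree struct {
    Type    string               `json:"type"` // always "tree"
    Entries map[string]TreeEntry `json:"entries"`
}

// FileManifest is the ordered list of chunk hashes that rebuild one file.
type FileManifest struct {
    Type   string   `json:"type"` // always "file"
    Chunks []string `json:"chunks"`
}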

3. Backup Workflow

The backup process is a recursive, bottom-up walk of the filesystem. To build a Tree object for a directory, it must first know the hashes of all its children.

[Diagram: backup workflow]
  1. File: When walkNode hits a file, it calls processFile (a Go sketch of this step follows the list).
     • The file is streamed into the chunker.
     • The chunker produces variable-sized chunks.
     • Each chunk is hashed and saved to chunks/ (e.g., f293ab...).
     • A FileManifest is created with the list of chunk hashes.
     • The FileManifest is marshaled to JSON, hashed, and saved to metadata/ (e.g., 4a52cc...).
     • This final manifest hash (4a52cc...) is returned.
  2. Directory: When walkNode hits a directory, it waits for all of its children to be processed (in parallel).
     • It collects all the returned hashes (e.g., "video1.mp4" -> "4a52cc...").
     • It assembles these into a single Tree object.
     • This Tree is marshaled to JSON, hashed, and saved to metadata/.
     • This final Tree hash is returned to its parent.
  3. Finish: This continues all the way up to the root, producing a single root_tree_id, which is then saved in a new Snapshot.
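
A simplified Go sketch of the per-file step, reusing the saveBlob helper and FileManifest type from the earlier sketches. The function name processFile matches the description above, but the body is an illustration, not the repository's implementation:

package yoink

import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "io"
    "os"

    "github.com/restic/chunker"
)

// processFile chunks one file, stores every chunk and the resulting
// FileManifest in the repository, and returns the manifest's hash.
func processFile(repo, path string, pol chunker.Pol) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()

    c := chunker.New(f, pol)
    buf := make([]byte, chunker.MaxSize)
    manifest := FileManifest{Type: "file"}

    // Stream the file through the chunker and store each chunk under its hash.
    for {
        chunk, err := c.Next(buf)
        if err == io.EOF {
            break
        }
        if err != nil {
            return "", err
        }
        sum := sha256.Sum256(chunk.Data)
        id := hex.EncodeToString(sum[:])
        if err := saveBlob(repo, "chunks", id, chunk.Data); err != nil {
            return "", err
        }
        manifest.Chunks = append(manifest.Chunks, id)
    }

    // Store the manifest itself as a metadata object and return its hash.
    raw, err := json.Marshal(manifest)
    if err != nil {
        return "", err
    }
    sum := sha256.Sum256(raw)
    id := hex.EncodeToString(sum[:])
    return id, saveBlob(repo, "metadata", id, raw)
}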

4. Restore Workflow

The restore process is the direct inverse: a recursive, top-down walk.

[Diagram: restore workflow]
  1. The user picks a Snapshot, giving us the root_tree_id.

  2. restoreNode loads the root Tree object from metadata/.

  3. For each entry in the Tree:
     • If type == "tree": It calls os.Mkdir to create the directory, then recurses into restoreNode with the child's metadata_id.
     • If type == "file": It loads the FileManifest, calls os.Create to make the new file, and then loops through the chunks list. For each chunk hash, it reads from chunks/ and writes the data to the file, in order (a sketch of this step follows the list).
  4. Finally, os.Chmod sets the correct permissions.
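
A matching sketch for the file branch of the restore, reusing the path layout and FileManifest type from the earlier sketches. restoreFile and loadBlob are hypothetical names:

package yoink

import (
    "encoding/json"
    "os"
    "path/filepath"
)

// loadBlob is the read-side mirror of the saveBlob sketch from the Repository section.
func loadBlob(repo, kind, id string) ([]byte, error) {
    return os.ReadFile(filepath.Join(repo, kind, id[:2], id))
}

// restoreFile rebuilds one file from its FileManifest by writing its chunks
// back in order.
func restoreFile(repo, manifestID, dest string) error {
    raw, err := loadBlob(repo, "metadata", manifestID)
    if err != nil {
        return err
    }
    var m FileManifest
    if err := json.Unmarshal(raw, &m); err != nil {
        return err
    }

    out, err := os.Create(dest)
    if err != nil {
        return err
    }
    defer out.Close()

    for _, id := range m.Chunks {
        data, err := loadBlob(repo, "chunks", id)
        if err != nil {
            return err
        }
        if _, err := out.Write(data); err != nil {
            return err
        }
    }
    return nil
}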

⚡ Concurrency with Goroutines

The serial version would be slow and bottlenecked by file I/O. Yoink uses golang.org/x/sync/errgroup to parallelize this "walk" for both backup and restore.

  • When walkNode (or restoreNode) enters a directory with 100 files, it does not process them one at a time (a minimal sketch of this pattern follows the list).
  • Instead, it launches 100 separate goroutines using g.Go(func() ...).
  • Each goroutine processes its own file or subdirectory independently and concurrently.
  • errgroup automatically manages all of this:
    • Concurrency: It handles running many goroutines at once.
    • Error Handling: The first goroutine to return an error (e.g., "file not readable") will immediately signal all other goroutines in that group to stop.
    • Cancellation: This is done via the context.Context that is passed down, which all goroutines check.
  • Result: During a backup, 16 files might be getting chunked, 8 files might be getting hashed, and 12 files might be getting saved to disk, all at the exact same time. This allows Yoink to saturate all available CPU cores and I/O bandwidth, resulting in benchmarked 500%+ CPU usage and incredible speed.
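
The fan-out pattern looks roughly like the sketch below; walkChildren and processEntry are stand-ins for the real walkNode/restoreNode internals:

package yoink

import (
    "context"

    "golang.org/x/sync/errgroup"
)

// walkChildren processes every entry of a directory concurrently. The first
// error cancels the group's context, which stops the remaining goroutines.
func walkChildren(ctx context.Context, entries []string,
    processEntry func(context.Context, string) error) error {

    g, ctx := errgroup.WithContext(ctx)
    for _, entry := range entries {
        // With Go 1.22+ each iteration gets its own entry variable, so
        // capturing it in the closure below is safe.
        g.Go(func() error {
            // Bail out early if a sibling goroutine has already failed.
            if err := ctx.Err(); err != nil {
                return err
            }
            return processEntry(ctx, entry)
        })
    }
    return g.Wait()
}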

🚀 Performance & Verification

Here is a test of Yoink backing up a 2.4GB directory of video files and restoring them.

1. The Source Data

First, here is the ls -lh and du -h of the source directory, ~/Videos/vids. It contains 30 video files, totaling 2.4GB.

[Screenshot: ls -lh and du -h of the source directory]

2. The Parallel Backup

Next, we run the time bin/yoink backup command.

[Screenshot: timed backup run]

Look at those benchmark numbers:

  • Total Time: The entire 2.4GB directory was backed up in just 4.271 seconds.
  • CPU Usage: 589% cpu. This is the most important number. We are saturating almost 6 full CPU cores to hash and chunk multiple files at once.

3. The Parallel Restore

Now, we restore that 2.4GB of data to a new, empty directory.

[Screenshot: timed restore run]

The restore is also incredibly fast:

  • Total Time: The entire 2.4GB directory was restored in 4.147 seconds.
  • CPU Usage: 110% cpu. This shows the restore is also running in parallel, reading multiple chunks and writing multiple files at the same time. The restore is I/O-bound (limited by disk speed), so these numbers indicate the disk, not the CPU, is the limiting factor.

4. The Golden Test (Verification)

Finally, we run a diff between the original source directory and the newly restored one.

[Screenshot: diff of source and restored directories]

No output.

The restored data is bit-for-bit identical to the original. This proves the entire end-to-end pipeline (chunking, hashing, saving, loading, and rebuilding) is 100% correct.


💡 Future Roadmap

  • Client-Side Encryption: Encrypting all chunks and metadata with AES-GCM before they are saved, using a password-derived key.
  • Compression: Using zstd to compress chunks on the fly for even greater space savings.
  • Pruning & Garbage Collection: A yoink prune command to safely remove old snapshots and delete any chunks they (and only they) used.
  • Cloud Storage Backends: Abstracting the repository to an interface to support saving to Google Drive, Dropbox, and S3.
  • GUI: A GTK frontend for the application.

Contributing

Pull requests for bug fixes or new features (especially those on the roadmap!) are always welcome.
