A high-performance, deduplicating backup tool built in Go for blazingly fast and efficient incremental backups.
Yoink is a proof-of-concept, high-performance backup tool designed to be incredibly fast and efficient by leveraging advanced techniques to avoid backing up redundant data.
Traditional backups waste space by re-copying entire files; Yoink uses Content-Defined Chunking (CDC) to find duplicate data inside your files. This means your 2nd, 3rd, and 10th backups are incredibly small and fast, storing only what has truly changed.
This project is built entirely in Go to take full advantage of its powerful concurrency, so that both backup and restore operations make full use of your CPU and disk I/O for maximum speed.
- Content-Defined Chunking (CDC): Instead of splitting files at fixed 4MB intervals, Yoink uses a rolling hash (via the `restic/chunker` library) to find "natural" cut-points in your data (a minimal chunking sketch appears after the feature list below).
  - Why? If you insert one byte at the beginning of a 10GB file:
    - Traditional: Every single subsequent chunk changes. The entire 10GB file is re-uploaded.
    - Yoink (CDC): Only the first chunk changes. The other 99.9% of the file's data remains in identical chunks that are already in the repository and are not re-uploaded.
- Content-Addressable Storage (CAS): Every piece of data (a "chunk") and metadata (a "tree") is stored in a file named after its own SHA-256 hash. This provides "free" and automatic global deduplication. If two files share the same chunk, it's only stored once.
- High-Concurrency: The entire backup and restore process is heavily parallelized using goroutines and `errgroup` to process many files and directories at the same time.
- Blazing Fast Performance: Fully parallel backup and restore operations that can saturate multi-core CPUs and high-speed SSDs.
- Efficient Deduplication: Uses CDC to deduplicate data at the sub-file level.
- Snapshot-Based: Backups are atomic, point-in-time "snapshots." You can restore your filesystem to exactly how it looked at any backup time.
- Simple & Clean CLI: A modern command-line interface built with `cobra`.
- Bit-for-Bit Correctness: Verified end-to-end restores are bit-for-bit identical to the original source.
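To make the chunking idea concrete, here is a minimal, hypothetical sketch (not Yoink's actual code) of splitting a file into content-defined chunks with the `restic/chunker` library. The file name and the 8 MiB buffer are illustrative; a real repository would generate the polynomial once and reuse it so that identical data always cuts at the same points.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"

	"github.com/restic/chunker"
)

func main() {
	// Hypothetical input file; any large file will do.
	f, err := os.Open("big-file.bin")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// A real repository generates this polynomial once and reuses it,
	// so identical data always produces identical cut-points.
	pol, err := chunker.RandomPolynomial()
	if err != nil {
		panic(err)
	}

	c := chunker.New(f, pol)
	buf := make([]byte, 8*1024*1024) // 8 MiB: the chunker's maximum chunk size

	for {
		chunk, err := c.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// Each chunk is identified by the SHA-256 of its contents.
		fmt.Printf("chunk %x  length=%d\n", sha256.Sum256(chunk.Data), chunk.Length)
	}
}
```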
- Go (1.24 or later)
- Git
# 1. Clone the repository
git clone https://github.com/sumitst05/yoink.git
cd yoink
# 2. Build the binary
go build .
# 3. (Optional) Install the binary to your $PATH
# This will build and place the binary in your $GOPATH/bin
go install .

All backup data is stored in your user's config directory, determined by your operating system:
- Linux: `~/.config/yoink/data`
- macOS: `~/Library/Application Support/yoink/data`
- Windows: `C:\Users\<UserName>\AppData\Roaming\yoink\data`
To back up a directory or file:
# Usage: yoink backup [source_path]
yoink backup /home/user/Documents

To restore a snapshot:
# Usage: yoink restore [destination_path]
yoink restore /home/user/restore-location
# The tool will then prompt you to select a snapshot:
#
# Available snapshots:
# [1] 2025-11-04 14:29:13 /home/user/Documents
# [2] 2025-11-04 14:35:01 /home/user/Videos/vids
# Enter snapshot number to restore: 2

All data is stored in a content-addressable "key-value" store, where the "key" is the SHA-256 hash of the "value". To avoid putting millions of files in one folder, the first 2 characters of the hash are used as a subdirectory.
~/.config/yoink/data/
├── chunks/      # Stores all unique, raw data chunks
│   ├── 0a/
│   │   └── 0a1b2c...
│   ├── f2/
│   │   └── f293ab...
│   └── ...
├── metadata/    # Stores all unique metadata objects (Trees & Manifests)
│   ├── 4a/
│   │   └── 4a52cc...
│   ├── d1/
│   │   └── d122ab...
│   └── ...
├── snapshots/   # Human-readable entry points for restores (timestamp as name)
│   ├── 2025-11-04_14-29-13.json
│   └── 2025-11-04_14-35-01.json
└── temp/        # For concurrency-safe atomic file writes
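To illustrate how such a store can be written safely, here is a hypothetical helper (not Yoink's actual code) that hashes a blob, uses the first two hex characters of the hash as a subdirectory, and writes through the `temp/` directory so the final rename is atomic. The `saveBlob` name and layout are assumptions based on the directory structure above.

```go
package repo

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

// saveBlob stores data under <root>/<kind>/<first two hex chars>/<full hash>
// and returns the hash. kind would be "chunks" or "metadata". Writes go
// through <root>/temp so a crash never leaves a half-written object behind.
func saveBlob(root, kind string, data []byte) (string, error) {
	sum := sha256.Sum256(data)
	id := hex.EncodeToString(sum[:])

	dir := filepath.Join(root, kind, id[:2])
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	final := filepath.Join(dir, id)
	if _, err := os.Stat(final); err == nil {
		return id, nil // identical content already stored: deduplication for free
	}

	tmpDir := filepath.Join(root, "temp")
	if err := os.MkdirAll(tmpDir, 0o755); err != nil {
		return "", err
	}
	tmp, err := os.CreateTemp(tmpDir, "blob-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return "", err
	}
	if err := tmp.Close(); err != nil {
		return "", err
	}
	// Rename is atomic on the same filesystem, so readers never see partial blobs.
	return id, os.Rename(tmp.Name(), final)
}
```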
This is the hierarchy that makes it possible to rebuild a directory from a snapshot.

- Snapshot (`snapshots/*.json`): The top-level entry point. It's a simple JSON file that points to the root Tree object of a specific backup.
{
"time": "2025-11-04T14:35:01Z",
"source_path": "/home/user/Videos/vids",
"root_tree_id": "d122ab..."
}

- Tree (`metadata/[hash]`): Represents a single directory. It's a map of filenames to their metadata, pointing to either a FileManifest (for files) or another Tree (for subdirectories).
{
"type": "tree",
"entries": {
"video1.mp4": {
"type": "file",
"mode": 420,
"mod_time": "...",
"metadata_id": "4a52cc..."
},
"archive/": {
"type": "tree",
"mode": 493,
"mod_time": "...",
"metadata_id": "b789a0..."
}
}
}

- FileManifest (`metadata/[hash]`): Represents a single file. It's an ordered list of the chunk hashes that, when combined, rebuild the original file.
{
"type": "file",
"chunks": ["f293ab...", "e571cd...", "8921ef..."]
}
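In Go, these three JSON documents map naturally onto small structs. The definitions below are an illustrative sketch; the field names are inferred from the JSON above and may differ from Yoink's actual types.

```go
package model

import "time"

// Snapshot is the human-readable entry point for one backup run.
type Snapshot struct {
	Time       time.Time `json:"time"`
	SourcePath string    `json:"source_path"`
	RootTreeID string    `json:"root_tree_id"`
}

// Tree describes one directory: a map from entry name to its metadata.
type Tree struct {
	Type    string           `json:"type"` // always "tree"
	Entries map[string]Entry `json:"entries"`
}

// Entry points at either a FileManifest ("file") or another Tree ("tree").
type Entry struct {
	Type       string    `json:"type"`
	Mode       uint32    `json:"mode"`
	ModTime    time.Time `json:"mod_time"`
	MetadataID string    `json:"metadata_id"`
}

// FileManifest lists, in order, the chunk hashes that rebuild one file.
type FileManifest struct {
	Type   string   `json:"type"` // always "file"
	Chunks []string `json:"chunks"`
}
```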
The backup process is a recursive, bottom-up walk of the filesystem. To build a Tree object for a directory, it must first know the hashes of all its children.

- File: When `walkNode` hits a file, it calls `processFile`:
- The file is streamed into the chunker.
- The chunker produces variable-sized chunks.
- Each chunk is hashed and saved to chunks/ (e.g., f293ab...).
- A FileManifest is created with the list of chunk hashes.
- The FileManifest is marshaled to JSON, hashed, and saved to metadata/ (e.g., 4a52cc...).
- This final manifest hash (4a52cc...) is returned.
- Directory: When `walkNode` hits a directory, it waits for all of its children to be processed (in parallel).
- It collects all the returned hashes (e.g., "video1.mp4" -> "4a52cc...").
- It assembles these into a single Tree object.
- This Tree is marshaled to JSON, hashed, and saved to metadata/.
- This final Tree hash is returned to its parent.
- Finish: This continues all the way up to the root, producing a single `root_tree_id`, which is then saved in a new Snapshot.
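Putting the pieces together, a simplified `processFile` might look like the sketch below. It is hypothetical: it reuses the `saveBlob` helper and `FileManifest` type from the earlier sketches and assumes `encoding/json`, `io`, `os`, and `github.com/restic/chunker` imports; Yoink's real implementation may differ.

```go
// processFile chunks one file, stores each chunk, then stores the manifest.
// It returns the manifest's hash so the parent directory can reference it.
func processFile(root, path string, pol chunker.Pol) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	manifest := FileManifest{Type: "file"}

	chnkr := chunker.New(f, pol)
	buf := make([]byte, 8*1024*1024) // 8 MiB: the chunker's maximum chunk size

	for {
		c, err := chnkr.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		// Each chunk is content-addressed and stored under chunks/.
		id, err := saveBlob(root, "chunks", c.Data)
		if err != nil {
			return "", err
		}
		manifest.Chunks = append(manifest.Chunks, id)
	}

	// The manifest itself is content-addressed, just like the chunks.
	encoded, err := json.Marshal(manifest)
	if err != nil {
		return "", err
	}
	return saveBlob(root, "metadata", encoded)
}
```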
The restore process is the direct inverse: a recursive, top-down walk.
- The user picks a Snapshot, giving us the `root_tree_id`.
- `restoreNode` loads the root Tree object from `metadata/`.
- For each entry in the Tree:
  - If `type` == "tree": it calls `os.Mkdir` to create the directory, then recurses into `restoreNode` with the child's `metadata_id`.
  - If `type` == "file": it loads the FileManifest, calls `os.Create` to make the new file, and then loops through the chunks list. For each chunk hash, it reads from `chunks/` and writes the data to the file, in order.
- Finally, `os.Chmod` sets the correct permissions.
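As a sketch of the file-rebuilding step (again hypothetical, reusing the `FileManifest` type from above and assuming `os` and `path/filepath` imports), restoring one file is essentially a loop that concatenates its chunks in manifest order:

```go
// restoreFile rebuilds one file by concatenating its chunks in manifest order.
func restoreFile(root, dst string, m FileManifest, mode os.FileMode) error {
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	for _, id := range m.Chunks {
		data, err := loadBlob(root, "chunks", id)
		if err != nil {
			return err
		}
		if _, err := out.Write(data); err != nil {
			return err
		}
	}
	// Permissions are applied last, after the contents are complete.
	return os.Chmod(dst, mode)
}

// loadBlob is the read-side counterpart of saveBlob: it returns the bytes
// stored under <root>/<kind>/<id[:2]>/<id>.
func loadBlob(root, kind, id string) ([]byte, error) {
	return os.ReadFile(filepath.Join(root, kind, id[:2], id))
}
```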
A serial version of this walk would be slow and bottlenecked by file I/O. Yoink uses `golang.org/x/sync/errgroup` to parallelize the walk for both backup and restore.
- When `walkNode` (or `restoreNode`) enters a directory with 100 files, it does not process them one by one.
- Instead, it launches 100 separate goroutines using `g.Go(func() ...)`.
- Each goroutine processes its own file or subdirectory independently and concurrently.
`errgroup` automatically manages all of this:
- Concurrency: It handles running many goroutines at once.
- Error Handling: The first goroutine to return an error (e.g., "file not readable") will immediately signal all other goroutines in that group to stop.
- Cancellation: This is done via the context.Context that is passed down, which all goroutines check.
- Result: During a backup, 16 files might be getting chunked, 8 files might be getting hashed, and 12 files might be getting saved to disk, all at the exact same time. This allows Yoink to saturate all available CPU cores and I/O bandwidth, resulting in benchmarked 500%+ CPU usage and incredible speed.
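Here is a minimal, self-contained sketch of that pattern with `errgroup` (simplified, not Yoink's actual `walkNode`): the first error cancels the shared context, and `Wait` reports it.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sync/errgroup"
)

// walkDir processes every entry of dir concurrently. The first error cancels
// the shared context, which stops the remaining goroutines early.
func walkDir(ctx context.Context, dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}

	g, ctx := errgroup.WithContext(ctx)
	for _, e := range entries {
		path := filepath.Join(dir, e.Name())
		isDir := e.IsDir()
		g.Go(func() error {
			if err := ctx.Err(); err != nil {
				return err // another goroutine already failed; stop early
			}
			if isDir {
				return walkDir(ctx, path) // recurse concurrently
			}
			// A real backup would chunk and store the file here.
			fmt.Println("processing", path)
			return nil
		})
	}
	// Wait returns the first non-nil error from any goroutine.
	return g.Wait()
}

func main() {
	if err := walkDir(context.Background(), "."); err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		os.Exit(1)
	}
}
```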
Here is a test of Yoink backing up a 2.4GB directory of video files and restoring them.
First, here is the ls -lh and du -h of the source directory, ~/Videos/vids. It contains 30 video files, totaling 2.4GB.
Next, we run the time bin/yoink backup command.
Look at those benchmark numbers:
- Total Time: The entire 2.4GB directory was backed up in just 4.271 seconds.
- CPU Usage: `589% cpu`. This is the most important number. We are saturating almost 6 full CPU cores to hash and chunk multiple files at once.
Now, we restore that 2.4GB of data to a new, empty directory.
The restore is also incredibly fast:
- Total Time: The entire 2.4GB directory was restored in 4.147 seconds.
- CPU Usage: `110% cpu`. This shows the restore is also running in parallel, reading multiple chunks and writing multiple files at the same time. The restore is I/O-bound (limited by disk speed), and these results mean we are successfully maxing out the disk's capability.
Finally, we run a diff between the original source directory and the newly restored one.
No output.
The restored data is bit-for-bit identical to the original. This proves the entire end-to-end pipeline (chunking, hashing, saving, loading, and rebuilding) is 100% correct.
- Client-Side Encryption: Encrypting all chunks and metadata with AES-GCM before they are saved, using a password-derived key.
- Compression: Using zstd to compress chunks on the fly for even greater space savings.
- Pruning & Garbage Collection: A `yoink prune` command to safely remove old snapshots and delete the chunks that only they referenced.
- Cloud Storage Backends: Abstracting the repository to an interface to support saving to Google Drive, Dropbox, and S3.
- GUI: A GTK frontend for the application.
Pull requests for bug fixes or new features (especially those on the roadmap!) are always welcome.