https://blobbackup.com/10_gb_benchmark.php
Thought this might be of interest here.
Are you the developer of BlobBackup?
Yes I am. Was an Arq user for a bit before I started working on it.
Thanks for making this and sharing it with us. What compression algorithm does BlobBackup use?
Zstandard. It looks like Duplicacy uses LZ4 (which would explain the speed/size of the backups)? I was one of the maintainers of zstd on GitHub for a while, and in my experience the size benefits of zstd almost always outweigh the speed gain of lz4 (especially since zstd's negative levels are almost as fast).
Have you thought about using another algorithm for Duplicacy? There is a nice comparison here btw: GitHub - inikep/lzbench: lzbench is an in-memory benchmark of open-source LZ77/LZSS/LZMA compressors
I'd heard about zstd before. I think one of its main advantages, compared to many other algorithms, is that adjusting the compression level gives a real choice between compression time and space.
LZ4 is still faster at lower compression levels. IMO, it’s more important to have fast backups than to eke out a little bit of (local) storage space or worry too much about restore speeds… However, zstd would be awesome to have as an additional algorithm - for secondary copies to cloud storages. (Or for the initial storage if you didn’t mind waiting a bit longer.) I’ve noticed a few projects out there that combine LZ4 and zstd and choose between them based on level and content type.
The advantages are obvious. A local storage could use LZ4 for fast local workstation backups, with Erasure Coding and perhaps RSA encryption protecting that copy. Copying this storage to a copy-compatible cloud-based storage could use zstd and do away with the Erasure Coding, saving even more space.
On a provider like B2, where you have to pay for downloads, a restore would be so much faster thanks to the compressed size. And you could adjust the compression levels to taste.
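Just to make the idea concrete, here is a rough Go sketch of picking a compressor per storage. The Compressor type and the storage names are invented for illustration; it assumes the DataDog/zstd cgo wrapper and the pierrec/lz4 package, which is not what Duplicacy actually uses today:

package main

import (
    "bytes"
    "fmt"
    "io"
    "log"

    "github.com/DataDog/zstd"
    "github.com/pierrec/lz4/v4"
)

// Compressor compresses one chunk before it is uploaded to a storage.
type Compressor func(chunk []byte) ([]byte, error)

// lz4Compress favours speed, for the fast local workstation storage.
func lz4Compress(chunk []byte) ([]byte, error) {
    var buf bytes.Buffer
    w := lz4.NewWriter(&buf)
    if _, err := io.Copy(w, bytes.NewReader(chunk)); err != nil {
        return nil, err
    }
    if err := w.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

// zstdCompress favours size, for the copy sent to cloud storage.
func zstdCompress(chunk []byte) ([]byte, error) {
    return zstd.CompressLevel(nil, chunk, 3) // 3 is zstd's own default level
}

// compressorFor picks an algorithm based on where the storage lives.
func compressorFor(storage string) Compressor {
    if storage == "local" {
        return lz4Compress
    }
    return zstdCompress
}

func main() {
    chunk := bytes.Repeat([]byte("some chunk data "), 1024)
    for _, storage := range []string{"local", "b2"} {
        out, err := compressorFor(storage)(chunk)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%-5s %d -> %d bytes\n", storage, len(chunk), len(out))
    }
}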
Very interesting!
Unfortunately it doesn’t seem zstd has a pure Go implementation, so it is unlikely to make it into Duplicacy.
@bimba while you’re here, do you mind explaining a little bit about the core algorithms of BlobBackup? For instance, is the chunking fixed-size or variable-size (a choice that affects backup speed), and how do you pack small files?
I spent some time looking through the documentation out of personal interest (originally saw a post somewhere on reddit about BlobBackup), but @bimba please correct me if I’m wrong about how it works.
For cloud storage, it seems like BlobBackup’s defaults (with smaller chunk size, and no packing of small files?) would result in more chunks and more API calls – relative to Duplicacy’s defaults.
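As a rough back-of-the-envelope example (assuming I have the defaults right): a 10 GB source at a 1 MB blob size is on the order of 10,000 blobs, versus roughly 2,500 chunks at Duplicacy’s 4 MB default average chunk size. If each object needs a lookup plus an upload, that is roughly 20,000 API calls versus 5,000, before any deduplication.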
Unfortunately it doesn’t seem zstd has a pure Go implementation, so it is unlikely to make it into Duplicacy.
Ahh that's too bad. I believe there are some wrappers for the reference C implementation though, no? GitHub - DataDog/zstd: Zstd wrapper for Go seems like an example of one. I haven't programmed in Go before so I'm not sure what the relevant factors are for using a library, but is it necessary that the library be pure Go?
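Looking at its README, the wrapper seems to expose a small byte-slice API. Here is a minimal sketch of what using it might look like, assuming the DataDog/zstd package (I haven't actually run this from Go myself):

package main

import (
    "fmt"
    "log"

    "github.com/DataDog/zstd"
)

func main() {
    // Pretend this is a chunk of backup data.
    data := make([]byte, 1<<20)

    // Compress with the wrapper's default level.
    def, err := zstd.Compress(nil, data)
    if err != nil {
        log.Fatal(err)
    }

    // An explicit level trades size for speed; 1 is the fastest regular level.
    fast, err := zstd.CompressLevel(nil, data, 1)
    if err != nil {
        log.Fatal(err)
    }

    // Round-trip to make sure nothing was lost.
    back, err := zstd.Decompress(nil, def)
    if err != nil || len(back) != len(data) {
        log.Fatal("round-trip failed")
    }

    fmt.Printf("default: %d bytes, level 1: %d bytes\n", len(def), len(fast))
}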
@leerspace, you’re pretty much right. I need to properly update that docs page… I updated the default blob size to 1 MB from 256 KB but it looks like I missed an edit. Let me add some details.
The algorithm is something like this: walk the backup directory, split each file into fixed-size blobs, name each blob by its SHA-256 hash, upload any blob the storage doesn't already have, and record the files and directories in a snapshot file that looks roughly like:
[
  {"type": "file", "path": ..., "blobs": [<sha256_1>, <sha256_2>, ...]},
  {"type": "dir", "path": ...},
  ...
]
Pretty simple. It’s just a one-pass algorithm. No variable-sized chunking, although that might change in the future.
Oh, and I compress every file mentioned above using zstd and encrypt it with the OpenSSL crypto library before storing.
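If it helps to see the shape of that in code, here is a rough Go sketch of the flow above. BlobBackup itself isn't written in Go, and everything here (the Entry struct, the constants, the missing upload/caching and encryption steps) is invented just for illustration:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

const blobSize = 1 << 20 // 1 MB fixed-size blobs

// Entry is one record of the snapshot file shown above.
type Entry struct {
    Type  string   `json:"type"`
    Path  string   `json:"path"`
    Blobs []string `json:"blobs,omitempty"`
}

// splitFile reads a file in 1 MB pieces and returns the SHA-256 of each piece.
// A real tool would also compress, encrypt, and upload blobs it hasn't seen.
func splitFile(path string) ([]string, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var hashes []string
    buf := make([]byte, blobSize)
    for {
        n, err := io.ReadFull(f, buf)
        if n > 0 {
            sum := sha256.Sum256(buf[:n])
            hashes = append(hashes, hex.EncodeToString(sum[:]))
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            break
        }
        if err != nil {
            return nil, err
        }
    }
    return hashes, nil
}

func main() {
    root := "."
    if len(os.Args) > 1 {
        root = os.Args[1]
    }

    // Single pass over the directory tree, building the snapshot as we go.
    var snapshot []Entry
    err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        if info.IsDir() {
            snapshot = append(snapshot, Entry{Type: "dir", Path: path})
            return nil
        }
        blobs, err := splitFile(path)
        if err != nil {
            return err
        }
        snapshot = append(snapshot, Entry{Type: "file", Path: path, Blobs: blobs})
        return nil
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Printf("%d snapshot entries\n", len(snapshot))
}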
@bimba thanks for the explanation. Duplicacy supports fixed-size chunking too, which is known to be faster than variable-size chunking because there is no need to run the rolling checksum algorithm. But I feel fixed-size chunking is more suitable for use cases where there aren’t many small files (such as backing up virtual machine images). If you don’t pack small files, then each file will require 2 API calls (a lookup and an upload) no matter how small it is, and this is going to be a significant source of overhead.
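For readers following along, "packing" just means appending many small files into a running buffer and cutting chunks from that stream instead of per file, so one upload can carry many files. A rough sketch of the idea (not Duplicacy's actual code; the names are made up):

package main

import (
    "crypto/sha256"
    "fmt"
)

const targetChunkSize = 4 << 20 // e.g. a 4 MB target chunk

// packFiles appends small file contents into a running buffer and emits a
// chunk whenever the buffer reaches the target size, so one API upload can
// carry many small files.
func packFiles(files [][]byte, upload func(chunk []byte)) {
    var buf []byte
    for _, data := range files {
        buf = append(buf, data...)
        for len(buf) >= targetChunkSize {
            upload(buf[:targetChunkSize])
            buf = append([]byte(nil), buf[targetChunkSize:]...)
        }
    }
    if len(buf) > 0 {
        upload(buf) // final partial chunk
    }
}

func main() {
    // Three tiny "files" end up in a single uploaded chunk.
    files := [][]byte{[]byte("a"), []byte("bb"), []byte("ccc")}
    packFiles(files, func(chunk []byte) {
        fmt.Printf("upload chunk %x (%d bytes)\n", sha256.Sum256(chunk), len(chunk))
    })
}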
Yeah, the API overhead (especially on slower backends like Google Drive) can really hurt performance. A lot of my users have been using Wasabi (which is fast) so thankfully they haven’t been getting killed in terms of speed.
I’d like to avoid adding small-file packing and variable-length chunking to keep complexity down, but we’ll see; it might become necessary at some point for certain use cases…