Size/Speed Benchmark: Duplicacy vs. Arq 6 vs. Duplicati vs. QBackup vs. BlobBackup

https://blobbackup.com/10_gb_benchmark.php

Thought this might be of interest here.

Are you the developer of BlobBackup?

Yes I am. Was an Arq user for a bit before I started working on it.

Thanks for making this and sharing it with us. What compression algorithm does BlobBackup use?

Zstandard. It looks like Duplicacy uses LZ4 (which would explain the speed/size of the backups)? I was one of the maintainers of zstd on GitHub for a while, and in my experience the size benefits of zstd almost always outweigh the speed gain of LZ4 (especially since zstd at negative levels is almost as fast).

Have you thought about using another algorithm for Duplicacy? There is a nice comparison here btw: GitHub - inikep/lzbench: lzbench is an in-memory benchmark of open-source LZ77/LZSS/LZMA compressors
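
To make that tradeoff concrete, here is a rough Go sketch that compresses the same synthetic buffer with LZ4 and with zstd at a few levels, including a negative (fast) one, and prints size and time. The pierrec/lz4 and DataDog/zstd packages are just convenient stand-ins, and the numbers depend entirely on the input data, so treat this as an illustration rather than a benchmark.

package main

import (
    "bytes"
    "fmt"
    "math/rand"
    "time"

    "github.com/DataDog/zstd"
    "github.com/pierrec/lz4/v4"
)

// timeIt runs fn, then prints the compressed size it returns and the elapsed time.
func timeIt(name string, fn func() int) {
    start := time.Now()
    size := fn()
    fmt.Printf("%-8s %9d bytes  %v\n", name, size, time.Since(start))
}

func main() {
    // Mildly compressible synthetic data; real ratios depend on the input.
    data := make([]byte, 8<<20)
    for i := range data {
        data[i] = byte(rand.Intn(64))
    }

    timeIt("lz4", func() int {
        var buf bytes.Buffer
        w := lz4.NewWriter(&buf)
        if _, err := w.Write(data); err != nil {
            panic(err)
        }
        if err := w.Close(); err != nil {
            panic(err)
        }
        return buf.Len()
    })

    // Negative zstd levels trade ratio for speed (they need a reasonably
    // recent libzstd); level 3 is the usual default.
    for _, level := range []int{-5, 1, 3} {
        timeIt(fmt.Sprintf("zstd %d", level), func() int {
            out, err := zstd.CompressLevel(nil, data, level)
            if err != nil {
                panic(err)
            }
            return len(out)
        })
    }
}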


Heard about zstd before. I think one of the main advantages is that adjusting the compression level gives a real choice between compression time and space, compared to many other algorithms.

LZ4 is still faster at lower compression levels. IMO, it’s more important to have fast backups than to eke out a little bit of (local) storage space or worry too much about restore speeds… However, zstd would be awesome to have as an additional algorithm - for secondary copies to cloud storages. (Or for the initial storage, if you didn’t mind waiting a bit longer.) I’ve noticed a few projects out there that combine LZ4 and zstd and choose between them based on level and content type.
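
As a sketch of what "choose between them" could look like in practice: a compressor picked per storage tier, LZ4 for the fast local copy and zstd at a high level for the remote one. The tier names and the level here are invented purely for illustration; this is not an option either tool exposes today.

package main

import (
    "bytes"
    "fmt"
    "io"

    "github.com/DataDog/zstd"
    "github.com/pierrec/lz4/v4"
)

// newCompressor picks a compressor per storage tier: LZ4 for a fast local
// target, zstd at a high level for a remote copy where size matters more.
// The "tier" concept exists only in this sketch.
func newCompressor(tier string, w io.Writer) io.WriteCloser {
    if tier == "local" {
        return lz4.NewWriter(w)
    }
    return zstd.NewWriterLevel(w, 19) // slower, but noticeably smaller
}

func main() {
    payload := bytes.Repeat([]byte("the same chunk data "), 2000)
    for _, tier := range []string{"local", "cloud"} {
        var buf bytes.Buffer
        w := newCompressor(tier, &buf)
        if _, err := w.Write(payload); err != nil {
            panic(err)
        }
        if err := w.Close(); err != nil {
            panic(err)
        }
        fmt.Printf("%-5s compressed to %d bytes\n", tier, buf.Len())
    }
}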

The advantages are obvious. A local storage could use LZ4 for fast local workstation backups, with Erasure Coding and perhaps RSA encryption protecting that copy. Copying this storage to a copy-compatible cloud-based storage could use zstd and do away with the Erasure Coding, saving even more space.

On a provider like B2, where you have to pay for downloads, a restore would be so much faster thanks to the compressed size. And you could adjust the compression levels to taste.

Very interesting!


Unfortunately, zstd doesn’t seem to have a pure Go implementation, so it is unlikely to make it into Duplicacy.

@bimba while you’re here, do you mind explaining a little bit about the core algorithms of BlobBackup? For instance, is chunking fixed-size or variable-size (a choice which affects backup speed), and how do you pack small files?

I spent some time looking through the documentation out of personal interest (originally saw a post somewhere on reddit about BlobBackup), but @bimba please correct me if I’m wrong about how it works.

  1. BlobBackup uses fixed-size chunking only
  2. The default max chunk size is either 256 KB or 1 MB? The documentation seems ambiguous to me:
  • one place says “Data in the blobs directory will be between 0 and BLOB_SIZE (defaults to 1MB)”
  • another says “Blob size: the max size of the blobs that files are split into before uploading. By default, this will be 256 KB.”
  3. Small files don’t seem to be packed

For cloud storage, it seems like BlobBackup’s defaults (with smaller chunk size, and no packing of small files?) would result in more chunks and more API calls – relative to Duplicacy’s defaults.
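
As a rough back-of-the-envelope example: the 10 GB benchmark set split into 1 MB blobs is on the order of 10,000 objects (plus one object for every file smaller than 1 MB), while Duplicacy’s default 4 MB average chunk size works out to roughly 2,500 chunks for the same data, so the per-object request overhead can add up quickly on metered or rate-limited backends.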

@gchen

Unfortunately, zstd doesn’t seem to have a pure Go implementation, so it is unlikely to make it into Duplicacy.

Ahh, that’s too bad. I believe there are some wrappers for the reference C implementation though, no? GitHub - DataDog/zstd: Zstd wrapper for Go seems like one example. I haven’t programmed in Go before, so I’m not sure what the relevant factors are when choosing a library, but is it necessary that the library be pure Go?
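
For reference, using that wrapper from Go looks roughly like the sketch below; it is a cgo binding around the bundled C library, which is presumably the sticking point for a project that wants to stay pure Go (cross-compilation and build simplicity).

package main

import (
    "bytes"
    "fmt"

    "github.com/DataDog/zstd" // cgo wrapper around the reference C implementation
)

func main() {
    data := bytes.Repeat([]byte("duplicacy chunk data "), 1000)

    // CompressLevel hands the level straight to libzstd; 3 is the usual default.
    compressed, err := zstd.CompressLevel(nil, data, 3)
    if err != nil {
        panic(err)
    }
    restored, err := zstd.Decompress(nil, compressed)
    if err != nil {
        panic(err)
    }
    fmt.Printf("original=%d compressed=%d roundtrip_ok=%v\n",
        len(data), len(compressed), bytes.Equal(data, restored))
}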

@leerspace, you’re pretty much right. I need to properly update that docs page…I updated the default blob size to 1MB from 256 KB but it looks like I missed an edit. Let me add some details.

The algorithm is something like this:

  • Read 1MB fixed-size blobs from each file
  • If the file is less than 1MB, just read the whole file (so no packing)
  • Take a salted SHA256 of the content and upload the blob to /blobs/SHA256 if it doesn’t already exist
  • All the while, accumulate details in an object that looks something like this, and serialize/save it as a JSON file once all the file paths have been traversed (the snapshot file):
[
  {"type": "file", "path": ..., "blobs": [<sha256_1>, <sha256_2>, ...]},
  {"type": "dir", "path": ...},
  ...
]

Pretty simple. It’s just a one-pass algorithm. No variable-sized chunking, although that might change in the future.
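
For illustration, that flow might look roughly like the Go sketch below. This is not BlobBackup’s actual code: the salt handling, field names, and local “store” layout are guesses, and compression and encryption are omitted here.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

const blobSize = 1 << 20 // 1 MiB fixed-size blobs, per the description above

// entry mirrors the snapshot records described above (field names are guesses).
type entry struct {
    Type  string   `json:"type"`
    Path  string   `json:"path"`
    Blobs []string `json:"blobs,omitempty"`
}

// backupFile splits one file into fixed-size blobs, names each blob by a
// salted SHA-256 of its content, writes it to storeDir/blobs/<hash> if it is
// not already there (that is the deduplication), and returns a snapshot entry.
func backupFile(path, storeDir string, salt []byte) (entry, error) {
    f, err := os.Open(path)
    if err != nil {
        return entry{}, err
    }
    defer f.Close()

    e := entry{Type: "file", Path: path}
    buf := make([]byte, blobSize)
    for {
        n, err := io.ReadFull(f, buf)
        if n > 0 {
            sum := sha256.Sum256(append(append([]byte{}, salt...), buf[:n]...))
            name := hex.EncodeToString(sum[:])
            blobPath := filepath.Join(storeDir, "blobs", name)
            if _, statErr := os.Stat(blobPath); os.IsNotExist(statErr) {
                // Compression and encryption would happen here before writing.
                if werr := os.WriteFile(blobPath, buf[:n], 0o644); werr != nil {
                    return entry{}, werr
                }
            }
            e.Blobs = append(e.Blobs, name)
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            break // reached the end of the file
        }
        if err != nil {
            return entry{}, err
        }
    }
    return e, nil
}

func main() {
    if len(os.Args) < 2 {
        fmt.Println("usage: sketch <file-to-back-up>")
        return
    }
    storeDir := "backup-store" // illustrative local target directory
    if err := os.MkdirAll(filepath.Join(storeDir, "blobs"), 0o755); err != nil {
        panic(err)
    }
    salt := []byte("example-salt") // a real tool derives this from its config

    var snapshot []entry // one entry per file/dir; saved as the snapshot JSON
    e, err := backupFile(os.Args[1], storeDir, salt)
    if err != nil {
        panic(err)
    }
    snapshot = append(snapshot, e)

    out, _ := json.MarshalIndent(snapshot, "", "  ")
    fmt.Println(string(out))
}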


Oh, and every blob mentioned above gets compressed with zstd and encrypted with the OpenSSL crypto library before being stored.
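
Sketching that last step: compress the blob first, then encrypt it (encrypting first would make the data incompressible). The cipher and mode below (AES-256-GCM from Go’s standard library) and the key/nonce handling are illustrative assumptions, not BlobBackup’s documented format.

package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "fmt"

    "github.com/DataDog/zstd"
)

// sealBlob compresses a blob with zstd and then encrypts it with AES-256-GCM.
// The compress-then-encrypt order matters; the cipher, mode, and nonce
// handling here are illustrative choices only.
func sealBlob(key, plaintext []byte) ([]byte, error) {
    compressed, err := zstd.Compress(nil, plaintext)
    if err != nil {
        return nil, err
    }
    block, err := aes.NewCipher(key) // key must be 32 bytes for AES-256
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := rand.Read(nonce); err != nil {
        return nil, err
    }
    // Prepend the nonce so the blob can be decrypted later.
    return gcm.Seal(nonce, nonce, compressed, nil), nil
}

func main() {
    key := make([]byte, 32)
    if _, err := rand.Read(key); err != nil {
        panic(err)
    }
    sealed, err := sealBlob(key, []byte("example blob content"))
    if err != nil {
        panic(err)
    }
    fmt.Printf("sealed blob: %d bytes\n", len(sealed))
}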

@bimba thanks for the explanation. Duplicacy supports fixed-size chunking too, which is known to be faster than variable-size chunking because there is no need to run the rolling checksum algorithm. But I feel fixed-size chunking is more suitable for use cases where there aren’t many small files (such as backing up virtual machine images). If you don’t pack small files, then each file will require 2 API calls (a lookup and an upload) no matter how small it is, and this is going to be a significant source of overhead.
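
For contrast with fixed-size blobs, here is a toy sketch of what variable-size (content-defined) chunking with a rolling checksum looks like. The additive rolling hash and the tiny size limits are deliberately simplified; real chunkers (Duplicacy’s included) use stronger rolling hashes and much larger chunk sizes, but the shape of the loop is the same.

package main

import (
    "crypto/rand"
    "fmt"
)

const (
    windowSize = 48            // bytes in the rolling window
    minChunk   = 1 << 10       // 1 KiB lower bound
    maxChunk   = 8 << 10       // 8 KiB upper bound
    mask       = (1 << 11) - 1 // ~2 KiB average chunk size
)

// split returns chunk sizes for data using a toy additive rolling hash:
// a boundary is declared when the low bits of the window hash are zero.
// Because cut points depend only on content, an insertion early in a file
// only shifts boundaries locally instead of re-cutting everything after it.
func split(data []byte) []int {
    var sizes []int
    start := 0
    var hash uint32
    for i := range data {
        hash += uint32(data[i]) // byte enters the window
        if i-start >= windowSize {
            hash -= uint32(data[i-windowSize]) // byte leaves the window
        }
        size := i - start + 1
        if (size >= minChunk && hash&mask == 0) || size >= maxChunk {
            sizes = append(sizes, size)
            start, hash = i+1, 0
        }
    }
    if start < len(data) {
        sizes = append(sizes, len(data)-start)
    }
    return sizes
}

func main() {
    data := make([]byte, 256<<10) // 256 KiB of random test data
    if _, err := rand.Read(data); err != nil {
        panic(err)
    }
    sizes := split(data)
    fmt.Printf("%d chunks, first few sizes: %v\n", len(sizes), sizes[:5])
}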


Yeah, the API overhead (especially on slower backends like Google Drive) can really hurt performance. A lot of my users have been using Wasabi (which is fast), so they haven’t been getting killed in terms of speed, thankfully.

I’d like to avoid adding small-file packing and variable-length chunking to keep complexity down, but we’ll see; it might become necessary at some point for certain use cases…
