https://blobbackup.com/10_gb_benchmark.php
Thought this might be of interest here.
Are you the developer of BlobBackup?
Yes I am. Was an Arq user for a bit before I started working on it.
Thanks for making this and sharing it with us. What compression algorithm does BlobBackup use?
Zstandard. It looks like Duplicacy uses LZ4 (which would explain the speed/size of the backups)? I was one of the maintainers of zstd on GitHub for a while, and in my experience the size benefits of zstd almost always outweigh the speed gain of lz4 (especially since zstd's negative levels are almost as fast).
Have you thought about using another algorithm for Duplicacy? There is a nice comparison here btw: GitHub - inikep/lzbench: lzbench is an in-memory benchmark of open-source LZ77/LZSS/LZMA compressors
I'd heard about zstd before. I think one of its main advantages, compared to many other algorithms, is that adjusting the compression level gives a real choice between compression time and space.
LZ4 is still faster at lower compression levels. IMO, it’s more important to have fast backups than to eke out a little bit of (local) storage space or worry too much about restore speeds… However, zstd would be awesome to have as an additional algorithm - for secondary copies to cloud storages. (Or for the initial storage if you didn’t mind waiting a bit longer.) I’ve noticed a few projects out there that combine LZ4 and zstd and choose between them based on level and content type.
The advantages are obvious. A local storage could use LZ4 for fast local workstation backups, with Erasure Coding and perhaps RSA encryption protecting that copy. Copying this storage to a copy-compatible cloud-based storage could use zstd and do away with the Erasure Coding, saving even more space.
On a provider like B2, where you have to pay for downloads, a restore would be so much faster thanks to the compressed size. And you could adjust the compression levels to taste.
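Just to make the idea concrete, here is a rough Go sketch of picking a compressor per storage. The Compressor type and the storage names are invented for illustration; it assumes the DataDog/zstd cgo wrapper and the pierrec/lz4 package, which is not what Duplicacy actually uses today:

package main

import (
    "bytes"
    "fmt"
    "io"
    "log"

    "github.com/DataDog/zstd"
    "github.com/pierrec/lz4/v4"
)

// Compressor compresses one chunk before it is uploaded to a storage.
type Compressor func(chunk []byte) ([]byte, error)

// lz4Compress favours speed, for the fast local workstation storage.
func lz4Compress(chunk []byte) ([]byte, error) {
    var buf bytes.Buffer
    w := lz4.NewWriter(&buf)
    if _, err := io.Copy(w, bytes.NewReader(chunk)); err != nil {
        return nil, err
    }
    if err := w.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

// zstdCompress favours size, for the copy sent to cloud storage.
func zstdCompress(chunk []byte) ([]byte, error) {
    return zstd.CompressLevel(nil, chunk, 3) // 3 is zstd's own default level
}

// compressorFor picks an algorithm based on where the storage lives.
func compressorFor(storage string) Compressor {
    if storage == "local" {
        return lz4Compress
    }
    return zstdCompress
}

func main() {
    chunk := bytes.Repeat([]byte("some chunk data "), 1024)
    for _, storage := range []string{"local", "b2"} {
        out, err := compressorFor(storage)(chunk)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%-5s %d -> %d bytes\n", storage, len(chunk), len(out))
    }
}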
Very interesting!
Unfortunately it doesn’t seem zstd has a pure Go implementation, so it is unlikely to make it into Duplicacy.
@bimba while you’re here, do you mind explaining a little bit about the core algorithms of BlobBackup? For instance, is the chunking fixed-size or variable-size (a choice that affects backup speed), and how do you pack small files?
I spent some time looking through the documentation out of personal interest (originally saw a post somewhere on reddit about BlobBackup), but @bimba please correct me if I’m wrong about how it works.
For cloud storage, it seems like BlobBackup’s defaults (with smaller chunk size, and no packing of small files?) would result in more chunks and more API calls – relative to Duplicacy’s defaults.
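As a rough back-of-the-envelope example (assuming I have the defaults right): a 10 GB source at a 1 MB blob size is on the order of 10,000 blobs, versus roughly 2,500 chunks at Duplicacy’s 4 MB default average chunk size. If each object needs a lookup plus an upload, that is roughly 20,000 API calls versus 5,000, before any deduplication.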
Unfortunately it doesn’t seem zstd has a pure Go implementation, so it is unlikely to make it into Duplicacy.
Ahh that's too bad. I believe there are some wrappers for the reference C implementation though, no? GitHub - DataDog/zstd: Zstd wrapper for Go seems like an example of one. I haven't programmed in Go before so I'm not sure what the relevant factors are for using a library, but is it necessary that the library be pure Go?
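Looking at its README, the wrapper seems to expose a small byte-slice API. Here is a minimal sketch of what using it might look like, assuming the DataDog/zstd package (I haven't actually run this from Go myself):

package main

import (
    "fmt"
    "log"

    "github.com/DataDog/zstd"
)

func main() {
    // Pretend this is a chunk of backup data.
    data := make([]byte, 1<<20)

    // Compress with the wrapper's default level.
    def, err := zstd.Compress(nil, data)
    if err != nil {
        log.Fatal(err)
    }

    // An explicit level trades size for speed; 1 is the fastest regular level.
    fast, err := zstd.CompressLevel(nil, data, 1)
    if err != nil {
        log.Fatal(err)
    }

    // Round-trip to make sure nothing was lost.
    back, err := zstd.Decompress(nil, def)
    if err != nil || len(back) != len(data) {
        log.Fatal("round-trip failed")
    }

    fmt.Printf("default: %d bytes, level 1: %d bytes\n", len(def), len(fast))
}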
@leerspace, you’re pretty much right. I need to properly update that docs page… I updated the default blob size to 1 MB from 256 KB but it looks like I missed an edit. Let me add some details.
The algorithm is something like this: walk the backup directory, split each file into fixed-size blobs, name each blob by its SHA-256 hash, upload any blob the storage doesn't already have, and record the files and directories in a snapshot file that looks roughly like:
[
  {"type": "file", "path": ..., "blobs": [<sha256_1>, <sha256_2>, ...]},
  {"type": "dir", "path": ...},
  ...
]
Pretty simple. It’s just a one-pass algorithm. No variable-sized chunking, although that might change in the future.
Oh, and I compress every file mentioned above using zstd and encrypt it with the OpenSSL crypto library before storing.
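If it helps to see the shape of that in code, here is a rough Go sketch of the flow above. BlobBackup itself isn't written in Go, and everything here (the Entry struct, the constants, the missing upload/caching and encryption steps) is invented just for illustration:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

const blobSize = 1 << 20 // 1 MB fixed-size blobs

// Entry is one record of the snapshot file shown above.
type Entry struct {
    Type  string   `json:"type"`
    Path  string   `json:"path"`
    Blobs []string `json:"blobs,omitempty"`
}

// splitFile reads a file in 1 MB pieces and returns the SHA-256 of each piece.
// A real tool would also compress, encrypt, and upload blobs it hasn't seen.
func splitFile(path string) ([]string, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var hashes []string
    buf := make([]byte, blobSize)
    for {
        n, err := io.ReadFull(f, buf)
        if n > 0 {
            sum := sha256.Sum256(buf[:n])
            hashes = append(hashes, hex.EncodeToString(sum[:]))
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            break
        }
        if err != nil {
            return nil, err
        }
    }
    return hashes, nil
}

func main() {
    root := "."
    if len(os.Args) > 1 {
        root = os.Args[1]
    }

    // Single pass over the directory tree, building the snapshot as we go.
    var snapshot []Entry
    err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        if info.IsDir() {
            snapshot = append(snapshot, Entry{Type: "dir", Path: path})
            return nil
        }
        blobs, err := splitFile(path)
        if err != nil {
            return err
        }
        snapshot = append(snapshot, Entry{Type: "file", Path: path, Blobs: blobs})
        return nil
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Printf("%d snapshot entries\n", len(snapshot))
}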
@bimba thanks for the explanation. Duplicacy supports fixed-size chunking too, which is known to be faster than variable-size chunking because there is no need to run the rolling checksum algorithm. But I feel fixed-size chunking is more suitable for use cases where there aren’t many small files (such as backing up virtual machine images). If you don’t pack small files, then each file will require 2 API calls (a lookup and an upload) no matter how small it is, and this is going to be a significant source of overhead.
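For readers following along, "packing" just means appending many small files into a running buffer and cutting chunks from that stream instead of per file, so one upload can carry many files. A rough sketch of the idea (not Duplicacy's actual code; the names are made up):

package main

import (
    "crypto/sha256"
    "fmt"
)

const targetChunkSize = 4 << 20 // e.g. a 4 MB target chunk

// packFiles appends small file contents into a running buffer and emits a
// chunk whenever the buffer reaches the target size, so one API upload can
// carry many small files.
func packFiles(files [][]byte, upload func(chunk []byte)) {
    var buf []byte
    for _, data := range files {
        buf = append(buf, data...)
        for len(buf) >= targetChunkSize {
            upload(buf[:targetChunkSize])
            buf = append([]byte(nil), buf[targetChunkSize:]...)
        }
    }
    if len(buf) > 0 {
        upload(buf) // final partial chunk
    }
}

func main() {
    // Three tiny "files" end up in a single uploaded chunk.
    files := [][]byte{[]byte("a"), []byte("bb"), []byte("ccc")}
    packFiles(files, func(chunk []byte) {
        fmt.Printf("upload chunk %x (%d bytes)\n", sha256.Sum256(chunk), len(chunk))
    })
}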
Yeah, the API overhead (especially on slower backends like Google Drive) can really hurt performance. A lot of my users have been using Wasabi (which is fast) so thankfully they haven’t been getting killed in terms of speed.
I’d like to avoid adding small-file packing and variable-length chunking to keep complexity down, but we’ll see; it might become necessary at some point for certain use cases…