What file type is best for backing up locally (tar vs. gzip vs. zstd) when the local backup is then backed up to B2?

For some of my applications I have the option, when backing up locally, to

  1. Not compress the backup (backup is a tar file)
  2. Compress with gzip (backup is a tar.gz file)
  3. Compress with zstd (backup is a tar.zst file)
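
To make these three options concrete, here is a rough Python sketch of how each kind of local archive could be produced. The folder and file names are placeholders, and the .zst case assumes the third-party zstandard package, since the tarfile module I'm assuming here has no built-in zstd mode:

```python
import tarfile

import zstandard  # third-party: pip install zstandard (assumed, not part of Duplicacy)

SOURCE_DIR = "app-data"  # hypothetical folder holding the application's data

# Option 1: plain tar, no compression
with tarfile.open("backup.tar", "w") as tf:
    tf.add(SOURCE_DIR)

# Option 2: gzip-compressed tar
with tarfile.open("backup.tar.gz", "w:gz") as tf:
    tf.add(SOURCE_DIR)

# Option 3: zstd-compressed tar; the tar stream is piped through a ZstdCompressor
with open("backup.tar.zst", "wb") as raw:
    with zstandard.ZstdCompressor().stream_writer(raw) as zst:
        with tarfile.open(fileobj=zst, mode="w|") as tf:
            tf.add(SOURCE_DIR)
```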

I store all of my backups in a specific folder on my server, and I then point Duplicacy to that folder. Duplicacy then backs up my backups to B2.

My understanding is that the Duplicacy GUI defaults to backing up everything to B2 using gzip compression. That’s great, but it’s also where my question comes in.

Knowing that Duplicacy is going to back up my files/folders using gzip, is there a preference for tar vs. tar.gz vs. tar.zst for my local files? For example, is Duplicacy able to see the files within a tar archive and avoid storing duplicates of them on my B2 storage, thereby saving on storage costs?

My question is purely about what saves the most space when backing up multiple revisions of a backup to B2. Locally, I save the most space with zstd, then gzip, then plain tar. zstd is the fastest, followed closely by tar, and gzip is about 2x slower than both of those. So my preference would be to store zstd rather than plain tar locally, since it’s slightly faster than tar while reducing the size of the backup by about 3x. But if that’s going to end up creating even more data in my B2 bucket compared to a plain tar file, I’d rather store a larger file locally in order to save on B2 storage costs in the long run. Hopefully that makes sense.

Compressed data is effectively non-deduplicatable, so to have the best chance of deduplication, don’t compress the source data.

How much that matters depends on how much the data changes. You may want to experiment with your specific data and see what works best.

Duplicacy does not look into archives. It makes a long sausage of all files and shreds it into pieces at some smart boundaries.
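
To picture that sausage-and-boundaries step, here is a minimal Python sketch of content-defined chunking. It is not Duplicacy’s actual algorithm (Duplicacy uses its own rolling hash and chunk-size settings), and the window size and mask below are made-up parameters; the point is only that the cut points are derived from the bytes themselves, so identical runs of data produce identical chunks wherever they appear in the stream:

```python
import hashlib

# Hypothetical parameters, not Duplicacy's real defaults.
WINDOW = 48           # bytes of trailing context fed into the boundary test
MASK = (1 << 13) - 1  # roughly 8 KiB average distance between boundaries
MIN_CHUNK = 2048
MAX_CHUNK = 65536

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    start = 0
    pos = 0
    n = len(data)
    while pos < n:
        pos += 1
        if pos - start < MIN_CHUNK:
            continue
        window = data[pos - WINDOW:pos]
        # Cheap stand-in for a rolling hash: hash the trailing window
        # and cut when the low bits are all zero.
        h = int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")
        if (h & MASK) == 0 or pos - start >= MAX_CHUNK:
            yield start, pos
            start = pos
    if start < n:
        yield start, n

def chunk_ids(data: bytes):
    """Hash each chunk; equal chunks get equal ids, which is what enables dedup."""
    return [hashlib.sha256(data[s:e]).hexdigest() for s, e in chunk_boundaries(data)]
```

Because a boundary depends only on the bytes just before it, a change in one place disturbs the chunks around that spot, while chunks further along still land on the same boundaries and hash to the same values.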

Thank you for the quick response!

Knowing that Duplicacy will not look inside a tar file, if I can only choose between tar, tar.gz, and tar.zst, then tar.zst would be my best option since it results in the smallest file size.

The next best option could be to not archive the files at all, which would allow deduplication to work. I’d have to test whether that actually creates smaller backups, but it’s the only way I could get deduplication.

Does that sound correct?

Since .tar is not compressed (it’s just all the files glued together), from Duplicacy’s deduplication perspective it does not matter whether the files are in a .tar or sit plainly in the filesystem: it deduplicates chunks after shredding that big sausage. In other words, there is no difference between gluing the files into a tar first and then gluing that to the rest of the files, or putting each file into the sausage independently.

But if the tar is compressed, then small changes in the source data can result in large changes in the compressed data, producing many new chunks and therefore much poorer deduplication.
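
Here is a hedged way to see that effect using the chunk_ids sketch from above: build two revisions of the same in-memory tar where only a small file near the start changes, then compare how many chunks the second revision shares with the first, for the uncompressed and the gzipped archive. The helpers and the test data are made up for illustration; exact numbers will vary, but the uncompressed tar typically keeps almost all of its chunks while the gzipped one loses most of them:

```python
import gzip
import io
import tarfile

def tar_bytes(files: dict[str, bytes]) -> bytes:
    """Build an in-memory tar from a {name: content} mapping (illustrative helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tf.addfile(info, io.BytesIO(content))
    return buf.getvalue()

def shared_fraction(old: bytes, new: bytes) -> float:
    """Fraction of the new revision's chunks that already exist in the old one."""
    old_ids = set(chunk_ids(old))   # chunk_ids from the sketch above
    new_ids = chunk_ids(new)
    return sum(1 for c in new_ids if c in old_ids) / len(new_ids)

# A small file that changes, followed by a large file that does not.
big = b"".join(b"log line %09d lorem ipsum dolor sit amet\n" % i for i in range(20_000))
rev1 = tar_bytes({"config.txt": b"alpha\n", "big.log": big})
rev2 = tar_bytes({"config.txt": b"alpha\nbeta\n", "big.log": big})

print("uncompressed tar:", shared_fraction(rev1, rev2))  # typically near 1.0
print("gzipped tar:     ", shared_fraction(gzip.compress(rev1), gzip.compress(rev2)))  # typically far lower
```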

Okay, thank you again!

In this case, I’ll back up everything into an uncompressed tar file. It will take up more space locally, but I don’t care too much about that. It gets created almost as fast as a zstd-compressed archive, and it will allow Duplicacy to deduplicate chunks of the tar file before compressing them with gzip and storing them in my B2 bucket.


If you don’t care much about local file sizes, I’d echo the suggestion of an uncompressed tar; that should likely get you the best compression on the remote. The edge case is that if you get hardly any deduplication between snapshots, then compressing with a better local compressor might come out ahead (since Duplicacy doesn’t maximize compression).

If you care about local storage as well, you may want to experiment with some light compressors (e.g. lz4). Sometimes light compression allows for massive savings locally while still being somewhat deduplicatable if it runs on blocks. This is a common use case for disk images.
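
As an illustration of why block-wise light compression can stay deduplicatable, here is a hedged Python sketch (it assumes the third-party lz4 package, and the block size and the whole setup are made up). Each fixed-size block is compressed independently, so a block whose source bytes did not change compresses to byte-identical output, and a chunk-based deduplicator can still recognize it:

```python
import hashlib

import lz4.frame  # third-party: pip install lz4 (assumed installed)

BLOCK_SIZE = 1 << 20  # 1 MiB blocks (hypothetical choice)

def compress_blockwise(data: bytes) -> list[bytes]:
    """Compress each fixed-size block independently, so unchanged blocks
    produce byte-identical compressed output across revisions."""
    return [
        lz4.frame.compress(data[i:i + BLOCK_SIZE])
        for i in range(0, len(data), BLOCK_SIZE)
    ]

def block_ids(blocks: list[bytes]) -> list[str]:
    """Hash each compressed block; equal ids mean the block can be deduplicated."""
    return [hashlib.sha256(b).hexdigest() for b in blocks]

# Toy demo: change one byte in the middle and see that only that block's
# compressed output changes.
rev1 = bytearray(8 * BLOCK_SIZE)
rev2 = bytearray(rev1)
rev2[3 * BLOCK_SIZE + 10] = 0xFF

ids1 = block_ids(compress_blockwise(bytes(rev1)))
ids2 = block_ids(compress_blockwise(bytes(rev2)))
print(sum(a == b for a, b in zip(ids1, ids2)), "of", len(ids1), "blocks unchanged")
# Expected: 7 of 8 blocks unchanged.
```

The caveat is that this only helps if the deduplicator’s chunk boundaries line up with the compressed blocks; compressing the whole stream in a single pass, as a plain tar.gz or tar.zst does, loses that property.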