Duplicacy vs Duplicati

I think you could just use local storage and complete the run in minutes rather than hours.

Nope, the idea is to do an off-site backup. I already do a local backup.

I’ll try 128k with Dropbox…

Very interesting results! Thanks a lot, towerbr, for the testing.

So the smaller chunk size leads to better results in the case of db files. But since the default chunk size is much bigger, I assume that for other types of backups the smaller chunk size will create other problems? Does this mean I should exclude db files (does anyone have a regex?) from all my backups and create a separate backup for those (possibly one for each repository)? Not very convenient, but I guess we’re seeing a clear downside of Duplicacy’s killer feature, cross-repository deduplication…

Another question concerning duplicacy’s storage use:

In the end, the space used in the backend (including the 3 versions, of course) was:

Duplicati: 696 Mb
Duplicacy: 1,117 Mb

Unless I’m misunderstanding something, that difference will be significantly reduced once you run a prune operation, right?

But then again, you will only save space at the cost of eliminating a revision compared to duplicati. So the original comparison is fair. Ironically, this means that duplicati is actually doing a (much) better job at deduplication between revisions…

Hm, this actually further worsens the space use disadvantage of duplicacy: I believe the original tests posted on the duplicati forum did not include multiple revisions. Now we know that the difference in storage use will actually keep growing over time (i.e. with every revision), especially when small changes are made in large files.

Does this mean I should exclude db files (does anyone have a regex?)
from all my backups and create a separate backup for those (possibly one
for each repository)? Not very convenient…

Exactly! In the case of this test, I think the solution would be to do an init with the normal chunk size (4M) for the Evernote folder, with an exclude pattern for the database, and an add with a smaller chunk size (maybe 128k) covering only the database via an include pattern.

The problem is that in the case of Evernote it is very easy to separate via include/exclude patterns, because there is only one SQLite file (a rough filters sketch follows below). But for other applications, with multiple databases, it is not practical. And it’s also not practical for mbox files, for example.

So, in the end, I would stick with a single configuration (128k) applied to the whole folder (without an add), which is what I am testing now.
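For reference, regarding the “does anyone have a regex?” question above: Duplicacy reads include/exclude patterns from the repository’s .duplicacy/filters file, and as far as I understand the documented syntax, e: introduces a regex exclude and i: a regex include (worth double-checking against the current filters documentation). A rough sketch that excludes typical local database files (the extensions are just examples) would be a single exclude line:

e:(?i)\.(db|sqlite|sqlite3|exb)$

The separate small-chunk backup containing only the database would then use the corresponding i: include pattern instead.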

… the difference in storage use will actually keep growing
over time (i.e. with every revision), especially when small changes
are made in large files.

This is exactly my main concern.

This is exactly my main concern.

Are you worried that it might be so and are testing it, or do you think it is so but are sticking to duplicacy for other reasons?

Are you worried that it might be so and are testing it, or do you think it is so but
are sticking to duplicacy for other reasons?

I’m testing Duplicacy and Duplicati for backing up large files. At the moment I have some jobs of each one running daily. I haven’t decided yet.

To summarize briefly:

Duplicati has more features (which, on the other hand, exposes it to more failure points) and seems to me to deal better with large files (based on the performance of my jobs and this test so far). But the use of local databases and the constant warnings are its weak points.

Duplicacy seems to me to be simpler and more mature software, with some configuration complexity (init, add, storages, repositories, etc.), but it is starting to worry me when it comes to backups of large files.

Does this mean I should exclude db files (does anyone have a regex?) from all my backups and create a separate backup for those (possibly one for each repository)? Not very convenient…

I think an average chunk size of 1M should be good enough for general cases. The decision to use 4M was mostly due to considerations of reducing the number of chunks (before 2.0.10 all cloud storages used a flat chunk directory) and reducing the overhead ratio (single-threaded uploading and downloading in version 1). Now that we have a nested chunk structure for all storages, and multi-threading support, it perhaps makes sense to change the default size to 1M. There is another use case where 1M did much better than 4M:

Duplicacy, however, was never meant to achieve the best deduplication ratio for a single repository. I knew from the beginning that by adopting a relatively large chunk size, we were going to lose the deduplication battle to competitors. But this is a tradeoff we have to make, because the main goal is cross-client deduplication, which is completely worth the loss in deduplication efficiency on a single computer. For instance, suppose that you need to back up your Evernote database on two computers; then the storage saving brought by Duplicacy already outweighs the wasted space due to a much larger chunk size. Even if you don’t need to back up two computers, you can still benefit from this unique feature of Duplicacy – you can seed the initial backup on a computer with a faster internet connection, and then continue to run the regular backups on the computer with a slower connection.

I also wanted to add that the variable-size chunking algorithm used by Duplicacy is actually more stable than the fixed-size chunking algorithm in Duplicati, even on a single computer. Fixed-size chunking is susceptible to deletions and insertions, so when a few bytes are added to or removed from a large file, all previously split chunks after the insertion/deletion point will be affected due to changed offsets, and as a result a new set of chunks must be created. Such files include dump files from databases and unzipped tarball files.
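To make this concrete, here is a toy Python sketch (not Duplicacy’s actual chunking code, which uses its own rolling hash and chunk size parameters) showing what a single inserted byte does to each approach:

import hashlib
import random

def fixed_chunks(data, size=1024):
    # Split at fixed offsets: an insertion shifts every later boundary.
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data, mask=0x3FF):
    # Split where a simple rolling value hits a pattern (~1 KB average chunks).
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF   # toy rolling hash over recent bytes
        if (h & mask) == mask or i == len(data) - 1:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    return chunks

def digests(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}

random.seed(0)
original = bytes(random.getrandbits(8) for _ in range(64 * 1024))
modified = original[:10] + b"X" + original[10:]   # insert a single byte near the start

for name, split in (("fixed-size", fixed_chunks), ("content-defined", cdc_chunks)):
    old, new = digests(split(original)), digests(split(modified))
    print(f"{name}: {len(old - new)} of {len(old)} old chunks are no longer reusable")

With fixed-size chunking essentially every chunk after the insertion point changes, while the content-defined splitter resynchronizes after the first affected chunk, so the rest of the old chunks can be reused.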


you can seed the initial backup on a computer with a faster internet connection, and then
continue to run the regular backups on the computer with a slower connection.

This is indeed an advantage. However, to take advantage of it, the files must be almost identical on both computers, which does not apply to Evernote. But there are certainly applications for this kind of use.

The new initial backup with 128k chunks is finished. Took several hours and was interrupted several times, but I think it’s Dropbox’s “fault”.

It ended like this:

Files: 348 total, 892,984K bytes; 39 new, 885,363K bytes
File chunks: 7288 total, 892,984K bytes; 2362 new, 298,099K bytes, 251,240K bytes uploaded
Metadata chunks: 6 total, 598K bytes; 6 new, 598K bytes, 434K bytes uploaded
All chunks: 7294 total, 893,583K bytes; 2368 new, 298,697K bytes, 251,674K bytes uploaded

Interesting that the log above shows a total size of ~890Mb but the direct verification with Rclone in the remote folder shows 703Mb (?!).

What I’m going to do now is to run these two jobs (Duplicati and Duplicacy) for a few days of “normal” use and follow the results in a spreadsheet, so I’ll go back here and post the results.

The new initial backup with 128k chunks is finished. Took several hours and was interrupted several times, but I think it’s Dropbox’s “fault”.

I think we should retry on EOF when Dropbox closes the connection.

Interesting that the log above shows a total size of ~890Mb but the direct verification with Rclone in the remote folder shows 703Mb (?!).

The original size is 893Mb and what Rclone shows is the total size after compression and encryption.

I think we should retry on EOF when Dropbox closes the connection.

I agree, that’s what I did in this last backup and in the previous one (with 1M chunks). If the attempt was made shortly after the interruption, Dropbox would not accept it, but after waiting a few minutes and running the command again, the upload worked.
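A rough sketch of the retry-with-backoff idea being discussed (the upload_chunk callable is hypothetical; this is not Duplicacy’s actual code):

import time

def upload_with_retry(upload_chunk, data, retries=5, initial_delay=60):
    delay = initial_delay
    for attempt in range(retries):
        try:
            return upload_chunk(data)
        except (EOFError, ConnectionError):
            if attempt == retries - 1:
                raise              # give up after the last attempt
            time.sleep(delay)      # immediate retries were rejected, so wait first
            delay *= 2             # back off a bit more each time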

The original size is 893Mb and what Rclone shows is the total size after compression and encryption.

Ah, understood! I’m using Rclone in the script right after the backups, to obtain the size of the two remote folders.

It’s a shame that neither Dropbox nor Google Drive provides an easy way to view the size of a folder. In Dropbox you have to go through several steps in the web interface, and Google Drive is even worse, where you can only see usage against the overall quota.
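For reference, Rclone’s size command reports the total object count and total bytes of a remote path, which is handy for exactly this; a sketch (the remote name and paths are just placeholders for your own configuration):

rclone size Dropbox:Backups/duplicacy
rclone size Dropbox:Backups/duplicati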

I think an average chunk size of 1M should be good enough for general cases.

So by “average chunk size” you mean that it should not be 1M fixed chunks? Is there a way of changing the default chunk size on a specific computer?

The new initial backup with 128k chunks is finished.

BTW: these are fixed size chunks, right?

It ended like this:

Files: 348 total, 892,984K bytes; 39 new, 885,363K bytes
File chunks: 7288 total, 892,984K bytes; 2362 new, 298,099K bytes, 251,240K bytes uploaded
Metadata chunks: 6 total, 598K bytes; 6 new, 598K bytes, 434K bytes uploaded
All chunks: 7294 total, 893,583K bytes; 2368 new, 298,697K bytes, 251,674K bytes uploaded

So what exactly does this mean in terms of comparison with both duplicati and duplicacy with 1MB chunks?

Or is it rather the values from Rclone that should be compared?

Or is the initial backup size not dependent on chunk size at all?

BTW: these are fixed size chunks, right?

Yes, I used the command: duplicacy init -e -c 128K -max 128K -min 128K ...

So what exactly does this mean in terms of comparison with both duplicati and duplicacy with 1MB chunks?

Gilbert said that Duplicati uses 100k (fixed) chunks.

In my tests with Duplicacy above:

1st setting (4M chunks): 176 chunks - initial upload: 02:03

2nd setting (1M chunks): 1206 chunks - initial upload: 02:39

3rd setting (128k chunks): 7288 chunks - initial upload: several hours

Is this your question?

Or is the initial backup size not dependent on chunk size at all?

The smaller the chunk size, the greater the number of chunks. The number of upload requests is then also greater, which makes the total upload more time-consuming. And in the specific case of Dropbox, it seems to “dislike” so many requests.
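As a rough back-of-envelope sketch (assuming chunks come out at exactly the nominal size, which they don’t, so the real counts of 176 / 1206 / 7288 differ somewhat):

# Rough estimate of chunk-upload requests for this ~893 MB repository
# at the three nominal chunk sizes used in the tests.
repo_kb = 892_984
for chunk_kb in (4096, 1024, 128):
    print(f"{chunk_kb:>4}K chunks -> roughly {repo_kb // chunk_kb} upload requests")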

Is this your question?

No, I was just trying to compare initial storage use in all the different scenarios we have so far. Something like:

duplicati (default setting) uses w GB
duplicacy (4M chunks, variable) uses x GB
duplicacy (1M chunks) uses y GB
duplicacy (128k chunks) uses z GB

@gchen: since the chunk size turns out to be such a crucial decision at the very beginning, and one which cannot be changed unless you want to start uploading from scratch, could you provide some more information about the trade-offs involved when choosing a chunk size? For example: what difference will it make if I choose a 1M vs 512k vs 128k chunk size, and whether I choose a fixed or variable chunk size?

duplicati (default setting) uses w GB
duplicacy (4M chunks, variable) uses x GB
duplicacy (1M chunks) uses y GB
duplicacy (128k chunks) uses z GB

So I think this is what you are asking for:

Duplicati: initial upload: 01:55 - 672 Mb

Duplicacy:

1st setting (4M chunks, variable): 176 chunks - initial upload: 02:03 - 691 Mb

2nd setting (1M chunks, fixed): 1206 chunks - initial upload: 02:39 - (I didn’t measure with Rclone)

3rd setting (128k chunks, fixed): 7288 chunks - initial upload: several hours - 703 Mb

Dropbox is perhaps the least thread friendly storage, even after Hubic (which is the slowest).

I ran a test to upload a 1GB file to Wasabi. Here is the result with the default 4M chunk size (with 16 threads):

Uploaded chunk 201 size 2014833, 37.64MB/s 00:00:01 99.2%
Uploaded chunk 202 size 7967204, 36.57MB/s 00:00:01 100.0%
Uploaded 1G (1073741824)
Backup for /home/gchen/repository at revision 1 completed
Files: 1 total, 1,048,576K bytes; 1 new, 1,048,576K bytes
File chunks: 202 total, 1,048,576K bytes; 202 new, 1,048,576K bytes, 1,028M bytes uploaded
Metadata chunks: 3 total, 15K bytes; 3 new, 15K bytes, 14K bytes uploaded
All chunks: 205 total, 1,024M bytes; 205 new, 1,024M bytes, 1,028M bytes uploaded
Total running time: 00:00:28

Almost no differences with the chunk size set to 1M:

Uploaded chunk 881 size 374202, 36.49MB/s 00:00:01 99.7%
Uploaded chunk 880 size 2277351, 36.57MB/s 00:00:01 100.0%
Uploaded 1G (1073741824)
Backup for /home/gchen/repository at revision 1 completed
Files: 1 total, 1,048,576K bytes; 1 new, 1,048,576K bytes
File chunks: 882 total, 1,048,576K bytes; 882 new, 1,048,576K bytes, 1,028M bytes uploaded
Metadata chunks: 3 total, 64K bytes; 3 new, 64K bytes, 55K bytes uploaded
All chunks: 885 total, 1,024M bytes; 885 new, 1,024M bytes, 1,028M bytes uploaded
Total running time: 00:00:29

But 128K chunk size is much slower:

Uploaded chunk 6739 size 127411, 14.63MB/s 00:00:01 99.9%
Uploaded chunk 6747 size 246758, 14.42MB/s 00:00:01 100.0%
Uploaded 1G (1073741824)
Backup for /home/gchen/repository at revision 1 completed
Removed incomplete snapshot /home/gchen/repository/.duplicacy/incomplete
Files: 1 total, 1,048,576K bytes; 1 new, 1,048,576K bytes
File chunks: 6762 total, 1,048,576K bytes; 6762 new, 1,048,576K bytes, 1,028M bytes uploaded
Metadata chunks: 5 total, 486K bytes; 5 new, 486K bytes, 401K bytes uploaded
All chunks: 6767 total, 1,024M bytes; 6767 new, 1,024M bytes, 1,028M bytes uploaded
Total running time: 00:01:11

OK, thanks for sharing! So it’s as I suspected at some point:

Or is the initial backup size not dependent on chunk size at all?

No, not really. But now we have that clarified: the (only) disadvantage of the larger chunk sizes is in subsequent backups of files (esp. large ones) when they are changed. Concretely: with the default settings (4MB chunks), duplicacy went from 691 MB to 1117 MB within three (very minor) revisions, while duplicati went from 672 MB to 696 MB with the same 3 revisions. We don’t know the exact figures for the 1MB chunks test, but we assume the waste of space was reduced.

So now we’re anxiously awaiting your results to see how much improvement we can get out of 128kb chunks.

Regarding another extreme scenario: lots of small files changed often, how would that be affected by small vs large and by variable vs fixed chunk size? What can be said about that without running tests?

(Copied from the discussion at https://github.com/gilbertchen/duplicacy/issues/334#issuecomment-360641594):

There is a way to retrospectively check the effect of different chunk sizes on the deduplication efficiency. First, create a duplicate of your original repository in a disposable directory, pointing to the same storage:

mkdir /tmp/repository
cd /tmp/repository
duplicacy init repository_id storage_url

Add two storages (ideally local disks for speed) with different chunk sizes:

duplicacy add -c 1M test1 test1 local_storage1
duplicacy add -c 4M test2 test2 local_storage2

Then check out each revision and back up to both local storages:

duplicacy restore -overwrite -delete -r 1 
duplicacy backup -storage test1
duplicacy backup -storage test2
duplicacy restore -overwrite -delete -r 2
duplicacy backup -storage test1
duplicacy backup -storage test2

Finally check the storage efficiency using the check command:

duplicacy check -tabular -storage test1
duplicacy check -tabular -storage test2

Dropbox is perhaps the least thread friendly storage, even after Hubic (which is the slowest).

I completely agree, but I have the space there “for free”. Dropbox has a “soft limit” of 300,000 files, and I reached that limit (of files) with just 300 Gb. So the remaining 700 Gb were wasted, and I decided to use them for backups.

For professional reasons I cannot cancel the Dropbox account now, but in the future I’m planning to move backups to Wasabi or another provider (pCloud also looks interesting). I’ve used Backblaze B2 in the past, but the features of the web interface are limited, in my opinion.

Dropbox has a “soft limit” of 300,000 files

I didn’t know this, but here is their explanation:

Out of curiosity, does anyone know if this refers only to the Dropbox application and its sync behavior, or to the API too? I was thinking about using Dropbox Pro to store a few TB, but won’t bother if there is a file limit, because Duplicacy will chew that up pretty quickly.

It sounds like you can store more than 300,000 files, but it becomes an issue if you are trying to Sync those with the App, does that sound right?


It sounds like you can store more than 300,000 files, but it becomes an issue if you are trying to
Sync those with the App, does that sound right?

Yes, it’s exactly like that. I have more than 300,000 files there, but I only sync part of them.

does anyone know if this refers only to the Dropbox application and its sync behavior,
or to the API too?

That’s a good question!