Duplicacy vs Duplicati

Does this mean I should exclude db files (does anyone have a regex?) from all my backups and create a separate backup for those (possibly one for each repository)? Not very convenient…

I think an average chunk size of 1M should be good enough for general cases. The decision to use 4M was driven mostly by the need to reduce the number of chunks (before 2.0.10 all the cloud storages used a flat chunk directory) and to reduce the overhead ratio (version 1 uploaded and downloaded with a single thread). Now that we have a nested chunk structure for all storages, and multi-threading support, it perhaps makes sense to change the default size to 1M. There is another use case where 1M did much better than 4M:

Duplicacy, however, was never meant to achieve the best deduplication ratio for a single repository. I knew from the beginning that by adopting a relatively large chunk size, we were going to lose the deduplication battle to competitors. But this is a tradeoff we had to make, because the main goal is cross-client deduplication, which is well worth the loss in deduplication efficiency on a single computer. For instance, suppose that you need to back up your Evernote database on two computers; then the storage saving brought by Duplicacy already outweighs the wasted space due to a much larger chunk size. Even if you don’t need to back up two computers, you can still benefit from this unique feature of Duplicacy – you can seed the initial backup on a computer with a faster internet connection, and then continue to run the regular backups on the computer with a slower connection.

I also wanted to add that the variable-size chunking algorithm used by Duplicacy is actually more stable than the fixed-size chunking algorithm in Duplicati, even on a single computer. Fixed-size chunking is susceptible to deletions and insertions, so when a few bytes are added to or removed from a large file, all previously split chunks after the insertion/deletion point will be affected due to changed offsets, and as a result a new set of chunks must be created. Such files include dump files from databases and unzipped tarball files.
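To illustrate (this is only a toy sketch I wrote for this post, not the actual chunking code in either Duplicacy or Duplicati, and the boundary rule below hashes a full window at every byte instead of using a real rolling hash), you can split the same data before and after a small insertion and count how many chunks change:

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/rand"
)

// fixedChunks places boundaries at multiples of size, i.e. at absolute offsets.
func fixedChunks(data []byte, size int) []string {
	var hashes []string
	for i := 0; i < len(data); i += size {
		end := i + size
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[i:end])
		hashes = append(hashes, fmt.Sprintf("%x", sum[:8]))
	}
	return hashes
}

// variableChunks declares a boundary whenever a hash of the trailing window has
// its low bits equal to zero, so boundaries depend on the bytes themselves
// rather than on absolute file offsets.
func variableChunks(data []byte, window int, mask uint32) []string {
	var hashes []string
	start := 0
	for i := window; i < len(data); i++ {
		h := sha256.Sum256(data[i-window : i])
		if binary.BigEndian.Uint32(h[:4])&mask == 0 {
			sum := sha256.Sum256(data[start:i])
			hashes = append(hashes, fmt.Sprintf("%x", sum[:8]))
			start = i
		}
	}
	sum := sha256.Sum256(data[start:])
	hashes = append(hashes, fmt.Sprintf("%x", sum[:8]))
	return hashes
}

// countNew returns how many chunks of the edited file are absent from the
// original file's chunk set, i.e. how many would have to be uploaded again.
func countNew(before, after []string) int {
	seen := make(map[string]bool, len(before))
	for _, h := range before {
		seen[h] = true
	}
	n := 0
	for _, h := range after {
		if !seen[h] {
			n++
		}
	}
	return n
}

func main() {
	// 1 MB of deterministic pseudo-random data standing in for a database file.
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(1)).Read(data)

	// Insert 10 bytes near the beginning, shifting everything that follows.
	edited := append(append(append([]byte{}, data[:1000]...), make([]byte, 10)...), data[1000:]...)

	fmt.Println("fixed 4K chunks, new after insertion:    ",
		countNew(fixedChunks(data, 4096), fixedChunks(edited, 4096)))
	fmt.Println("variable ~4K chunks, new after insertion:",
		countNew(variableChunks(data, 48, 0xFFF), variableChunks(edited, 48, 0xFFF)))
}

With fixed-size splitting, nearly every chunk after the insertion point changes; with content-defined splitting, only the one or two chunks touching the edit are new, because the remaining boundaries re-synchronize on the shifted content.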


you can seed the initial backup on a computer with a faster internet connection, and then continue to run the regular backups on the computer with a slower connection.

This is indeed an advantage. However, to benefit from it, the files must be almost identical on both computers, which is not the case with Evernote. But there are certainly applications where this kind of use makes sense.

The new initial backup with 128k chunks is finished. Took several hours and was interrupted several times, but I think it’s Dropbox’s “fault”.

It ended like this:

Files: 348 total, 892,984K bytes; 39 new, 885,363K bytes
File chunks: 7288 total, 892,984K bytes; 2362 new, 298,099K bytes, 251,240K bytes uploaded
Metadata chunks: 6 total, 598K bytes; 6 new, 598K bytes, 434K bytes uploaded
All chunks: 7294 total, 893,583K bytes; 2368 new, 298,697K bytes, 251,674K bytes uploaded

Interesting that the log above shows a total size of ~890 MB but the direct verification with Rclone in the remote folder shows 703 MB (?!).

What I’m going to do now is to run these two jobs (Duplicati and Duplicacy) for a few days of “normal” use and follow the results in a spreadsheet, so I’ll go back here and post the results.

The new initial backup with 128k chunks is finished. Took several hours and was interrupted several times, but I think it’s Dropbox’s “fault”.

I think we should retry on EOF when Dropbox closes the connection.
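Roughly this idea (only a sketch of the approach, not the actual Dropbox backend code; uploadChunk below is a hypothetical stand-in for the real upload call):

package main

import (
	"errors"
	"fmt"
	"io"
	"time"
)

// uploadChunk is a hypothetical stand-in for a storage backend's upload call.
func uploadChunk(data []byte) error {
	// ... perform the actual HTTP upload here ...
	return io.ErrUnexpectedEOF // simulate Dropbox dropping the connection
}

// uploadWithRetry retries only when the connection was closed early (EOF),
// backing off between attempts instead of hammering the server right away.
func uploadWithRetry(data []byte, attempts int) error {
	delay := time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = uploadChunk(data); err == nil {
			return nil
		}
		if !errors.Is(err, io.EOF) && !errors.Is(err, io.ErrUnexpectedEOF) {
			return err // a different failure; don't retry blindly
		}
		fmt.Printf("connection closed (%v), retrying in %v\n", err, delay)
		time.Sleep(delay)
		delay *= 2
	}
	return err
}

func main() {
	if err := uploadWithRetry([]byte("chunk data"), 3); err != nil {
		fmt.Println("upload failed:", err)
	}
}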

Interesting that the log above shows a total size of ~890 MB but the direct verification with Rclone in the remote folder shows 703 MB (?!).

The original size is 893 MB and what Rclone shows is the total size after compression and encryption.

I think we should retry on EOF when Dropbox closes the connection.

I agree, that’s what I did in this last backup and in the previous one (with 1M chunks). If the retry was made shortly after the interruption, Dropbox would not accept it, but after waiting a few minutes and running the command again, the upload worked.

The original size is 893 MB and what Rclone shows is the total size after compression and encryption.

Ah, understood! I’m using Rclone in the script right after the backups, to obtain the size of the two remote folders.

It’s a shame that neither Dropbox nor Google Drive provides an easy way to view the size of a folder. In Dropbox you have to go through several steps in the web interface, and Google Drive is even worse, where you can only see it through the quota view.

I think an average chunk size of 1M should be good enough for general cases.

So by “average chunk size” you mean that it should not be 1M fixed chunks? Is there a way of changing the default chunk size on a specific computer?

The new initial backup with 128k chunks is finished.

BTW: these are fixed size chunks, right?

It ended like this:

Files: 348 total, 892,984K bytes; 39 new, 885,363K bytes
File chunks: 7288 total, 892,984K bytes; 2362 new, 298,099K bytes, 251,240K bytes uploaded
Metadata chunks: 6 total, 598K bytes; 6 new, 598K bytes, 434K bytes uploaded
All chunks: 7294 total, 893,583K bytes; 2368 new, 298,697K bytes, 251,674K bytes uploaded

So what exactly does this mean in terms of comparison with both duplicati and duplicacy with 1MB chunks?

Or is it rather the values from Rclone that should be compared?

Or is the initial backup size not dependent on chunk size at all?

BTW: these are fixed size chunks, right?

Yes, I used the command: duplicacy init -e -c 128K -max 128K -min 128K ...
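For comparison, leaving out the explicit -min and -max should give variable-size chunking at the same average (if I remember correctly, they default to a quarter and four times of -c):

duplicacy init -e -c 128K ...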

So what exactly does this mean in terms of comparison with both duplicati and duplicacy with 1MB chunks?

Gilbert said that Duplicati uses 100k (fixed) chunks.

In my tests with Duplicacy above:

1st setting (4M chunks): 176 chunks - initial upload: 02:03

2nd setting (1M chunks): 1206 chunks - initial upload: 02:39

3rd setting (128k chunks): 7288 chunks - initial upload: several hours

Is this your question?

Or is the initial backup size not dependent on chunk size at all?

The smaller the chunks, the greater the number of chunks. The number of upload requests is then also greater, which makes the total upload more time-consuming. And in the specific case of Dropbox, it seems to “dislike” so many requests.
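As a very rough illustration (the bandwidth and per-request overhead below are invented numbers for the sake of the example, not measurements from my setup), you can see how the fixed cost per request starts to dominate once the chunks get small:

package main

import "fmt"

// A rough model only: the bandwidth and per-request overhead are assumptions
// chosen for illustration, not values measured in this test.
func main() {
	const (
		totalBytes    = 893.0 * 1024 * 1024 // ~893 MB repository
		bandwidth     = 1.0 * 1024 * 1024   // assumed effective upload: 1 MB/s
		perRequestSec = 0.3                 // assumed fixed overhead per chunk request
	)
	for _, chunkKB := range []float64{4096, 1024, 128} {
		chunks := totalBytes / (chunkKB * 1024)
		minutes := (totalBytes/bandwidth + chunks*perRequestSec) / 60
		fmt.Printf("%5.0fK chunks: ~%5.0f requests, ~%3.0f min\n", chunkKB, chunks, minutes)
	}
}

The raw transfer time is the same in every case; only the per-request overhead grows with the chunk count, which is why 1M and 4M come out close while 128K falls far behind.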

Is this your question?
No, I was just trying to compare initial storage use in all the different scenarios we have so far. Something like:

duplicati (default setting) uses w GB
duplicacy (4M chunks, variable) uses x GB
duplicacy (1M chunks) uses y GB
duplicacy (128k chunks) uses z GB

@gchen: since the chunk size turns out to be such a crucial decision at the very beginning, and one that cannot be changed unless you want to start uploading from scratch, could you provide some more information about the trade-offs involved in choosing a chunk size? For example: what difference will it make if I choose a 1M vs 512k vs 128k chunk size, and whether I choose fixed or variable chunks?

duplicati (default setting) uses w GB
duplicacy (4M chunks, variable) uses x GB
duplicacy (1M chunks) uses y GB
duplicacy (128k chunks) uses z GB

So I think this is what you are asking for:

Duplicati: initial upload: 01:55 - 672 MB

Duplicacy:

1st setting (4M chunks, variable): 176 chunks - initial upload: 02:03 - 691 MB

2nd setting (1M chunks, fixed): 1206 chunks - initial upload: 02:39 - (I didn’t measure with Rclone)

3rd setting (128k chunks, fixed): 7288 chunks - initial upload: several hours - 703 MB

Dropbox is perhaps the least thread-friendly storage, even behind Hubic (which is the slowest).

I ran a test uploading a 1GB file to Wasabi. Here is the result with the default 4M chunk size (with 16 threads):

Uploaded chunk 201 size 2014833, 37.64MB/s 00:00:01 99.2%
Uploaded chunk 202 size 7967204, 36.57MB/s 00:00:01 100.0%
Uploaded 1G (1073741824)
Backup for /home/gchen/repository at revision 1 completed
Files: 1 total, 1048,576K bytes; 1 new, 1048,576K bytes
File chunks: 202 total, 1048,576K bytes; 202 new, 1048,576K bytes, 1,028M bytes uploaded
Metadata chunks: 3 total, 15K bytes; 3 new, 15K bytes, 14K bytes uploaded
All chunks: 205 total, 1,024M bytes; 205 new, 1,024M bytes, 1,028M bytes uploaded
Total running time: 00:00:28

Almost no difference with the chunk size set to 1M:

Uploaded chunk 881 size 374202, 36.49MB/s 00:00:01 99.7%
Uploaded chunk 880 size 2277351, 36.57MB/s 00:00:01 100.0%
Uploaded 1G (1073741824)
Backup for /home/gchen/repository at revision 1 completed
Files: 1 total, 1048,576K bytes; 1 new, 1048,576K bytes
File chunks: 882 total, 1048,576K bytes; 882 new, 1048,576K bytes, 1,028M bytes uploaded
Metadata chunks: 3 total, 64K bytes; 3 new, 64K bytes, 55K bytes uploaded
All chunks: 885 total, 1,024M bytes; 885 new, 1,024M bytes, 1,028M bytes uploaded
Total running time: 00:00:29

But 128K chunk size is much slower:

Uploaded chunk 6739 size 127411, 14.63MB/s 00:00:01 99.9%
Uploaded chunk 6747 size 246758, 14.42MB/s 00:00:01 100.0%
Uploaded 1G (1073741824)
Backup for /home/gchen/repository at revision 1 completed
Removed incomplete snapshot /home/gchen/repository/.duplicacy/incomplete
Files: 1 total, 1048,576K bytes; 1 new, 1048,576K bytes
File chunks: 6762 total, 1048,576K bytes; 6762 new, 1048,576K bytes, 1,028M bytes uploaded
Metadata chunks: 5 total, 486K bytes; 5 new, 486K bytes, 401K bytes uploaded
All chunks: 6767 total, 1,024M bytes; 6767 new, 1,024M bytes, 1,028M bytes uploaded
Total running time: 00:01:11

OK, thanks for sharing! So it’s as I suspected at some point:

Or is the initial backup size not dependent on chunk size at all?

No, not really. But now we have that clarified: the (only) disadvantage of larger chunks is in subsequent backups of files (especially large ones) when they are changed. Concretely: with the default settings (4MB chunks), duplicacy went from 691 MB to 1117 MB within three (very minor) revisions, while duplicati went from 672 MB to 696 MB over the same 3 revisions. We don’t know the exact figures for the 1MB chunk test, but we assume the waste of space was reduced.

So now we’re anxiously awaiting your results to see how much improvement we can get out of 128K chunks.

Regarding another extreme scenario: lots of small files changed often, how would that be affected by small vs large and by variable vs fixed chunk size? What can be said about that without running tests?

(Copied from the discussion at https://github.com/gilbertchen/duplicacy/issues/334#issuecomment-360641594):

There is a way to retrospectively check the effect of different chunk sizes on the deduplication efficiency. First, create a duplicate of your original repository in a disposable directory, pointing to the same storage:

mkdir /tmp/repository
cd /tmp/repository
duplicacy init repository_id storage_url

Add two storages (ideally local disks for speed) with different chunk sizes:

duplicacy add -c 1M test1 test1 local_storage1
duplicacy add -c 4M test2 test2 local_storage2

Then check out each revision and back up to both local storages:

duplicacy restore -overwrite -delete -r 1 
duplicacy backup -storage test1
duplicacy backup -storage test2
duplicacy restore -overwrite -delete -r 2
duplicacy backup -storage test1
duplicacy backup -storage test2

Finally check the storage efficiency using the check command:

duplicacy check -tabular -storage test1
duplicacy check -tabular -storage test2

Dropbox is perhaps the least thread-friendly storage, even behind Hubic (which is the slowest).

I completely agree, but I have the space there “for free”. Dropbox has a “soft limit” of 300,000 files, and I reached that limit (of files) with just 300 GB. So the remaining 700 GB were wasted, and I decided to use them for backups.

For professional reasons I cannot cancel the Dropbox account now, but in the future I’m planning to move backups to Wasabi or another provider (pCloud also looks interesting). I’ve used Backblaze B2 in the past, but the features of its web interface are limited, in my opinion.

Dropbox has a “soft limit” of 300,000 files

I didn’t know this, but here is their explanation:

Out of curiosity, does anyone know if this refers only to the Dropbox application and its sync behavior, or to the API too? I was thinking about using Dropbox Pro to store a few TB, but won’t bother if there is a file limit, because Duplicacy will chew that up pretty quickly.

It sounds like you can store more than 300,000 files, but it becomes an issue if you are trying to Sync those with the App, does that sound right?


It sounds like you can store more than 300,000 files, but it becomes an issue if you are trying to Sync those with the App, does that sound right?

Yes, it’s exactly like that. I have more than 300,000 files there, but I only sync part of them.

does anyone know if this refers only to the Dropbox application and its sync behavior, or to the API too?

That’s a good question!

Well, this is probably my final test with the Evernote folder. Some of the results were strange, and I really need your help to understand what happened at the end…

Some notes:

Upload and time values were obtained from each program’s log.

Duplicati:
bytes uploaded = “SizeOfAddedFiles”
upload time = “Duration”
Duplicacy:
bytes uploaded = “bytes uploaded” in log line “All chunks”
upload time = “Total running time”

Rclone was used to obtain the total size of each remote backend.

In the first three days I used Evernote normally (adding a few notes a day, a few KB), and the result was as expected:

graph01

graph02

graph03

BUT, this is the first point I didn’t understand:

1) How does the total size of the Duplicacy backend grow faster if the daily upload is smaller (graph 1 vs. graph 2)?

Then on the 26th I decided to organize my tags in Evernote (something I had already wanted to do). So I standardized the nomenclature, deleted obsolete tags, rearranged, etc. That is, I didn’t add anything (in bytes), but probably all the notes were affected.

And the effect of this was:

graph04

graph05

graph06

That is, something similar to the first few days, just greater.

Then, on day 27, I ran an internal Evernote command to “optimize” the database (rebuild the search indexes, etc.) and the result (disastrous in terms of backup) was:

(and at the end are the other points I would like your help to understand)

graph07

graph08

graph09

2) How could the Duplicati upload have been so small (57844) if it took almost 3 hours?

3) Should I also consider the “SizeOfModifiedFiles” Duplicati log variable?

4) If the upload was so small, why has the total size of the remote grown so much?

5) Why did Duplicacy’s last upload take so long? Was it Dropbox’s fault?

I would like to clear up all these doubts, because this was a test to evaluate backing up large files with minor daily modifications, and the real objective is to identify the best way to back up a set of 20 GB folders with mbox files, some several GB in size, others only a few KB.

I look forward to hearing from you all.

(P.S.: Also posted on the Duplicati forum)

  1. How does the total size of the Duplicacy backend grow faster if the daily upload is smaller (graph 1 vs. graph 2)?
  4. If the upload was so small, why has the total size of the remote grown so much?

I’m puzzled too. Can you post a sample of those stats lines from Duplicacy here?

  5. Why did Duplicacy’s last upload take so long? Was it Dropbox’s fault?

This is because the chunk size is too small, so a lot of the overhead goes into things like establishing connections and sending request headers, rather than into sending the actual data. Since the 128K chunk size didn’t improve the deduplication efficiency by much, I think 1M should be the optimal value for your case.

I’m puzzled too. Can you post a sample of those stats lines from Duplicacy here?

Sure! So for that:

You have the full log here.

That is actually 16,699K bytes.

All chunks: 7314 total, 895,850K bytes; 236 new, 29,529K bytes, 16,699K bytes uploaded

Then, on day 27, I ran an internal Evernote command to “optimize” the database (rebuild the search indexes, etc.) and the result (disastrous in terms of backup) was:

I wonder if the variable-size chunking algorithm can do better for this kind of database rebuild. If you want to try, you can follow the steps in one of my posts above to replay the backups – basically restoring from existing revisions in your Dropbox storage and then backing up to a local storage configured to use the variable-size chunking algorithm.
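Concretely, something along these lines (the storage name and local path below are just placeholders; repeat the restore/backup pair for each revision you want to replay):

duplicacy add -c 1M variable_test variable_test /path/to/local_storage
duplicacy restore -overwrite -delete -r 1
duplicacy backup -storage variable_test
duplicacy restore -overwrite -delete -r 2
duplicacy backup -storage variable_test
duplicacy check -tabular -storage variable_test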