Duplicacy vs Duplicati

  1. Why does the total size of the Duplicacy backend grow faster if the daily upload is smaller (graph 1 vs. graph 2)?
  2. If the upload was so small, why has the total size of the remote grown so much?

I’m puzzled too. Can you post a sample of those stats lines from Duplicacy here?

  1. Why did Duplicacy’s last upload take so long? Was it Dropbox’s fault?

This is because the chunk size is too small, so a lot of overhead is spent on things like establishing connections and sending request headers rather than on sending the actual data. Since the 128K chunk size didn’t improve the deduplication efficiency by much, I think 1M should be the optimal value for your case.
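
As a rough, illustrative calculation (the half second of per-request overhead is an assumed figure, not something measured in these tests): uploading 1 GB in 128K chunks means roughly 8,000 separate requests, so half a second of fixed per-request overhead alone adds more than an hour on top of the actual transfer time, whereas with 1M chunks the same upload needs only about 1,000 requests.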

I’m puzzled too. Can you post a sample of those stats lines from Duplicacy here?

Sure! So for that:

You have the full log here.

That is actually 16,699K bytes.

All chunks: 7314 total, 895,850K bytes; 236 new, 29,529K bytes, 16,699K bytes uploaded

Then, on day 27, I ran an internal Evernote command to “optimize” the database (rebuild the search indexes, etc.) and the result (disastrous in terms of backup) was:

I wonder if the variable-size chunking algorithm can do better for this kind of database rebuild. If you want to try, you can follow the steps in one of my posts above to replay the backups – basically restoring from existing revisions in your Dropbox storage and then backing up to a local storage configured to use the variable-size chunking algorithm.

That is actually 16,699K bytes.

Sorry, the graph titles are wrong; I will correct them ASAP.

But the question remains: 1) Why does the total size of the Duplicacy backend grow faster if the daily upload is smaller (graph 1 vs. graph 2)?

I wonder if the variable-size chunking algorithm can do better for this kind of database rebuild.

But it has to support both situations: small daily changes and these kinds of “rebuilds”.

Won’t switching to variable chunks take us back to the beginning of the tests?

Sorry, the graph titles are wrong; I will correct them ASAP.

Graphs have been fixed!

  1. Why does the total size of the Duplicacy backend grow faster if the daily upload is smaller (graph 1 vs. graph 2)?

I guess that is because Duplicati compacts chunks of 100K bytes into zip files of 50M bytes by default, so the compression works better.

Won’t switching to variable chunks take us back to the beginning of the tests?

You can. And that is what I meant by “replay the backups”. You can create a new repository with the same repository id and storage url, but add a new storage with -c 128k being the only argument, then restore the repository to the revision before the database rebuild, back up to the additional storage, and finally restore the repository to the revision after the database rebuild and back up to the additional storage again. The additional storage can simply be your local disk, to save running time.
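
Just to sketch what that sequence could look like on the command line (the storage name “local_test”, the snapshot id “myrepo”, the local path, and the revision numbers are only placeholders, not values from this thread; -overwrite is included on the assumption that the repository still contains files from a later revision):

duplicacy add -c 128k local_test myrepo /path/to/local/storage
duplicacy restore -r 10 -overwrite    # revision before the database rebuild
duplicacy backup -storage local_test
duplicacy restore -r 11 -overwrite    # revision after the database rebuild
duplicacy backup -storage local_test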

I guess that is because Duplicati compacts chunks of 100K bytes into zip files of 50M bytes by default,
so the compression works better.

Maybe. However, in my case, I set the Duplicati “volume” size to 5M.

You can…

What I meant is: won’t doing a new backup with 1M variable chunks be equivalent to the first few tests above (Jan 22 6:53AM)? That is, it may even work well for the rebuild, but cause problems with minor database changes.

I know there are a lot of mysterious things going on, but one thing seems to be clear: with 128kB chunks, we are not seeing an increasing gap between Duplicati storage use and Duplicacy storage use, right?

It also seems clear/confirmed that a 128kB chunk size is not a good idea due to speed. Generally I would say I don’t care about speed, because the backup runs in the background and it doesn’t matter how long it takes. But the uploads on 26 and 27 January made me change my mind (if not the one on the 26th, then definitely the one on the 27th). If a backup can take 9 hours even though not much data has been added to the repository, then, at least on my home computer, chances are it will not complete the same day it started. And since I don’t turn on my home PC every day, it may well take a few days, perhaps even a week, until the backup completes, leading to all kinds of possible issues. Right?

I agree with both conclusions!

Gilbert, I’ll do the tests with the local backup of Evernote that you mentioned, but only in a few days; I’m a little busy right now. For now, the two backup jobs are suspended to avoid collecting data outside controlled conditions.

I’m also thinking of doing a test with local storage for the mbox files I mentioned above. I intend to do this test with 1M variable chunks; I understood that should be the best option, right?

Yes, 1M variable chunks should be fine. I suspect fixed-size chunking would work poorly for mbox files, because, if I understand it correctly, an mbox file is just a number of emails concatenated together, so a deleted or added email would cause every chunk after that point to be different.
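
As a rough illustration (the sizes are made up): if a single 10 KB email is deleted near the start of a 1 GB mbox file, everything after it shifts back by 10 KB, so with fixed-size chunks every chunk boundary after that point lands on different content and almost the whole file gets re-uploaded. With variable-size (content-defined) chunking, the boundaries are derived from the content itself, so the chunking resynchronizes shortly after the deletion and only a few chunks near the change differ.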

Duplicacy seems to me a … more mature software

I’m not so sure about that anymore: https://github.com/gilbertchen/duplicacy/issues/346
(This is very serious, IMHO)

Interesting … I’m following it there…

I agree that it is really bad if directories that are supposed to be backed up are not included in the backup.

But when Duplicacy fails to list a directory, it will give out warnings like these (copied from the GitHub issue):

Subdirectory dir1/ cannot be listed
Subdirectory dir2/ cannot be listed
File file1 cannot be opened
Backup for /Users/chgang/AcrosyncTest/repository at revision 2 completed
2 directories and 1 file were not included due to access errors

I can’t think of a reason other than permission issues that can cause Duplicacy to skip a directory or file without giving out such warnings.

You may be interested to check out this thread updated today: Backing up stuff on a different hard drive. A user is seeing a lot of “Failed to read symlink” messages reported by Duplicacy, whereas Crashplan would just silently back up the empty folders without backing up any files there.

I decided to set up a repository in GitHub to put the results of the tests I’m executing. For anyone interested: link

I dropped Duplicati when the initial backup of one of my data sets (about 1 TB in 360k files) took 48 hours to complete. Duplicacy did it in 5 hours (restic in 3 hours 20 minutes).
The subsequent backups were quite fast, but so are Duplicacy’s. Duplicati’s restore times are also quite bad compared to Duplicacy’s.


I’m curious why Duplicacy is slower than restic in your case. How many upload threads did you use? By default Duplicacy uses only one, but I don’t know about restic’s default.

I was using Duplicacy with 8 threads; I’m not sure how many restic uses, but these were my results over a week of backing up to a local SFTP server:

CPU usage and bandwidth comparison of initial backup:

Restic restore CPU and bandwidth:

Duplicacy restore CPU and bandwidth:


I wonder if 8 threads is too many for a local SFTP server. Maybe 4 or even 2 would work better because of less contention.
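
If it helps when re-testing, the upload thread count is just the -threads option on the backup command (the number here is only an example):

duplicacy backup -threads 2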

I think I tried it first with 4 and then increased it to 8 when I saw restic was faster. I’ll try it again sometime next week with only 2 and come back with the results.

We still chose Duplicacy for other, more important reasons :slight_smile: such as half the restore time, lock-free dedup, and lower backup sizes.