Comparison duplicacy + borg + restic + arq + duplicati

Christoph · 10 September 2020 19:49

Just stumbled across this:

I think it’s almost time for some kind of meta-analysis of all those #comparison s…

Not sure why duplicacy is rather slow in the real world part of the test:

Real World Benchmark - Onedrive

Item	Arq	Duplicati	Duplicacy	Restic
Target	Onedrive	Onedrive	Onedrive	Onedrive
Initial Backup	N/A	N/A	72h+*	36h
Backup Size	438Gb	438Gb	440Gb	450 Gb
Update (+0)	40s	3m30s	33s	1m34s
Update (+200mb)	4m30s	5m20s	2m32s	1m52s
Update (+2.1Gb)	30m30s	6m35s	23m6s	6m12s
Restore (1 file; 3.7Gb)	48m	3m28s	20m28s	57m
Restore (800 files; 2.7Gb)	7m30s	4m35s	15m23s	22m

towerbr · 10 September 2020 23:32

Item	Duplicacy
Version	1.30

gchen · 11 September 2020 02:34

I think it is a fair comparison when everyone backs up to HDD storage (you can think of a hard disk as cloud storage with unlimited bandwidth). If you look at these numbers, you can see Duplicacy wins almost every category:

Item	Arq 5	Duplicati	Duplicacy	Restic	Borg
Target	HDD	HDD	HDD	HDD	HDD
Initial Backup	15h	9h24m	14h20m	18h30m	12h
Backup Size	438Gb	435Gb	437Gb	447Gb	437Gb
Update (+0)	40s	1m	4s	1m04s	1m31s
Update (+200mb)	40s	1m12s	16s	12s	3m5s
Update (+2.1Gb)	2m30s	3m25s	2m13s	3m47s	4m42s
Restore (1 file; 3.7Gb)	3m30s	1m20s	1m51s	4m10s	2m30s
Restore (800 files; 2.7Gb)	2m30s	1m45s	1m16s	3m41s	1m40s
Dedup Stress Efficiency (100 * 1Mb)	-6%	-10%	95%	-5%	0%
Dedup Stress Efficiency (16 * 8Mb)	48%	-2%	96%	63%	45%
Dedup Stress Efficiency (4 * 128Mb)	98%	24%	99%	99%	98%
Prune	2h50m	78s	7s	7m25s*	40s

When the storage is on OneDrive, the results can be quite random, depending on whether the server you’re connecting to is overloaded or whether the rate limit starts to kick in, to name a few factors.

saspus · 11 September 2020 04:13

And yet, all of these metrics are absolutely irrelevant in a backup solution, in the sense that nobody should be choosing a backup solution by its speed or deduplication ratio.

Backup is by nature a background process. It’s supposed to be slow and lightweight. Nobody cares if one can back up (or corrupt the datastore) 10x faster. Yes, it’s nice that Duplicacy was and is way ahead of the competition in performance, but that is not the selling point in any way.

What does matter is stability in the general sense of the term, resilience to datastore corruption, robust handling of network interruptions, and, most importantly, a clear architecture that inspires confidence in the feasibility of a simple and robust implementation.

Heck, Duplicati does not have a stable version: 1.x is EOL and 2.0 is in permanent beta. Why is it in the list to begin with? It can never be a serious contender. An unstable backup solution is an oxymoron.

Some of the other tools mentioned create a long chain of dependent backups. The longer your backup history, the more fragile it becomes. Also a fail.

Those things need to be compared and analyzed, not how fast the app runs gzip and copies files to the local hard drive. (Leaving aside the relevance of the HDD scenario in the first place: some apps may generate predominantly sequential IO and others random IO, and the latter will get penalized unfairly. Backing up to an HDD is a bad, artificial use case; nobody should be doing that, and measuring it is therefore pointless. Backup to a cloud or storage appliance will not have such a penalty for random IO. For performance testing, a local SSD should be used if this arbitrary performance metric is of interest.)

miab_03c559c2 · 11 September 2020 05:29

(full disclaimer: I am the guy who did the comparison)

I think the value of the benchmarking work is to identify patterns / areas to focus on.

For example, on the duplicati board, there was some discussion of 'cati’s poor results in the ‘dedup stress’ scenarios (where duplicacy’s results really stand out!). The end conclusion from the 'cati folks was that while duplicati is quite poor in this respect, the real-world improvement would only show up in cases where new data is added to the middle of a file, and so wasn’t likely to be worth optimising for.

As Gilbert said, duplicacy is very strong in most areas; it was on this basis that I purchased a license. The only wish-list item I have would be to explore whether duplicacy can support larger data block sizes; something I found during testing was that Arq/Duplicacy were a bit harder to handle because the file sizes tend to be smaller (and so, all else equal, there are many more data files).

(small PS to @saspus: from my understanding, none of the tools tested uses a full/incremental approach; they all use an index/data-blocks approach, so they shouldn’t be examples of a long chain of incremental backups that is fragile to breakage in the middle).

Christoph · 11 September 2020 06:46

I guess you have a point. Speed metrics are tempting for comparisons because they are relatively easy and suggest neutrality. Thanks for reminding us what backup actually is about.

Yet, I would maybe want to delete the word “absolutely” in that statement, because if you have an initial backup with one or more terabytes to do, speed is not completely irrelevant.

Perhaps, in addition to speed, CPU (and memory) usage could be worth comparing. If duplicacy is fast but at the cost of affecting performance of the system during the backup process, that is probably not desirable. One of the reasons I started looking for a new backup solution already before Crashplan eventually forced me to was that Crashplan was just such a resource hog.

Turns out this is referring to the version of the web-ui. So the underlying cli version should be a recent one.

saspus · 11 September 2020 19:13

It may have been an exaggeration to provoke discussion :).

Normally this is never an issue, as the initial backup is a one-time thing and will eventually complete (I did my initial backup on a 2Mbps connection (1.5TB I think at that time) and I have no clue how long it took… The bandwidth utilization graph was a straight line at the top for months).
Users who move from another solution [should] still have a backup on that solution; and for new users who have just decided that their few TB probably should have been backed up all along, a few months won’t really change anything.

The backup solution should however be able to keep up with the new data added during some reasonable period – daily, or weekly, depending on your workflow. Let’s say you dump a few hundred gigabytes of photos each Sunday – then it had better be able to transfer that before the next dump. But usually the internet connection is the bottleneck, not the CPU. (At least with decently written apps. I don’t want to critique other solutions here, but I had a less than ideal experience with one of the (very expensive) tools from that table.)
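To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes decimal units, a fully saturated link and no protocol overhead, and it takes 300 GB as a stand-in for "a few hundred gigabytes"; the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400

def transfer_days(size_bytes, link_bits_per_second):
    """Days needed to push size_bytes through a fully saturated link."""
    return size_bytes * 8 / link_bits_per_second / SECONDS_PER_DAY

def required_mbps(size_bytes, days):
    """Sustained megabits per second needed to move size_bytes within `days`."""
    return size_bytes * 8 / (days * SECONDS_PER_DAY) / 1e6

# Initial backup from the post above: 1.5 TB over a 2 Mbps uplink.
initial_days = transfer_days(1.5e12, 2e6)

# Assumed weekly turnover: 300 GB that must be uploaded before the next Sunday.
weekly_mbps = required_mbps(300e9, 7)

print(f"initial backup: about {initial_days:.0f} days")
print(f"weekly dump needs about {weekly_mbps:.1f} Mbps sustained")
```

Under these assumptions the initial backup lands in the "months" range, and the weekly dump needs a sustained uplink of a few Mbps, which is a network question rather than a CPU one.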

Performance may indirectly hint at the design and code quality, but this would be even harder to quantify and fit into a nice table.

And this brings back Feature Suggestion: Limit CPU usage. Ridiculous efficiency gives duplicacy a lot of headroom to slow itself down significantly while still maintaining acceptable performance: there is no reason to hurry to do all the work in 20 seconds only to then sit idle for an hour until the next backup. Time Machine does the right thing there – it’s a local backup solution over gigabit Ethernet and yet it feeds data by teaspoons and nobody is aware it is even there.

saspus · 11 September 2020 19:35

That would give you a good target though: if performance is adequate to handle the daily data turnover, making it faster and more efficient may be a good idea if no other bugs remain to be addressed; but data safety would always be the priority, regardless of performance.

To clarify, I’m not saying that performance is not important. All else being equal, of course one should choose the more performant tool. But all else is not equal: moreover, most of that "all else" is difficult to quantify and is often not discussed at all; therefore just focusing on performance data (because of a sense that some data is better than none) will not necessarily result in the best choice. (The fact that duplicacy is fast and robust does not extend to all solutions.) (Copy to /dev/null is very fast, but hardly useful. Raw copy with folder versioning is also super fast, and you can even restore! – but also not optimal. Etc.)

‘dedup stress’ scenarios

That’s the thing. Nobody runs backup tools on “stress test datasets”. They run them on actual data, and different people have different datasets… To get a realistic picture of how a tool will back up your specific dataset, it would be way easier to just try it than to try to quantify and interpolate between synthetic benchmarks and still only get a vague idea.

Testing robustness, however, is time consuming: it means doing ridiculous tests that sabotage the backup process and the datastore to see how the backup program handles that. That experience would be universal, not dependent on the type of data users have, and helpful. But I haven’t seen any data like that published.

Not trying to downplay your work by the way – this is all good data, I’m just concerned that having only that data implies “the fastest is the best one” which is not necessarily true.

The HDD or OneDrive one?

If the latter – comparing tools by how well OneDrive resists abuse from them is hardly useful: I think duplicacy (and all other backup tools) should drop support for OneDrive, DropBox, and other *drive services because this never will, should, or can work well: (Newbie) Backup entire machine?

You are right. I thought I saw duplicity there.

miab_03c559c2 · 11 September 2020 23:35

On the dedup stress scenario; that initially wasn’t part of the test suite; I got the idea from a post in the restic forums where someone was complaining about how restic wasn’t doing dedup according to their understanding.

I agree with the duplicati guys; usually people append to the end of a file rather than adding blocks of data in the middle; but to the extent that some folks do that, the rolling hash is an important feature! And the testing would help them see that duplicacy is strong (and duplicati is hopeless) in that area.
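For anyone wondering why an insert in the middle of a file hurts so much with fixed-size blocks, here is a toy sketch. It is not the chunker of any of the tools in the table: the rolling sum, the window of 16 bytes and the 64-byte block size are assumptions picked to keep the example small. It only shows the general idea of cutting chunks where the content says so, rather than at fixed offsets.

```python
import hashlib
import random

def fixed_chunks(data, size=64):
    """Split at fixed offsets; any insert shifts every later block."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, window=16, mask=0x3F):
    """Cut where a rolling sum of the last `window` bytes has its low bits all set."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        # Only cut once the chunk is at least one window long.
        if i + 1 - start >= window and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared_fraction(old_chunks, new_chunks):
    """Fraction of the new file's bytes found in chunks the old backup already holds."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    reused = sum(len(c) for c in new_chunks if hashlib.sha256(c).digest() in known)
    return reused / sum(len(c) for c in new_chunks)

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20_000))
# Insert a few bytes early in the file, so nearly everything after it shifts.
modified = original[:1_000] + b"inserted bytes" + original[1_000:]

fixed_reuse = shared_fraction(fixed_chunks(original), fixed_chunks(modified))
cdc_reuse = shared_fraction(content_defined_chunks(original),
                            content_defined_chunks(modified))
print(f"fixed-size blocks reused:      {fixed_reuse:.0%}")
print(f"content-defined chunks reused: {cdc_reuse:.0%}")
```

With fixed blocks, only the blocks before the insert still match. With content-defined cuts, the boundaries realign shortly after the insert, so most of the file is recognised as already stored.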

gchen · 22 September 2020 18:47

@miab_03c559c2 did you use one thread when uploading to OneDrive?

I ran a test uploading 100G of random data to OneDrive Business with 4 upload threads, and here are the results:

Backup for /home/gchen/repository at revision 1 completed
Files: 5 total, 102,461M bytes; 5 new, 102,461M bytes
File chunks: 20958 total, 102,461M bytes; 20958 new, 102,461M bytes, 102,833M bytes uploaded
Metadata chunks: 3 total, 1,538K bytes; 3 new, 1,538K bytes, 1,243K bytes uploaded
All chunks: 20961 total, 102,463M bytes; 20961 new, 102,463M bytes, 102,834M bytes uploaded
Total running time: 02:45:38

If that can be extrapolated, the initial backup in your test should take about 12 hours.
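The extrapolation can be checked in a few lines. This is only a sketch: it assumes throughput stays constant over the whole upload and reads the 440Gb from the benchmark table as 440 × 1024 MB; the helper names are illustrative.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' running time into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def extrapolate_hours(sample_mb, sample_seconds, target_mb):
    """Scale a measured upload to a larger dataset at the same throughput."""
    throughput_mb_per_s = sample_mb / sample_seconds
    return target_mb / throughput_mb_per_s / 3600

sample_seconds = hms_to_seconds("02:45:38")   # total running time from the log
sample_mb = 102_461                           # file bytes backed up, in MB
target_mb = 440 * 1024                        # assumed size of the benchmark dataset

throughput = sample_mb / sample_seconds
estimated_hours = extrapolate_hours(sample_mb, sample_seconds, target_mb)
print(f"throughput: about {throughput:.1f} MB/s")
print(f"estimated initial backup: about {estimated_hours:.0f} hours")
```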

miab_03c559c2 · 22 September 2020 22:21

Gang,

The testing was based on whatever the standard defaults are when the OneDrive backend is set up; I did not change the number of upload threads used.

saspus · 23 September 2020 03:25

And this is a very important point. A recent reddit thread exemplifies the problem: Backup with client-side encryption : synology

I used the app defaults out of the box - no tweaking. I believe such apps should be sufficiently configured for most use cases out of the box and should not need adjusting.

The default must be “auto”, aka “figure out the best number of threads”, aka keep adding threads until saturation. This heuristic will work even when Duplicacy is heavily throttled. One thread, however, is rarely sufficient, and most people choose tools by whatever criteria they can measure easily, aka how fast it backs up. At worst, make the default 10.
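A minimal sketch of what "keep adding threads until saturation" could look like. Everything here is hypothetical: measure_throughput stands for some callback that uploads a short sample with the given number of threads and returns MB/s, and pick_thread_count, max_threads and min_gain are illustrative names, not Duplicacy options.

```python
def pick_thread_count(measure_throughput, max_threads=16, min_gain=0.10):
    """Add one thread at a time; stop when the gain drops below min_gain."""
    best_threads = 1
    best_rate = measure_throughput(1)
    for threads in range(2, max_threads + 1):
        rate = measure_throughput(threads)
        if rate < best_rate * (1 + min_gain):
            break  # saturated or throttled: another thread no longer pays off
        best_threads, best_rate = threads, rate
    return best_threads

# Simulated link for the demo: 2.5 MB/s per thread, capped at 10 MB/s in total.
def simulated_link(threads):
    return min(2.5 * threads, 10.0)

# Simulated heavy throttling: the server allows 1 MB/s no matter what.
def throttled_link(threads):
    return 1.0

print(pick_thread_count(simulated_link))
print(pick_thread_count(throttled_link))
```

On the simulated link the loop settles where the cap is reached, and under throttling it stays at one thread, which is the behaviour you would want from an "auto" default.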
