Comparison duplicacy + borg + restic + arq + duplicati

Just stumbled across this:

I think it’s almost time for some kind of meta-analysis of all those #comparison s…

Not sure why duplicacy is rather slow in the real-world part of the test:

Real World Benchmark - Onedrive

| Item | Arq | Duplicati | Duplicacy | Restic |
| --- | --- | --- | --- | --- |
| Target | Onedrive | Onedrive | Onedrive | Onedrive |
| Initial Backup | N/A | N/A | 72h+* | 36h |
| Backup Size | 438Gb | 438Gb | 440Gb | 450Gb |
| Update (+0) | 40s | 3m30s | 33s | 1m34s |
| Update (+200mb) | 4m30s | 5m20s | 2m32s | 1m52s |
| Update (+2.1Gb) | 30m30s | 6m35s | 23m6s | 6m12s |
| Restore (1 file; 3.7Gb) | 48m | 3m28s | 20m28s | 57m |
| Restore (800 files; 2.7Gb) | 7m30s | 4m35s | 15m23s | 22m |

| Item | Duplicacy |
| --- | --- |
| Version | 1.30 |

:thinking:


I think it is a fair comparison when everyone backs up to HDD storage (you can think of a hard disk as cloud storage with unlimited bandwidth). If you look at these numbers you can see Duplicacy wins almost every category:

| Item | Arq 5 | Duplicati | Duplicacy | Restic | Borg |
| --- | --- | --- | --- | --- | --- |
| Target | HDD | HDD | HDD | HDD | HDD |
| Initial Backup | 15h | 9h24m | 14h20m | 18h30m | 12h |
| Backup Size | 438Gb | 435Gb | 437Gb | 447Gb | 437Gb |
| Update (+0) | 40s | 1m | 4s | 1m04s | 1m31s |
| Update (+200mb) | 40s | 1m12s | 16s | 12s | 3m5s |
| Update (+2.1Gb) | 2m30s | 3m25s | 2m13s | 3m47s | 4m42s |
| Restore (1 file; 3.7Gb) | 3m30s | 1m20s | 1m51s | 4m10s | 2m30s |
| Restore (800 files; 2.7Gb) | 2m30s | 1m45s | 1m16s | 3m41s | 1m40s |
| Dedup Stress Efficiency (100 * 1Mb) | -6% | -10% | 95% | -5% | 0% |
| Dedup Stress Efficiency (16 * 8Mb) | 48% | -2% | 96% | 63% | 45% |
| Dedup Stress Efficiency (4 * 128Mb) | 98% | 24% | 99% | 99% | 98% |
| Prune | 2h50m | 78s | 7s | 7m25s* | 40s |

When the storage is on OneDrive the results can be quite random, depending on if the server you’re connecting to is overloaded or if the rate limit starts to kick in, to name a few factors.


And yet, all of these metrics are absolutely irrelevant in a backup solution; in the sense that nobody should be choosing a backup solution by its speed or deduplication ratio.

Backup is by nature a background process. It’s supposed to be slow and lightweight. Nobody cares if one can back up (or corrupt a datastore) 10x faster. Yes, it’s nice that Duplicacy was and is way ahead of the competition in performance, but that is not the selling point in any way.

What does matter is stability in the general meaning of the term, resilience to datastore corruption, robust handling of network interruptions, and most importantly a clear architecture that inspires confidence in the feasibility of a simple and robust implementation.

Heck, Duplicati does not have a stable version: 1.x is EOL and 2.0 is a permanent beta. Why is it in the list to begin with? It can never be a serious contender. An unstable backup solution is an oxymoron.

Some of the other tools mentioned create long chains of dependent backups. The longer your backup history, the more fragile it becomes. Also a fail.

Those things need to be compared and analyzed, not how fast the app is running gzip and copying files to the local hard drive. (Leaving out the relevance of the HDD scenario in the first place: some apps may generate predominantly sequential IO and others random IO, and the latter will get penalized unfairly. Backing up to an HDD is a bad artificial use case; nobody should be doing that, and measuring it is therefore pointless. Backup to a cloud or storage appliance will not have such a penalty for random IO. For performance testing, a local SSD should be used if this arbitrary performance metric is of interest.)


(full disclaimer: I am the guy who did the comparison)

I think the value of the benchmarking work is to identify patterns / areas to focus on.

For example, on the duplicati board, there was some discussion of 'cati’s poor results in the ‘dedup stress’ scenarios (where duplicacy’s results really stand out!). The end conclusion from the 'cati folks was that while duplicati is quite poor in this respect, the real-world improvement would only be in cases where new data is added to the middle of a file, and so it wasn’t likely to be worth optimising for.

As Gilbert said, duplicacy is very strong in most areas; it was based on this that I purchased a license :grinning: The only wish-list item I have would be to explore whether duplicacy can support larger data block sizes; something I found during testing was that Arq/Duplicacy backups are a bit harder to handle coz the file sizes tend to be smaller (and so the same data ends up as many more data files).

(small PS to @saspus - from my understanding, none of the tools tested use a full/incremental approach; they all use an index/data-blocks approach, so they shouldn’t be examples of a long chain of incremental backups that is fragile to breakages in the middle).
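To make the index/data-blocks point concrete, here is a minimal Python sketch. It is a toy model and not the storage format of any of the tools tested; the `ChunkStore` class, its method names and the 4-byte chunk size are made up for illustration. Each snapshot is just a list of chunk hashes, so deleting an older snapshot cannot break a newer one.

```python
import hashlib


class ChunkStore:
    """Toy index/data-blocks store (hypothetical, for illustration only)."""

    def __init__(self):
        self.chunks = {}     # chunk hash -> chunk bytes (the "data blocks")
        self.snapshots = {}  # revision number -> list of chunk hashes (the "index")

    def backup(self, revision, data, chunk_size=4):
        ids = []
        for i in range(0, len(data), chunk_size):
            piece = data[i:i + chunk_size]
            cid = hashlib.sha256(piece).hexdigest()
            self.chunks.setdefault(cid, piece)  # identical chunks are stored once
            ids.append(cid)
        self.snapshots[revision] = ids

    def restore(self, revision):
        # A restore only needs this snapshot's own index plus the chunks it names.
        return b"".join(self.chunks[cid] for cid in self.snapshots[revision])

    def prune(self, revision):
        # Drop one snapshot, then drop chunks that no remaining snapshot references.
        del self.snapshots[revision]
        live = {cid for ids in self.snapshots.values() for cid in ids}
        self.chunks = {cid: c for cid, c in self.chunks.items() if cid in live}


store = ChunkStore()
store.backup(1, b"aaaabbbbcccc")
store.backup(2, b"aaaabbbbdddd")  # shares two chunks with revision 1
store.prune(1)                    # remove the older revision entirely
restored = store.restore(2)
print(restored, len(store.chunks))
```

In a full/incremental chain, revision 2 would be stored as a delta against revision 1, and removing or losing revision 1 would make revision 2 unrestorable. In this model revision 2 carries its own complete index, so it survives the prune.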


I guess you have a point. Speed metrics are tempting for comparisons because they are relatively easy and suggest neutrality. Thanks for reminding us what backup actually is about.

Yet, I would maybe want to delete the word “absolutely” in that statement, because if you have an initial backup with one or more terabytes to do, speed is not completely irrelevant.

Perhaps, in addition to speed, CPU (and memory) usage could be worth comparing. If duplicacy is fast but at the cost of affecting performance of the system during the backup process, that is probably not desirable. One of the reasons I started looking for a new backup solution already before Crashplan eventually forced me to was that Crashplan was just such a resource hog.

Turns out this is referring to the version of the web-ui. So the underlying cli version should be a recent one.


It may have been an exaggeration to provoke discussion :).

Normally this is never an issue, as the initial backup is a one-time thing and will eventually complete (I did my initial backup on a 2Mbps connection (1.5TB I think at that time) and I have no clue how long it took… The bandwidth utilization graph was a straight line at the top for months).
Users who move from another solution [should] still have a backup on that solution; and for new users who have just decided that their few TB probably should have been backed up all along, a few months won’t really change anything.

The backup solution should however be able to keep up with the new data added during some reasonable period – daily, or weekly, depending on your workflow. Let’s say you dump a few hundred gigabytes of photos each Sunday – then it had better be able to transfer that before the next dump. But usually the internet connection is the bottleneck, not the CPU. (At least with decently written apps. I don’t want to critique other solutions here, but I had a less-than-ideal experience with one of the (very expensive) tools from that table.)
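For a rough sense of the numbers, here is a back-of-the-envelope sketch in Python. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits/s), a fully saturated link with no protocol overhead, and takes "a few hundred gigabytes" to mean 300 GB; all of those are my assumptions, not measurements.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, link_mbps):
    """Days needed to push size_bytes over a link running flat out at link_mbps."""
    bits = size_bytes * 8
    seconds = bits / (link_mbps * 1_000_000)
    return seconds / SECONDS_PER_DAY


def required_mbps(size_bytes, window_days):
    """Sustained link speed needed to move size_bytes within window_days."""
    bits = size_bytes * 8
    return bits / (window_days * SECONDS_PER_DAY) / 1_000_000


# Initial backup from the post above: 1.5 TB over a 2 Mbps uplink.
initial_days = transfer_days(1.5e12, 2)

# Weekly turnover: 300 GB (assumed figure) that must finish within 7 days.
weekly_mbps = required_mbps(300e9, 7)

print(f"initial backup: about {initial_days:.0f} days")
print(f"weekly dump needs about {weekly_mbps:.1f} Mbps sustained")
```

Under those assumptions the initial backup comes out at a bit over two months, which matches the "straight line at the top for months", and the weekly dump needs roughly 4 Mbps around the clock, so the link rather than the CPU is the limiting factor.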

Performance may indirectly hint at the design and code quality, but this would be even harder to quantify to fit in a nice table.

And this brings back Feature Suggestion: Limit CPU usage. Ridiculous efficiency gives duplicacy a lot of headroom to slow itself down significantly while still maintaining acceptable performance: there is no reason to hurry to do all the work in 20 seconds only to then sit idle for an hour until the next backup. Time Machine does the right thing there – it’s a local backup solution over gigabit Ethernet and yet it feeds data by teaspoons and nobody is aware it is even there.
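One simple way to get the "teaspoons" behaviour is to pace the data flow rather than measure CPU directly. The Python sketch below is only an illustration of that idea: `throttled_copy` and its parameters are hypothetical names, and this is not how Duplicacy or Time Machine actually implement throttling.

```python
import io
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=8 * 1024):
    """Copy src to dst, sleeping as needed so the average rate stays under the cap."""
    start = time.monotonic()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Time that *should* have passed for this many bytes at the capped rate.
        target_elapsed = copied / max_bytes_per_sec
        delay = target_elapsed - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)  # idle instead of racing ahead
    return copied


payload = b"x" * (64 * 1024)  # 64 KiB of dummy data
sink = io.BytesIO()
t0 = time.monotonic()
copied = throttled_copy(io.BytesIO(payload), sink, max_bytes_per_sec=1024 * 1024)
elapsed = time.monotonic() - t0
print(f"copied {copied} bytes in {elapsed:.3f}s")
```

Because the process spends most of its time sleeping, hashing and compression work is spread out as well, which is the practical effect a CPU limit is after. A real implementation would more likely lower process priority or cap worker threads; this is just the smallest version of the idea.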

That would give you a good target though: once performance is adequate to handle the daily data turnover, making it faster and more efficient may be a good idea if no other bugs remained to be addressed; but data safety would always be the priority, regardless of the performance.

To clarify, I’m not saying that performance is not important. All else being equal, of course one should choose the more performant tool. But all else is not equal: moreover, most of that “all” is difficult to quantify and is often not discussed at all; therefore just focusing on performance data (because of a sense that some data is better than none) will not necessarily result in the best choice. (The fact that duplicacy is fast and robust does not extend to all solutions.) (Copying to /dev/null is very fast, but hardly useful. Raw copy with folder versioning is also super fast, and you can even restore! But it is also not optimal. Etc.)

‘dedup stress’ scenarios

That’s the thing. Nobody runs backup tools on “stress test datasets”. They run them on actual data, and different people have different datasets… To get a realistic picture of how a tool will back up your specific dataset, it would be way easier to just try it than to try to quantify and interpolate between synthetic benchmarks and still only get a vague idea.

Testing robustness, however, is time consuming: it means doing ridiculous tests to sabotage the backup process and the dataset to see how the backup program handles that. That experience would be universal, not dependent on the type of data users have, and helpful. But I haven’t seen any data like that published.

Not trying to downplay your work by the way – this is all good data, I’m just concerned that having only that data implies “the fastest is the best one” which is not necessarily true.

The HDD or OneDrive one?

If the latter – comparing tools by how well OneDrive resists abuse from them is hardly useful: I think duplicacy (and all other backup tools) should drop support for OneDrive, Dropbox, and other *drive services because this never will, should, or can work well: (Newbie) Backup entire machine?

You are right. I thought I saw duplicity there.

On the dedup stress scenario; that initially wasn’t part of the test suite; I got the idea from a post in the restic forums where someone was complaining about how restic wasn’t doing dedup according to their understanding.

I agree with the duplicati guys; usually people append to the end of a file rather than adding blocks of data in the middle; but to the extent that some folks do that, the rolling hash is an important feature! And the testing would be helpful for them to see that duplicacy is strong (and duplicati is hopeless) in that area.
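For anyone wondering why inserting data in the middle of a file hurts fixed-size blocks so much, here is a toy Python sketch. It is not the chunking algorithm of Duplicacy, Duplicati or any other tool in the comparison: the rolling hash is a plain sum over the last 16 bytes, and the function names, window and mask values are made up for illustration.

```python
import hashlib
import random


def fixed_chunks(data, size=1024):
    """Cut at fixed offsets: any insertion shifts every later block."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def rolling_chunks(data, window=16, mask=0x3FF):
    """Cut wherever a rolling sum of the last `window` bytes matches `mask`."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward by one byte
        if (rolling & mask) == mask and i + 1 - start >= window:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def reused_fraction(old, new):
    """Share of the new version's bytes that sit in chunks already stored."""
    known = {hashlib.sha256(c).digest() for c in old}
    reused = sum(len(c) for c in new if hashlib.sha256(c).digest() in known)
    return reused / sum(len(c) for c in new)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(200_000))
middle = len(original) // 2
edited = original[:middle] + b"inserted bytes" + original[middle:]

fixed_reuse = reused_fraction(fixed_chunks(original), fixed_chunks(edited))
rolling_reuse = reused_fraction(rolling_chunks(original), rolling_chunks(edited))
print(f"fixed blocks reuse {fixed_reuse:.0%}, rolling hash reuses {rolling_reuse:.0%}")
```

With fixed blocks, everything after the insertion point lands in blocks that no longer match, so only the first half of the file dedups. With content-defined boundaries the cut points depend on the bytes themselves, so chunking resynchronises shortly after the insertion and nearly everything is reused. Appending to the end of a file behaves well in both schemes, which is the duplicati folks' point.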

@miab_03c559c2 did you use one thread when uploading to OneDrive?

I ran a test to upload 100G of random data to OneDrive Business with 4 upload threads and here are the results:

```
Backup for /home/gchen/repository at revision 1 completed
Files: 5 total, 102,461M bytes; 5 new, 102,461M bytes
File chunks: 20958 total, 102,461M bytes; 20958 new, 102,461M bytes, 102,833M bytes uploaded
Metadata chunks: 3 total, 1,538K bytes; 3 new, 1,538K bytes, 1,243K bytes uploaded
All chunks: 20961 total, 102,463M bytes; 20961 new, 102,463M bytes, 102,834M bytes uploaded
Total running time: 02:45:38
```

If that can be extrapolated, the initial backup in your test should take about 12 hours.
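The extrapolation can be checked with a few lines of Python. This assumes the upload rate stays constant and treats the 438Gb source from the comparison as 438,000 of the same "M" units the log reports; that unit conversion is my assumption, and the result only moves by a few percent either way.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


uploaded_mb = 102_461                     # "Files: ... 102,461M bytes" from the log
elapsed_s = hms_to_seconds("02:45:38")    # "Total running time" from the log
rate_mb_per_s = uploaded_mb / elapsed_s   # average throughput with 4 threads

dataset_mb = 438 * 1000                   # assumption: 438Gb treated as 438,000M
estimated_hours = dataset_mb / rate_mb_per_s / 3600

print(f"throughput: {rate_mb_per_s:.1f} MB/s")
print(f"estimated initial backup: {estimated_hours:.1f} hours")
```

That works out to a bit over 10 MB/s and just under 12 hours, in line with the estimate above and well below the 72h+ from the benchmark.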

Gang,

The testing was based on whatever the standard defaults are when the OneDrive backend is set up; I did not change the number of upload threads used.

And this is a very important point. A recent reddit thread that exemplifies the problem: Backup with client-side encryption : synology

I used the app defaults out of the box - no tweaking. I believe such apps should be sufficiently configured for most use cases out of the box and should not need adjusting.

The default must be “auto”, aka “figure out the best number of threads”, aka keep adding threads until saturation. This heuristic will work even when Duplicacy is heavily throttled. One thread, however, is rarely sufficient, and most people choose tools by whatever criteria they can measure easily, aka how fast it backs up. At worst, make the default 10.
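A rough Python sketch of what such an "auto" heuristic could look like. This is not an existing Duplicacy feature: `pick_thread_count`, `measure_throughput`, the 5% gain threshold and the simulated link are all hypothetical, and a real version would measure actual upload throughput over a short probing period.

```python
def pick_thread_count(measure_throughput, max_threads=32, min_gain=0.05):
    """Add threads one at a time until throughput stops improving by min_gain."""
    best_threads = 1
    best_rate = measure_throughput(1)
    for threads in range(2, max_threads + 1):
        rate = measure_throughput(threads)
        if rate < best_rate * (1 + min_gain):
            break  # saturated: the extra thread bought (almost) nothing
        best_threads, best_rate = threads, rate
    return best_threads


def simulated_link(threads):
    """Made-up backend: 2.5 MB/s per connection, 10 MB/s total cap."""
    return min(threads * 2.5, 10.0)


chosen = pick_thread_count(simulated_link)
print(f"auto-selected {chosen} upload threads")
```

With per-connection throttling like the simulated one, the loop stops at the point where the overall cap is reached (4 threads here), and the same logic would settle on 1 thread for a backend that gains nothing from parallelism.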