Hi,
I’ve been trying out various backup solutions that I hope can replace CrashPlan in a few months. The field has been narrowed down to three candidates: Duplicacy, Arq and Hashbackup. While I like Duplicacy a lot, I’ve seen some puzzling issues.
Some basic information first:
1. I want to back up to the cloud. (All three can also do local backups; this is just to make clear the focus of my evaluation.)
2. My cloud backup destination of choice post-CrashPlan would be Google Drive, since I get unlimited storage through my university G Suite subscription.
3. My ISP contract bandwidth is 100 Mbps down / 40 Mbps up.
4. I’m backing up from a Mac, though I do have a Linux home server and may consider moving my cloud backup operation to it someday.
5. I plan to conduct my backups from a single source machine (Mac now, maybe Linux later).
6. The data to be backed up is about 4+ TB at the moment: 10 GB of documents and other small files, 20 GB of mail, 50 GB of photos, 500 GB of music, and 3.5 TB of video. Only the first two sets (documents and mail) see regular daily changes; the others come in bunches after travels or when I have time to rip more CD/DVD/BDs.
Now, the questions:
Duplicacy’s main attraction is its well-crafted (and well-explained) lock-free deduplication design, which enables backing up multiple sources to the same destination and maximizes dedup efficiency. It also promises fast backup/restore operations. The former isn’t that important to me for the time being, due to reason 5 above, but is a nice feature nonetheless.
In my tests, Duplicacy is indeed fast when the backup storage is on a local disk. When I tried to restore a directory of some 20 MB from Google Drive, however, Duplicacy spent more than 40 min producing the file list for the revision I asked for, and then another 40+ min restoring the data. For comparison, both Arq and Hashbackup did both jobs in seconds.
Since I haven’t uploaded everything yet with Duplicacy (the folder on Google Drive reserved for Duplicacy backups stands at 552 GB), and have conducted only about 20 rounds of backups, it worries me how long a restore will take down the road.
After repeated tests (with similar results), I found Duplicacy spent a long time pulling down some 3+ GB of data into the cache (under the hidden .duplicacy folder in the repository) before producing the file list, and then another 3+ GB into a second cache folder under the target directory when restoring. The two cache folders were bit-identical, so Duplicacy obviously didn’t reuse the first cache folder during the second step. Is there a good reason for this?
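In case anyone wants to reproduce the comparison, a minimal sketch like the one below hashes every file in both cache folders; the two paths are just placeholders, not Duplicacy’s exact cache layout.

```python
# Minimal sketch: hash every file under two cache folders and compare.
# The two paths below are placeholders for the caches observed during
# the "list" step and the restore step respectively.
import hashlib
from pathlib import Path

def hash_tree(root):
    """Return {relative path: sha256 digest} for every file under root."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

list_cache = hash_tree("repo/.duplicacy/cache")       # filled while listing the revision
restore_cache = hash_tree("target/.duplicacy/cache")  # filled again during the restore
print("identical" if list_cache == restore_cache else "different")
```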
Supposing that’s a bug that can be easily corrected, it still doesn’t explain the long restore time. When backing up, I usually see 160-250 MB/min upload speeds, so the download speed should be at least double that. Pulling down 3+ GB of data should take less than 20 min on a bad day. What was the other 20+ min spent on?
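Spelling out the arithmetic, with the download rate assumed to be twice the observed upload rate:

```python
# Back-of-envelope: time to pull down the ~3 GB of cache data,
# assuming downloads run at roughly twice the observed upload rate.
upload_rates = (160, 250)        # MB/min, observed while backing up
cache_mb = 3 * 1024              # the 3+ GB pulled into the cache
for up in upload_rates:
    down = 2 * up
    print(f"{cache_mb / down:.0f} min at {down} MB/min download")
# -> roughly 6-10 min, well under 20 min, leaving 20+ min unaccounted for
```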
The directory I tried to restore was “~/Library/Application Support”. It’s not large in size (20 MB or so), but it changes constantly, so each backup round picks up some changes. It seems to me Duplicacy was taking time to figure out which chunks to download.
If that’s the case, it’s a common issue for all dedup backup software. Arq & Hashbackup can nevertheless turn to their local databases for that information, while Duplicacy has to pull it from the backup storage. When that storage is local, performance doesn’t suffer, but when it’s in the cloud, the impact can be quite severe unless you have a very fat pipe. (100/40 Mbps is pretty fat already, at least in my country.)
I wonder if this is something that can be alleviated by a larger cache, one large enough to hold all the chunk allocation info locally.
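For a sense of scale, here’s a back-of-envelope estimate of that chunk metadata, assuming a 4 MB average chunk size (which I believe is Duplicacy’s default) and 32-byte chunk hashes; both figures are assumptions on my part.

```python
# Rough estimate of chunk-reference metadata for a ~4 TB backup.
# Assumptions (mine): 4 MB average chunk size, 32-byte chunk hashes.
total_tb = 4
avg_chunk_mb = 4
hash_bytes = 32

num_chunks = total_tb * 1024 * 1024 // avg_chunk_mb   # ~1 million chunks
hashes_mb = num_chunks * hash_bytes / (1024 * 1024)   # the hash list alone
print(f"~{num_chunks:,} chunks, ~{hashes_mb:.0f} MB of chunk hashes")
# -> on the order of a million chunks and only tens of MB of raw hashes,
#    though per-file metadata in the snapshots would add to that
```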
That brings us to the next issue: the overhead of Duplicacy’s operation seems to be substantially greater than that of Arq & Hashbackup. This was an accidental discovery from a pair of tests designed to gauge dedup efficiency. The first involves two Calibre ebook library folders:
A: 2145 total items (folders and files), total size 2GB.
B: 2036 total items, total size 1.7GB.
B is essentially a subset of A, with only 4 files (totaling less than 2MB) different from A.
If I back up A first, then A + B, with Duplicacy, the total size of the backup storage grows from 1773 MB to 2030 MB, an increase of roughly 256 MB, much larger than the 2 MB difference in the original data. Arq, Hashbackup & Borg (a candidate before being eliminated), in comparison, saw their total storage grow by less than 2 MB (1.6, 0.8 & 1.5 MB respectively).
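For reference, the growth figures are simply the destination folder’s total size measured after each backup round; a small helper along these lines (with a placeholder path) does the measurement.

```python
# Sum the on-disk size of the backup destination, measured once after
# backing up A and once after backing up A + B; the path is a placeholder.
import os

def folder_size_mb(root):
    """Sum the sizes of all files under root, in MB."""
    total = 0
    for dirpath, _, names in os.walk(root):
        for name in names:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / (1024 * 1024)

# Compare the two measurements: ~256 MB growth with Duplicacy here,
# versus <2 MB for Arq, Hashbackup and Borg.
print(f"{folder_size_mb('/path/to/backup-storage'):.0f} MB")
```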
Since my curiosity was piqued, I conducted another experiment with two much larger folders of music files:
C: 3358 items, total size 64.5 GB.
D: 3062 items, total size 62.6 GB.
Again, D is a subset of C (the files are in fact hardlinks, taking up no additional space on the source disk), with only 3 different files (two text files and a .xlsx file) totaling less than 1 MB. Backing up C first, then C + D, Duplicacy’s total storage grew from 61 GB to 74 GB, an increase of almost 13 GB! Hashbackup & Borg again saw trivial increases: 1.4 & 2 MB respectively. (Arq wasn’t tested because my trial period was up.)
The backup time suffered because of it, too; it was tolerable only because the experiment was conducted on local HDs. Uploading 13 GB to Google Drive would take me somewhere between 50 and 80 min. Restoring from the cloud would suffer for the same reason, and caching all this data locally (as alluded to earlier) would require a lot of space.
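The 50-80 min figure is just the 13 GB increase divided by the upload speeds I normally see:

```python
# 13 GB of extra chunks at the observed 160-250 MB/min upload rates
extra_mb = 13 * 1024
for rate in (250, 160):
    print(f"{extra_mb / rate:.0f} min at {rate} MB/min")
# -> roughly 53-83 min, i.e. the 50-80 min mentioned above
```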
There are other issues and questions, but since this post is way too long already, I’ll stop here. Thanks for reading if you’ve made it this far.