Considering Duplicacy with Wasabi

Hi,

I am presently a CrashPlan Home user who, like so many others, is searching for a replacement solution. I have multiple Linux boxes and a single Windows box, with a little over 3TB of data across all of them that I need to back up. My shortlist of storage providers presently includes Wasabi and B2. My shortlist of software solutions includes Duplicacy, Duplicati and CloudBerry.

My primary question is whether Duplicacy plus Wasabi can be relied upon to produce configure-and-forget backups that run continuously in the background, and then be further relied upon to produce valid restores in the case of a failure. I see some issues related to this combination when searching Google. Is this a bad/flawed combination?

More generally, I want to set up backups and have confidence both that they are happening and that they will be usable in the event of an unexpected failure. How does Duplicacy plus ??? compare to the other alternatives mentioned in terms of reliability?

How/why is Duplicacy a better choice than Duplicati? CloudBerry?

TIA!

Duplicacy is a more stable and mature product than Duplicati. I tested Duplicati before adopting Duplicacy. The big problem with Duplicati is its use of a local database to store backup information; you can see from its forum that there are numerous reports of corrupted databases.

I use Duplicacy CLI. It took me some time to learn all the nuances of the settings and nomenclature (somewhat confusing), but it has been running perfectly with Wasabi for some time now.

I have no experience with CloudBerry, sorry.

In addition to what @towerbr mentioned, one major advantage of Duplicacy is the ability to back up multiple computers to the same storage location. This can result in huge savings in storage costs if several of your computers share a lot of identical files. This is a unique feature of Duplicacy; no other tools can do this, including Duplicati and CloudBerry.

@towerbr, @gchen, Thanks!

@towerbr, any suggested resources that were particularly helpful to come up the learning curve?

@gchen, do the multiple computers need to be backed up to the same ‘bucket’ in order for these savings to be possible? Or can you still have separate buckets for each machine with Duplicacy managing it somehow? Similarly, do the files need to be ‘named’ and/or ‘pathed’ identically in order to trigger the savings, or does it work on hashing or something else to find dupes?

@towerbr, any suggested resources that were particularly helpful to come up the learning curve?

Unfortunately, nothing specific. The #how-to section is your best initial reference.

I started with very simple scripts and then added parameters later. It was a very empirical process.

However I have some tips:

  1. Use the centralized configuration; it is much easier to manage than the decentralized configuration (see -pref-dir under the Init command details). (However, not everyone here agrees with me … lol)

  2. When creating your repositories, rename the first “default” entry in the preferences file to a clearer name, to make it easier to use the -storage <storage name> option (see the sketch after this list).

  3. The include/exclude patterns are slippery; study them before applying them (see Filters/Include exclude patterns).
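
To illustrate tips 1 and 2, here is a rough sketch (the path, repository id and storage name are just placeholders, and your preferences file may look slightly different):

duplicacy init -pref-dir /home/me/duplicacy-config mybox wasabi://bucket

Then open the preferences file that was created and change the storage entry named “default” to something clearer, e.g. “wasabi”. From then on you can be explicit about which storage a command targets:

duplicacy backup -storage wasabi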

Similarly, do the files need to be ‘named’ and/or ‘pathed’ identically in order to trigger the savings,
or does it work on hashing or something else to find dupes?

I think you should study a little more how Duplicacy works; a good pass through the wiki would be worthwhile. Duplicacy does not back up whole files, but chunks of files.

do the multiple computers need to be backed up to the same ‘bucket’ in order for these savings to be possible? Or can you still have separate buckets for each machine with Duplicacy managing it somehow? Similarly, do the files need to be ‘named’ and/or ‘pathed’ identically in order to trigger the savings, or does it work on hashing or something else to find dupes?

Yes, they should be backed up to the same bucket (and the same destination directory) for the cross-computer deduplication to take effect. No, the files don’t need to have the same name or the same path relative to their own repository root. Duplicacy breaks down files into chunks, and whenever there are identical chunks, the deduplication mechanism will kick in.

Any thoughts on Wasabi vs B2 or others? I’m looking for rock-solid reliability at an affordable price. Wasabi’s free API/egress options appear compelling, but perhaps I’m missing something.

Yes, they should be backed up to the same bucket (and the same destination directory) for the cross-computer deduplication to take effect. No, the files don't need to have the same name or the same path relative to their own repository root. Duplicacy breaks down files into chunks, and whenever there are identical chunks, the deduplication mechanism will kick in.

Based upon the above then, if I want to be able to see and independently restore to/from the different machines, what is the recommended approach? Tags? Multiple snapshot names with the same destination?

You just need to use different repository ids on different machines.

On machine 1:

duplicacy init id1 wasabi://bucket

On machine 2:

duplicacy init id2 wasabi://bucket
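
Each machine then just runs its backups as usual, for example (a sketch; -stats simply prints a summary at the end):

duplicacy backup -stats

Chunks that another machine has already uploaded to the bucket are detected and skipped, which is where the cross-computer deduplication savings come from.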

I’m sorry for the basic questions, but I’m struggling with filters and a few other concepts a bit:

  1. Are filters required? What I mean is, if no filters are provided in either the .duplicacy/filters file or on the command line, will the default be to include all dirs/files recursively in the repository?

  2. Given the following filters contents:

+<parent dir>
+<parent dir>/<subdir>
+<parent dir>/<subdir>/*

Does this imply that all contents and all child dirs recursively under <subdir> will be backed up? What if <parent dir>/<subdir>/<grand-child dir> is later added following the initial backup? Will its contents also be recursively backed up even if the filters file is not modified?

  3. If they are required, what is the simplest way to achieve the behavior described above? It seems like adding every single path including parents could become quite cumbersome and error prone - especially if you forget to modify filters every time you add a new dir!

  4. Once you have init-ed a repository and run a backup, what command can you use to list all files contained in a specified dir from the repository? The history command appears to allow me to show info for a given dir, but there does not appear to be a way to show similar information for the dir contents.

  5. Similarly, what command would you use to restore, for example, a given repository subdir and all of its contents recursively?

  6. If I have a rather large repository that I anticipate will take many days, perhaps weeks, to complete the initial backup, what happens if that initial backup is interrupted - perhaps by an unanticipated power loss or required system reboot - before it completes? Will it have to start all over? Will it recover gracefully? Any important command options to include or special commands to use with this in mind?

  7. It appears as if the global options must appear before the command (backup, history, etc.), while the non-global options follow the command. Is that correct? I was surprised to see that order mattered this much.

  8. When using the -background option, is it possible to query where the credentials are coming from/stored? Is it possible to remove them? Update them?

TIA!

  1. Are filters required? What I mean is, if no filters are provided in either the .duplicacy/filters file or on the command line, will the default be to include all dirs/files recursively in the repository?

Yes - if no filters are provided, everything in the repository is included by default.

  2. Given the following filters contents:
    +<parent dir>
    +<parent dir>/<subdir>
    +<parent dir>/<subdir>/*
    Does this imply that all contents and all child dirs recursively under <subdir> will be backed up? What if <parent dir>/<subdir>/<grand-child dir> is later added following the initial backup? Will its contents also be recursively backed up even if the filters file is not modified?

Not sure what you meant by “if <grand-child dir> is later added following the initial backup”?

  3. If they are required, what is the simplest way to achieve the behavior described above? It seems like adding every single path including parents could become quite cumbersome and error prone - especially if you forget to modify filters every time you add a new dir!

I would suggest using exclude patterns instead. Only exclude certain subdirectories that you don’t want to back up, so new dirs will be included by default.
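
For example, a filters file along these lines (the directory names are purely hypothetical) excludes a few subdirectories relative to the repository root and lets everything else - including any directories added later - be backed up by default:

-tmp/
-cache/
-downloads/scratch/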

  4. Once you have init-ed a repository and run a backup, what command can you use to list all files contained in a specified dir from the repository? The history command appears to allow me to show info for a given dir, but there does not appear to be a way to show similar information for the dir contents.

You can run the list command and then filter the output:

duplicacy list -files | grep path/to/some/dir
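
To limit the listing to a particular revision, the -r option can be added, e.g. (sketch):

duplicacy list -r 2 -files | grep path/to/some/dir
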
  5. Similarly, what command would you use to restore, for example, a given repository subdir and all of its contents recursively?

restore takes exclude/include patterns as arguments:

duplicacy restore -r 1 -- +/subdir* -*
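
If the restore needs to happen on a different machine, my understanding is that you first init a repository there with the same id and storage URL, and then run the restore; roughly (a sketch; -overwrite allows existing files to be replaced):

duplicacy init id1 wasabi://bucket
duplicacy restore -r 1 -overwrite -- +/subdir* -*
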
  6. If I have a rather large repository that I anticipate will take many days, perhaps weeks, to complete the initial backup, what happens if that initial backup is interrupted - perhaps by an unanticipated power loss or required system reboot - before it completes? Will it have to start all over? Will it recover gracefully? Any important command options to include or special commands to use with this in mind?

An initial backup can be fast-resumed. You can try it yourself.

  7. It appears as if the global options must appear before the command (backup, history, etc.), while the non-global options follow the command. Is that correct? I was surprised to see that order mattered this much.

That is right.

  8. When using the -background option, is it possible to query where the credentials are coming from/stored? Is it possible to remove them? Update them?

The -background option was mainly designed for the GUI version. The recommended way to force reentering passwords is to run duplicacy list -reset-passwords.

2) Given the following filters contents: +<parent dir> +<parent dir>/<subdir> +<parent dir>/<subdir>/* Does this imply that all contents and all child dirs recursively under <subdir> will be backed up? What if <parent dir>/<subdir>/<grand-child dir> is later added following the initial backup? Will its contents also be recursively backed up even if the filters file is not modified?

Not sure what you meant by “if <grand-child dir> is later added following the initial backup”?

My apologies - that wasn’t worded clearly. The question was supposed to say: What if <parent dir>/<subdir>/<grand-child-dir> is later added after the initial backup? Will the grand-child-dir contents also be recursively backed up even if filters is not modified?

8) When using the -background option, is it possible to query where the credentials are coming from/stored? Is it possible to remove them? Update them?

The -background option was mainly designed for the GUI version. The recommended way to force reentering passwords is to run duplicacy list -reset-passwords.

I’d still like to know if it is possible to query where the credentials are coming from/stored please?

Yes, <parent dir>/<subdir>/<grand-child-dir> matches the pattern +<parent dir>/<subdir>/*, so it and its contents will be included.

You can run duplicacy -d list and there will be some log messages showing how credentials are read.

3) If they are required, what is the simplest way to achieve the behavior described above? It seems like adding every single path including parents could become quite cumbersome and error prone - especially if you forget to modify filters every time you add a new dir!

I would suggest using exclude patterns instead. Only exclude certain subdirectories that you don't want to back up, so new dirs will be included by default.

Unfortunately, in my scenario, I will need to use positive/include filters in some places. Given that requirement, what is the simplest filter combination to get a given subdir and all of its recursive contents included - even if further grandchildren are added later, after the filters are set up?

The patterns you showed above should work:

+<parent dir>
+<parent dir>/<subdir>
+<parent dir>/<subdir>/*

Since no exclude patterns are specified, any paths not matching these patterns will be excluded. Any files/dirs under <parent dir>/<subdir>/ will be included because of the third pattern.

This wiki page explains in detail how include/exclude patterns work.
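
To make that concrete with a hypothetical layout - say the repository contains a photos directory - the filters file would be:

+photos
+photos/2023
+photos/2023/*

With these patterns, photos/2023/trip/img001.jpg is covered by the third pattern (so a trip directory added after the initial backup is picked up automatically), while photos/2022/ matches no include pattern and is therefore skipped.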

My experimentation so far has led me to my next basic question. I have read the wiki page regarding the use of -hash with the backup command, understand its purpose, and have even experimented with it locally, so I can see the resulting effects.

My question though is in terms of best practice… It seems clear that the safest/most conservative approach would be to always back up with the -hash flag enabled - especially when backing up to cloud storage (Wasabi). However, this obviously entails quite a bit more processing overhead - especially for a large repository. What are the best-practice recommendations in this regard? Do most people opt for the cautious approach and pay the overhead premium, or do most opt out of -hash and use some other mechanism to increase confidence (periodic checks with the -files option or something along those lines)? Or, for modern Linux and/or Windows file systems, is -hash just effectively unnecessary overkill in practice?

Hi skidvd - would you mind if I ask you a question?

What are you thinking -hash will accomplish for you as far as reliability goes? I don’t use it, as it doesn’t seem that important to me. It seems to me that -hash is just a different way to verify that a file has changed. I guess there are other issues I am more concerned about reliability-wise than this…

Do you think I am understanding things wrong?

Thanks!!

Since a file’s timestamp is updated automatically whenever the file is modified, I think it is ok to run the backup without the -hash option. Very rarely would any software modify a file and then deliberately roll the timestamp back to the previous one.
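
If the residual worry remains, one pragmatic pattern is to run the normal timestamp-based backup frequently and add an occasional -hash run; a sketch using cron on a Linux box (the schedule and log path are placeholders):

# frequent incremental backups, timestamp-based change detection
*/15 * * * * cd /path/to/repository && duplicacy backup >> /var/log/duplicacy.log 2>&1
# weekly re-hash of every file in the repository
0 3 * * 0 cd /path/to/repository && duplicacy backup -hash >> /var/log/duplicacy.log 2>&1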

@kevinvinv, @gchen, no, I think your understanding is in line with mine. I suppose reliability was a very poor word choice. What I had in mind with that post was the potential for frequently (but very minimally) updated files. I have been concerned that some may occasionally slip through the cracks - especially with a frequent backup interval (perhaps every 15 min or so, to approximate continuous backup like CrashPlan). I had thought that a full hash computation on each file might offer some greater assurance that nothing was being missed. Perhaps this is just paranoia?

I guess there are other issues I am more concerned about reliability wise than this...

As a rock-solid, 100% reliable backup (and restore) solution is my ultimate concern as well, I’m very curious what you have in mind and what you may be doing to address it?

Hi skidvd

My personal opinion is that I just back up once per day. I did that with CrashPlan too. I don’t personally want the thing crunching on my CPU every 15 minutes all day. But I see why you do, and I would not criticize that decision…

My reliability concern with Duplicacy is server-side corruption. It doesn’t do anything to make sure the backups are restorable and the chunk files are not corrupted.

You can make sure all the chunk files for a given snapshot exist… and that is pretty good, but you can’t easily (from the server side… as in a local NAS etc.) make sure the individual chunks have not been corrupted. The only way to verify a chunk’s integrity is to download it back and then verify it. That is too costly.

CrashPlan could always verify backup integrity because it was running a server-side app that could always verify checksums or hashes or whatever… but Duplicacy doesn’t do that… instead it basically trusts that the server won’t corrupt the stored backup… a pretty good assumption in general, but that is indeed what I worry about.

gchen is planning on adding some hooks to make server-side chunk integrity verification more possible, but they haven’t arrived yet.

Hi kevinvinv,

Yes, you raise a very good point - one of the double-edged swords I have been considering in my evaluation process… Duplicati, for example, downloads some random chunks with each backup for this express purpose. However, as you noted, this is a rather expensive option - especially if you are charged for egress and/or API calls, as is likely the case with providers other than Wasabi.

I am anxious to learn more about gchen’s plans and timeline in this regard.

So, if one were to periodically run check commands (with the -files option), say perhaps weekly, this will ensure snapshots have all required chunks. However, I’m not clear on how a server-side-only solution (I think you are referring to the repository source here - correct?) could be made to verify that a chunk is in fact safely transported and stored on the remote cloud storage. I suppose it could verify a chunk checksum/hash before transport, to ensure it was created reliably - is this what you are getting at? However, doesn’t that leave an opening for errors to be introduced (and, more importantly, missed without any means to verify) during transport and storage at the remote destination? Do Wasabi or other cloud-based storage providers provide any mechanism to calculate and retrieve a checksum/hash of a file in their storage, for comparison to what was sent?
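
For reference, the commands I have in mind are along these lines (my understanding only - happy to be corrected):

duplicacy check
duplicacy check -files

The first only confirms that every chunk referenced by the snapshots exists in the storage; the second actually verifies file contents, which means downloading chunks and is therefore the egress-heavy option we are discussing.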