Trying to use Duplicacy Web, filters couldn't be more unintuitive

Hi all. I am trying to test the web edition again after a few years, and I have been struggling for half an hour to make it work on macOS. I set the root directory to /Volumes and tried several filter combinations seen in various threads here, but it just won’t back up anything. I always get “ERROR SNAPSHOT_EMPTY No files under the repository to be backed up”. How does this work?

If for example I want to back up /Volumes/Data2/Nextcloud, the following - suggested in the guide and in some threads - doesn’t work:

+Data2/
+Data2/Nextcloud/*
-*

But I have tried really all sorts of combinations without success. How come this is so complicated?

I was about to give up already and move on to some other tool before deciding to ask here.

Any help would be much appreciated.

Finally got it working like this:

+Data2/Nextcloud/*
+Data2/
-*

What you got working is wrong, but it works accidentally because you have set up the repository in /Volumes, which you should not do. Instead, you probably want to create a repository elsewhere and symlink the things you want to back up into it (see the sketch below).
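
For example, a minimal sketch of that approach; the paths are purely illustrative:

# Create a dedicated repository directory and symlink only what should be backed up.
# Paths are examples only – adjust to your layout.
mkdir -p ~/duplicacy-repo
ln -s /Volumes/Data2/Nextcloud ~/duplicacy-repo/Nextcloud

# Point Duplicacy (CLI or Web) at ~/duplicacy-repo as the repository root.
# Duplicacy follows first-level symlinks, so with this layout no filters are needed.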

Since you are on macOS, there is no reason to use filters at all. You can use the Time Machine xattr-based exclusion mechanism instead (rough sketch below).
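
A minimal sketch of what that looks like, assuming the CLI and, if I remember correctly, the -exclude-by-attribute backup option; the path is just an example:

# Mark a folder with the Time Machine exclusion xattr (it travels with the folder)
tmutil addexclusion /Volumes/Data2/Nextcloud/cache

# Verify the attribute is present
xattr -l /Volumes/Data2/Nextcloud/cache

# Have duplicacy honor that attribute when backing up
duplicacy backup -exclude-by-attribute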

Lastly, you can’t just copy the Nextcloud data folder. You should go through the proper procedure outlined here: Backup — Nextcloud Administration Manual (roughly sketched below), and then you can back up the exported data with duplicacy if you want. I’m not sure you even need versioning for that data: there is no reason to ever restore the Nextcloud state to a previous version, since Nextcloud provides its own versioning.
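
The gist of that procedure, as a very rough sketch – the web-server user, database name and paths below are illustrative assumptions, so follow the manual for your actual setup:

# Put Nextcloud into maintenance mode so files and database stay consistent
sudo -u www-data php occ maintenance:mode --on

# Dump the database (MySQL/MariaDB example) and copy the data directory
mysqldump --single-transaction -u nextcloud -p nextcloud > /backup/nextcloud-db.sql
rsync -a /var/www/nextcloud/data/ /backup/nextcloud-data/

# Leave maintenance mode, then back up /backup with duplicacy
sudo -u www-data php occ maintenance:mode --off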

Thanks for the reply, but to be honest I lost interest. It’s not intuitive enough and I just find it weird. Since I opened this thread I started testing Kopia and I am sold. It is ridiculously easy to use, and the speed is mind-blowing. Same data, same computer, same Gigabit connection: the max I could get with Duplicacy was around 28–30 MB/sec (I tried several thread settings), while Kopia maxes out my connection at 100–120 MB/sec. The difference in usability and performance is huge.

Kopia appears to be a superset of duplicacy’s features, based on the same idea of content-addressable storage (CAS), but it’s heavily over-engineered and feels like a playground for its developers’ egos. I played with it a few years ago, managed to corrupt the datastore twice, and was unable to recover. (The fact that they still use Electron to render a one-page UI is worthy of a separate rant.)

Advantage of duplicacy is simplicity. If something breaks — it’s trivial to fix.

I did not observe a performance difference, but if it works better for you – sure. Just make sure you test it thoroughly in all scenarios, including interrupted transfers and partially corrupted datastores.

I agree that duplicacy’s filters are way too complex for no benefit in return. But this is counterbalanced by the lack of any real need to use them.

Mmmmhhhhh, Kopia… attractive on paper but technically immature.

I have never managed to make it work over SSH; it does not want to accept my configuration. (The SSH connection problems do not seem to be a priority at all for the developers – the issue has been open for more than two years and nothing is moving forward.)

I had been doing regular tests on a local hard drive for the past few years, before giving up three months ago after yet another irreparable corruption of the backups.

Saspus is right: duplicacy is not perfect, but it is very solid, and if the backup storage does get corrupted, it is enough to isolate the affected snapshot. I have never had a case of irrecoverable data corruption with duplicacy.

Remember that speed is not the main factor when it comes to backups; reliability is. And this type of issue (which I also encountered with Kopia):

… actually undermines the perception of reliability, rather than enhancing it.

One feature of Duplicacy that is often overlooked, and which I consider one of its key strengths, is that it doesn’t use a database. This eliminates a major point of failure (such as issues with indexing, corruption, or synchronization). In Duplicacy, the “database” is the filesystem itself.
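
To make that concrete, a duplicacy storage is just a directory tree that any file or object backend can hold; roughly (simplified layout):

storage/
  config            storage parameters (chunk size, encryption, ...)
  chunks/           content-addressed chunk files, fanned out by hash prefix
  snapshots/
    <snapshot-id>/
      1, 2, 3, ...  one small manifest file per revision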

Update: I have done some extensive research and decided to forget about Kopia and retry Duplicacy, since I have read several reports saying that Kopia is unreliable, while I couldn’t find any reliability issues reported about Duplicacy.

That, plus the facts that 1) I understand filters now, so that’s no longer an issue, and 2) I figured out the problem with the speed! I was using the default Storj configuration (native integration with the satellite, etc.). I have switched to the S3 gateway and it’s now maxing out my Gigabit connection, just like Kopia did.

I have also enabled erasure coding after reading that it can help when there are problems with corrupted data, and disabled encryption because Storj is already encrypted.

It’s now backing up incredibly fast so fingers crossed. Once the initial backup is done I will do some test restores to see the speed and how it works, before deciding whether to purchase a license or switch back to Arq.

As for the speed, for me it’s important. If incremental backups are fast, I can back up more often and thus be able to restore a version at more points in time. But even more important is restore speed, because if a disaster happens and I need to restore everything, I can’t afford to wait for days.

QQ: is there a way to use a webhook with something like healthchecks.io using the web edition? So I can get notified only when the backup or other operation fails.

Also, how often is it recommended to perform the check operation? Does it matter even with erasure coding enabled?

Thanks!

A few comments.

You mean you were using the native integration and switched to the S3 gateway?

Native integration can provide MASSIVE performance, vastly exceeding that of S3, because it communicates directly with the storage nodes. It also provides end-to-end encryption; the key never leaves your machine.
The drawback is that you need a MASSIVE amount of compute power and upload bandwidth, because of the heavy computation required for sharding and encoding, and the 2.5x upstream amplification. With a 1 Gigabit upstream you can expect at most about 400 Mbps of useful data upload (1000 Mbps ÷ 2.5), by design.

On the other hand, if you had unlimited upstream bandwidth – you could reach crazy speeds. This is not needed for backup, however, so S3 is a good choice. Having a limited upstream connection is another good reason to use S3.

If you want to use S3 and still maintain end-to-end encryption, you can run your own Storj S3 gateway on a cloud instance you control. But this is completely unnecessary, because duplicacy supports encryption itself.

To get even better performance, increase the duplicacy chunk size. The default is 4 MB. The closer you get to 64 MB, the faster everything will work and the more you save on segment fees.
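
For the CLI this is set at init time. A hypothetical example follows; the bucket, region, snapshot ID and exact size syntax are assumptions to double-check against duplicacy init -help:

# Initialize against the Storj S3 gateway with duplicacy's own encryption (-e)
# and a larger average chunk size; names and region are placeholders.
cd ~/duplicacy-repo
duplicacy init -e -c 16M my-backups s3://us-1@gateway.storjshare.io/my-duplicacy-bucket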

With S3 you are communicating with the gateway, which communicates with the storage nodes on your behalf. This means the gateway has to have your passphrase (as part of the access grant), which it protects with the S3 secret that it also knows. So the data is no longer end-to-end encrypted; the gateway can theoretically see it.

Turn this off. It’s a 100% waste of money. Storj encrypts everything, so if anything is corrupted it will simply fail to decrypt. You either get your file back – or nothing. Storj cannot return bad data by design. It also uses erasure coding itself, storing data in 80 shards, any 29 of which are enough to fully recover the data. So another layer of erasure coding is a 100% waste.

Incremental backups transfer so little data that it does not matter how fast they go. You can back up every 15 minutes if you want – but you should not. If your data changes every 15 minutes, you are probably better off with source control and/or other project-specific tracking tools. There is no need to back up too often. But you can, if you want to, of course.

On the contrary. If a disaster happens, you don’t need to restore everything immediately. You’ll restore what you need today. And if it takes two months to restore the rest – why is that a problem?

Since you are an Arq user, you are probably backing up to Glacier Deep Archive – and you already know that getting stuff back FAST is EXPENSIVE, but if you are not in a hurry it’s quite cheap.

This is not the case with Storj – the rate is flat – but being so preoccupied with backup and restore speed is missing the point a bit. Backup is by definition a background task that provides insurance. You never expect to need it, so focusing on the performance of a use case that should never happen is strange. What is important is how little it affects your other, more important tasks.

I’m, on the contrary, running duplicacy under a CPU throttler, because I don’t want it to go full speed and finish the backup in 5 minutes with the fans blaring. I want it to work in the background, slowly, without any impact. The fact that duplicacy is very performant lets me throttle it deeper and still manage to back up daily. I also use Arq, also throttled via its own CPU slider, and Time Machine, which is already throttled by Apple. Each backup takes 5 hours, and I’m fine with that. It’s in the background and does not affect me in any way.
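
For reference, one generic way to throttle the CLI – not necessarily what I run, just an illustrative sketch; cpulimit is a third-party tool (e.g. brew install cpulimit) and the numbers are arbitrary:

# Run the backup at low CPU priority (built in)
nice -n 19 duplicacy backup -threads 1

# Or cap it harder with a third-party throttler
cpulimit -l 30 duplicacy backup -threads 1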

I don’t use the web UI, but I remember discussions on the forum about healthchecks integration; you may be able to find them…
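
As a workaround, here is a minimal wrapper sketch for the CLI; the ping URL is a placeholder, and healthchecks.io’s documented /fail suffix signals a failure:

#!/bin/sh
# Hypothetical wrapper: ping healthchecks.io depending on the backup result.
URL="https://hc-ping.com/your-check-uuid"

if duplicacy backup -stats; then
    curl -fsS -m 10 --retry 3 "$URL" > /dev/null
else
    curl -fsS -m 10 --retry 3 "$URL/fail" > /dev/null
fi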

Check without parameters only ensures that the chunks the file manifests refer to are still present. It does not check the integrity of the chunks (for that you need to add the -chunks flag) nor the restorability of files (for that you need to add the -files flag; in both cases it’s very expensive). So it essentially checks datastore consistency, and protects against duplicacy bugs resulting in mismanagement of the chunk lifecycle.
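
In CLI terms, the three levels look like this (using the flags mentioned above):

duplicacy check            # verify that all referenced chunks exist in the storage
duplicacy check -chunks    # also download and verify the integrity of every chunk
duplicacy check -files     # verify every file of every revision (most expensive)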

Another job of check is to generate the statistics files that Duplicacy Web uses to display the plots on the dashboard. So if you want to see the plots – you’ll have to run check periodically.

I run neither check nor prune at all. However, it’s a good policy to periodically try to restore some small subset of data just to ensure that the backup still works – both from a technical perspective and from an admin perspective: do you still have access to the keys, passcodes, etc.?

Glad that I saw your reply now before finishing the initial backup!

I am gonna redo it now with encryption enabled. I was under the wrong assumption that I would benefit from e2e encryption even when using the S3 gateway. I will now enable encryption and disable erasure coding. I was already using the bigger chunk size.

As for the cost, I actually have free credits for a whopping 5 years with Storj. A nice person from Storj contacted me because she liked my open source work and recognized my name in their customer list, so she gave me credits for 5 years :)

With Wasabi, should I enable erasure coding? Also, is it recommended to have a separate backup to another storage, or just to use the copy feature?

Thanks!

The backup completed, but when I try to restore, it doesn’t list any files to restore… what could be the problem? It did back up 400 GB according to the logs.

I forgot to mention that the backup was interrupted by a connection drop, and then I restarted it. No revision showed any files available to restore, and then the check command showed that there were missing chunks. I couldn’t find any way to repair the backup. What should I do in this case? It seems quite fragile to me if a connection issue can cause this. For now I have deleted all the snapshots and am creating another one, but I am worried…

It’s very hard to debug if you deleted everything.

Did you at least save the logs? Please post them here. The snapshot revision is uploaded at the very end, so if the backup was interrupted before that, the snapshot won’t be there. You would need to run the backup again; it will skip the already-uploaded data, and once it’s done it will upload the snapshot revision.

Then you can see it and restore.
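
In CLI terms, roughly (the revision number and pattern are placeholders):

# Re-running the backup resumes: chunks already in the storage are skipped,
# and the snapshot revision is written only at the very end.
duplicacy backup -stats

# The new revision should then be listed and restorable
duplicacy list
duplicacy restore -r 1 "Nextcloud/*"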

Something is not right here, though: if there were no revisions, the check command would have had nothing to check and therefore could not have complained about missing chunks.

When you reconfigured everything, presumably with the same name and in the same folder – did you clear the local cache? Maybe that’s the problem? I can’t think of anything else.