Cipher: message authentication failed

My scheduled prunes and checks are all of a sudden failing, and now my backups also seem to be affected:

The last prune failed like this:

2021-04-26 05:11:35.804 INFO PRUNE_NEWSNAPSHOT Snapshot NAS_christoph revision 162 was created after collection 3
2021-04-26 05:12:38.173 WARN DOWNLOAD_RETRY Failed to decrypt the chunk 678c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493: cipher: message authentication failed; retrying
2021-04-26 05:12:39.980 WARN DOWNLOAD_RETRY Failed to decrypt the chunk 678c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493: cipher: message authentication failed; retrying
2021-04-26 05:12:41.023 WARN DOWNLOAD_RETRY Failed to decrypt the chunk 678c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493: cipher: message authentication failed; retrying
2021-04-26 05:12:41.912 ERROR DOWNLOAD_DECRYPT Failed to decrypt the chunk 678c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493: cipher: message authentication failed
Failed to decrypt the chunk 678c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493: cipher: message authentication failed

The last check failed like this:

2021-04-24 06:13:46.340 INFO SNAPSHOT_CHECK All chunks referenced by snapshot NAS_christoph at revision 161 exist
2021-04-24 06:13:49.543 WARN DOWNLOAD_RETRY Failed to decrypt the chunk 678c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493: cipher: message authentication failed; retrying
2021-04-24 06:13:50.921 WARN DOWNLOAD_RETRY Failed to decrypt the chunk 678c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493: cipher: message authentication failed; retrying
2021-04-24 06:13:52.185 WARN DOWNLOAD_RETRY Failed to decrypt the chunk 678c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493: cipher: message authentication failed; retrying
2021-04-24 06:13:53.073 ERROR DOWNLOAD_DECRYPT Failed to decrypt the chunk 678c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493: cipher: message authentication failed
Failed to decrypt the chunk 678c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493: cipher: message authentication failed

What’s going on?

I checked the storage for the chunk that it fails to decrypt and I see nothing unusual. It is 887.7 kB in size and located in the correct folder (67).
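For reference (assuming Duplicacy's default, non-flat chunk layout), the folder name is just the first two characters of the chunk ID, so the file I'm looking at should be something like this (the storage mount path is a placeholder):

    /path/to/storage/chunks/67/8c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493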

This chunk is corrupted. What storage is this?

This is WebDAV storage.

How does a chunk get corrupted, and how can I prevent it?

Quick update: last night’s backup went through without errors, so it seems that backups are not affected. But what about check and prune? How do I fix the corrupted chunk?

Unless that same corrupted chunk is still referenced… which is a strong possibility. The backup won’t give any errors, so you’ll need to make sure to resolve that before assuming subsequent backups are in good nick.

A quick fix is to delete the corrupted chunk, run a normal check and delete any bad snapshot revisions that use it.
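Roughly, with the CLI, that looks something like this (the snapshot ID and revision number are placeholders taken from the logs above; use whatever the check actually reports):

    duplicacy check -a                          # after deleting the chunk, this reports which revisions reference it
    duplicacy prune -id NAS_christoph -r 162    # delete a revision that still references the deleted chunk
    duplicacy check -a                          # confirm nothing references the missing chunk any more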

Or you could try to recreate the chunk by running a backup on the same repository but with a fresh, temporary snapshot ID. It’s not guaranteed to work, however.
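There’s no dedicated command for this, so the following is only a rough sketch of one way to do it (assuming the default .duplicacy/preferences location; the ID temp-rescue is made up, and the change should be reverted afterwards):

    cd /path/to/repository
    cp .duplicacy/preferences .duplicacy/preferences.bak   # keep the original settings
    # edit the "id" field in .duplicacy/preferences to a temporary value such as "temp-rescue"
    duplicacy backup -hash    # -hash rereads every file, so all chunks are regenerated and uploaded if absent
    # afterwards, restore preferences.bak and prune the temp-rescue snapshot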

So local cache has nothing to do with this?

Last time, when I had zero-size chunks, that approach didn’t work, but I’ll try again. The only problem is that I’ll have to figure out anew exactly how to do this. Not exactly a turn-on-and-forget backup solution…

@gchen, given how relatively often this and similar problems with corrupted chunks occur, wouldn’t it make sense to add options with which you can:

  1. force duplicacy to re-upload a particular chunk?
  2. delete all snapshots containing a specific chunk?

I can add a new backup option to skip the step that builds the known-chunk cache from the last revision, so that every chunk needed by the new backup will be uploaded if it doesn’t already exist in the storage.

As for the second option (deleting all snapshots containing a specific chunk), I won’t do this. We should try to prevent this situation from happening instead. If the storage server is your own, turn on Erasure Coding to avoid potential corruption caused by disk errors. If it is a cloud storage, try to regenerate the chunk, compare the new one with the corrupted one, and show them to the cloud provider.
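For a new storage, Erasure Coding is enabled when the storage is initialized; a rough sketch (the snapshot ID, WebDAV URL and the 5:2 data/parity split below are example values, not recommendations):

    cd /path/to/repository
    duplicacy init -e -erasure-coding 5:2 NAS_christoph webdav://user@server.example.com/duplicacy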


So the idea is that in a situation like mine, you would manually delete the corrupted chunks in order to get them re-uploaded (hopefully)?

That sounds like an improvement (though from a user perspective it’s obviously sub-optimal; I suppose from a developer perspective this is the easiest option).

If those corrupted chunks were corrupted because duplicacy did something wrong, or if duplicacy could do something to prevent the corruption, I would see your point. But since neither of these seems to be the case, I think you’ll have to elaborate on the reasoning behind this point.

How exactly would I do this?

So is this something you can implement any time soon?

I’ll work on this next week.

OK, let me know when it’s implemented and I’ll try it out with my broken chunk. Or wait: as I look at this again, I realize that this probably won’t solve my problem in this case, because I have already tried the approach suggested above (a backup with a fresh, temporary snapshot ID), and the missing chunk was not recreated. So the new feature probably won’t lead to a different result.

So I guess the only solution is to delete the corrupted chunk, run a check, and delete any snapshot revisions that reference it, as suggested earlier.

Is this correct, @gchen?

I was able to fix the problem by

  1. manually deleting the corrupted chunk directly on the storage
  2. running check
  3. searching the check log for “missing” and noting down which revisions in which snapshots had missing chunks
  4. manually deleting those snapshots on the storage
  5. running check again to confirm that it worked

So while this worked, I’m not sure if there is a more professional way of doing it without fiddling with the storage directly. If there is, please post the instructions below. Otherwise I will get back to this post and follow my own instructions whenever this happens again…
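For my own future reference, roughly the same steps expressed as commands (the mount path, snapshot ID and revision range are placeholders; only the chunk ID is the real one from this thread):

    # 1. delete the corrupted chunk (assuming the WebDAV storage is mounted locally)
    rm /mnt/storage/chunks/67/8c1453b36e44663c330d20a2f5a93c4722180108c94fce0a5e1ef1827ae493
    # 2 + 3. run a check and note every revision reported with missing chunks
    duplicacy check -a | grep -i missing
    # 4. delete those revisions with prune instead of touching the storage directly
    duplicacy prune -id NAS_christoph -r 150-162
    # 5. run check again to confirm
    duplicacy check -a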

But even though I now have relatively easy instructions to follow (at least in cases where only one or a few chunks are affected), I still think duplicacy could do better at helping the user resolve corrupted/missing chunks.

Does anyone have comments on what I said above (about elaborating the reasoning for not adding such options)?

I didn’t explain my reasoning well, but here is my point: corrupted chunks should happen only rarely, and I don’t want to provide a convenient option for fixing rare situations, because such an option can easily be misused, causing even more damage.

Is this pcloud? If this happens again you should enable Erasure Coding.

Yes, this is pcloud. Even if this only happens once a year, it is enough of a nuisance to want to fix it.

Are you saying that pcloud supports Erasure Coding, or is Erasure Coding provider-independent? But enabling Erasure Coding means I’m basically starting over and losing my previous backups, right?

While I have encountered a corrupted chunk on a rare occasion or two before, and had to take care of deleting it manually, more frequently I’ve had situations where Duplicacy/Vertical Backup created a lot of 0-byte chunk/snapshot files.

Usually this is a result of running out of disk space. Now obviously I should monitor disk space and alert before this becomes a problem, but one of the other issues I’m dealing with on one headless system is Duplicacy (CLI) sometimes not pruning as much as it should. I can’t yet pinpoint why this issue is occurring; only that the logs fill up with 'Marked fossil …' lines and nothing else (no deleted fossils, no fossil collection saved at the end). As a result, the disk fills up after a couple of months and I have to perform an -exclusive -exhaustive prune, and everything is right again. Lots of disk space freed, pruning starts working again.
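For reference, the recovery prune mentioned here is simply the following; run it only while nothing else (no backup, no other prune) is touching the storage, since -exclusive assumes exclusive access:

    duplicacy prune -exclusive -exhaustive    # removes unreferenced chunks and leftover fossils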

IMO Duplicacy needs an ability to self-heal a backup storage - particularly with 0-byte chunks/snapshots - and possibly provably bad chunks. Otherwise the user needs to manually tinker with said storage - an added pain if it’s cloud-based, since a different tool is necessary - and mistakes can happen.

Instead of deleting bad chunks, could it not rename them to .bad when the user specifically asks it to -heal during a check operation?

Then, after a heal, the recommended strategy might be to run the next backup with a new flag (I saw it mentioned recently) that skips loading the chunk cache from the previous backup, doesn’t assume all chunks are present, and explicitly checks the storage before each upload attempt.

This is better than changing the snapshot ID to force a complete re-upload and optionally changing it back, as the user may want to keep the same ID. And better than the user manually deleting chunks from a storage in order to get back up and running again.

Also, the logging related to whether a chunk is file data or metadata could be improved, as the possible remedies differ depending on which is which.



@gchen Is this still planned? (I’m having the same issue again…)

Sorry, I didn’t implement that. I’ll do it after the memory optimization PR is merged.


4 posts were merged into an existing topic: Memory Usage