(My problem is apparently the same as in “On check: Failed to decrypt the chunk <hash> cipher: message authentication failed; retrying”, but I didn’t want to revive a Feb 2018 thread.)
I back up various things to Google Drive, using 6 different backups that share the same storage target (i.e. they use the same encryption password).
And I run `check` after each backup, but never ran `check -chunks` before, because my chunks folder on Google Drive is 5.6 TB.
A few days ago, I started running a `check -chunks -r <latest_revision>` on all 6 backups (sequentially).
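For context, the sequential checks were driven by something along these lines (a minimal sketch, not my actual script; the repository paths, the revision number and the log location are placeholders):

```sh
#!/bin/sh
# Sketch: run `check -chunks` on the latest revision of each repository,
# one after another. Paths and the revision number are placeholders.
for repo in /backups/photos /backups/documents /backups/mail; do
    cd "$repo" || exit 1
    # -chunks downloads and decrypts every chunk referenced by the revision,
    # instead of only verifying that the chunk files exist on the storage;
    # -threads 20 matches the 20 download threads mentioned further below.
    duplicacy check -chunks -threads 20 -r 100 2>&1 | tee "/tmp/check_$(basename "$repo").log"
done
```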
That found 4 corrupted chunks (plus a 5th I am still working to fix, so let’s ignore that one for now).
Most of them (I think 4 out of 5) show the following error:
```
2020-10-06 02:49:45.834 WARN DOWNLOAD_RETRY Failed to decrypt the chunk 6e134f4ea72cee3f8c4d64373f9ebeb60f706a0e43422ba97f627f49a7505f67: cipher: message authentication failed; retrying
2020-10-06 02:49:48.033 WARN DOWNLOAD_RETRY Failed to decrypt the chunk 6e134f4ea72cee3f8c4d64373f9ebeb60f706a0e43422ba97f627f49a7505f67: cipher: message authentication failed; retrying
2020-10-06 02:49:49.609 WARN DOWNLOAD_RETRY Failed to decrypt the chunk 6e134f4ea72cee3f8c4d64373f9ebeb60f706a0e43422ba97f627f49a7505f67: cipher: message authentication failed; retrying
2020-10-06 02:49:51.054 WARN DOWNLOAD_DECRYPT Failed to decrypt the chunk 6e134f4ea72cee3f8c4d64373f9ebeb60f706a0e43422ba97f627f49a7505f67: cipher: message authentication failed
```
I was able to recreate those chunks by renaming them to `<hash>.trash` on Google Drive, changing the backup `id` in the `preferences` file, and executing `backup` once. (I cancelled the backup as soon as the 4 chunks were uploaded, then changed the `id` value in `preferences` back.)
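In shell terms the procedure was roughly the following (a sketch rather than the exact commands; the backup id `my-backup` is a placeholder, and the rename itself was done in the Google Drive web UI):

```sh
# 1. On Google Drive, rename the corrupted chunk out of the way:
#    chunks/6e/<hash>  ->  chunks/6e/<hash>.trash   (done via the web UI)

# 2. Temporarily change the "id" field in .duplicacy/preferences so the next
#    backup runs under a brand-new backup id and re-uploads whatever chunks
#    are missing from the storage ("my-backup" is a placeholder):
sed -i.bak 's/"id": "my-backup"/"id": "my-backup-tmp"/' .duplicacy/preferences

# 3. Start a backup and cancel it (Ctrl-C) once the missing chunks are uploaded:
duplicacy backup -stats

# 4. Restore the original id:
mv .duplicacy/preferences.bak .duplicacy/preferences
```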
Some of the corrupted chunks were created in 2018, the others in April and July of this year:
```
-rw------- 1 gb staff 6872040 11 Sep 2018 6e/134f4ea72cee3f8c4d64373f9ebeb60f706a0e43422ba97f627f49a7505f67.trash
-rw------- 1 gb staff 6947622 4 Dec 2018 d7/c6ac07da0ae8196474aa0c25798aa87577beb4e4e11a8ade0f978af78aa3c2.trash
-rw------- 1 gb staff 16693798 12 Apr 04:59 0b/fc236e0d5ba29dd6d81a9fd5e449824a1b81c68742c350d183128963638e5d.trash
-rw------- 1 gb staff 3735846 7 Jul 11:47 f8/e966fe94683f58f72bb7f78ad2ea17ef1269d0495d185acf85c069a5a9f096.trash
```
One of the chunks (`6e`) has a different file size; the others are the same size before (corrupted) and after (OK). Here are the file sizes of the OK chunks:
```
-rw------- 1 gb staff 6872102 5 Oct 08:52 6e/134f4ea72cee3f8c4d64373f9ebeb60f706a0e43422ba97f627f49a7505f67
-rw------- 1 gb staff 6947622 5 Oct 11:22 d7/c6ac07da0ae8196474aa0c25798aa87577beb4e4e11a8ade0f978af78aa3c2
-rw------- 1 gb staff 16693798 5 Oct 11:26 0b/fc236e0d5ba29dd6d81a9fd5e449824a1b81c68742c350d183128963638e5d
-rw------- 1 gb staff 3735846 5 Oct 10:45 f8/e966fe94683f58f72bb7f78ad2ea17ef1269d0495d185acf85c069a5a9f096
```
I compared the hexdump of the corrupted and OK chunk `6e` (the one with a different file size):
```
$ hexdump -C 6e134f4ea72cee3f8c4d64373f9ebeb60f706a0e43422ba97f627f49a7505f67.trash | head -3
00000000 64 75 70 6c 69 63 61 63 79 00 a2 88 0c a0 dd 19 |duplicacy.......|
00000010 f9 5d df d9 79 f3 ef 7e 44 a9 06 1f e6 10 1d ef |.]..y..~D.......|
00000020 be ed 18 7a eb 4a c7 5f d3 4f 2e d0 a5 47 7c 47 |...z.J._.O...G|G|
$ hexdump -C 6e134f4ea72cee3f8c4d64373f9ebeb60f706a0e43422ba97f627f49a7505f67 | head -3
00000000 64 75 70 6c 69 63 61 63 79 00 e5 fb 23 72 64 26 |duplicacy...#rd&|
00000010 56 24 02 36 3e b5 a9 9c 26 45 45 f7 c8 cf 4e cc |V$.6>...&EE...N.|
00000020 aa 38 6f 77 ce f0 58 51 1c d0 0f 5f b9 04 c4 b5 |.8ow..XQ..._....|
```
Then I compared the beginning of the hexdump of the 3 other corrupted chunks against their OK counterparts. They seem to differ in the same way:
```
$ hexdump -C d7c6ac07da0ae8196474aa0c25798aa87577beb4e4e11a8ade0f978af78aa3c2.trash | head -3
00000000 64 75 70 6c 69 63 61 63 79 00 a2 c8 fb 6a 59 23 |duplicacy....jY#|
00000010 90 03 ec 06 a3 e3 dc 36 f4 e4 87 cb 70 e6 62 3b |.......6....p.b;|
00000020 0c b0 25 69 e0 e3 9c 37 3c ec 1b ad 7d c0 0e fb |..%i...7<...}...|
$ hexdump -C d7c6ac07da0ae8196474aa0c25798aa87577beb4e4e11a8ade0f978af78aa3c2 | head -3
00000000 64 75 70 6c 69 63 61 63 79 00 17 cd ac a2 75 38 |duplicacy.....u8|
00000010 c6 cd e6 41 d6 d0 0d 0d fc 15 0b 2e 56 f3 ce c6 |...A........V...|
00000020 80 b5 c3 bb 89 49 2d 0c f6 c8 f3 53 86 12 c5 e2 |.....I-....S....|
$ hexdump -C 0bfc236e0d5ba29dd6d81a9fd5e449824a1b81c68742c350d183128963638e5d.trash | head -3
00000000 64 75 70 6c 69 63 61 63 79 00 f5 06 ee 57 34 02 |duplicacy....W4.|
00000010 e1 a5 d4 d4 2b 58 64 7c 3d fb 38 49 84 44 c9 f4 |....+Xd|=.8I.D..|
00000020 2c 83 00 4f 81 5c 50 7d 1b 8d 4c f1 4a 52 d5 fd |,..O.\P}..L.JR..|
$ hexdump -C 0bfc236e0d5ba29dd6d81a9fd5e449824a1b81c68742c350d183128963638e5d | head -3
00000000 64 75 70 6c 69 63 61 63 79 00 05 c8 db e3 55 47 |duplicacy.....UG|
00000010 55 cb ac 73 c4 15 43 6c bd 72 ca 80 26 fd e8 3d |U..s..Cl.r..&..=|
00000020 86 e7 bb b5 3f be 46 49 9d 8f 07 96 66 11 9a 2e |....?.FI....f...|
$ hexdump -C f8e966fe94683f58f72bb7f78ad2ea17ef1269d0495d185acf85c069a5a9f096.trash | head -3
00000000 64 75 70 6c 69 63 61 63 79 00 60 d4 d9 53 01 88 |duplicacy.`..S..|
00000010 75 f6 13 5f 45 da d3 ee 2f 58 bf 6b dd d6 f0 7c |u.._E.../X.k...||
00000020 c5 c0 f2 1f 9f d2 d2 29 18 00 22 4a a3 90 2d ec |.......).."J..-.|
$ hexdump -C f8e966fe94683f58f72bb7f78ad2ea17ef1269d0495d185acf85c069a5a9f096 | head -3
00000000 64 75 70 6c 69 63 61 63 79 00 16 86 68 12 a7 71 |duplicacy...h..q|
00000010 55 0d df b3 2a f3 0a 40 6b d7 ab db 43 bb 52 b2 |U...*..@k...C.R.|
00000020 fd 2a d1 60 24 79 97 82 23 21 71 3a af e7 2d c8 |.*.`$y..#!q:..-.|
```
So right after the common 10-byte `duplicacy` header, the two copies are already completely different… Not sure what could have caused this corruption…
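To pin down exactly where the two copies diverge, `cmp` can be run on local copies of the corrupted and re-uploaded chunk (a quick sketch, shown here for the `6e` chunk):

```sh
# Print the position of the first differing byte between the two copies.
cmp 6e134f4ea72cee3f8c4d64373f9ebeb60f706a0e43422ba97f627f49a7505f67.trash \
    6e134f4ea72cee3f8c4d64373f9ebeb60f706a0e43422ba97f627f49a7505f67
# Expected output is something like "differ: char 11, line 1", i.e. the copies
# diverge immediately after the 10-byte "duplicacy" header, consistent with
# the hexdumps above.
```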
I checked “Manage versions” on Google Drive for the 4 failed chunks; all of them show just one version (i.e. they were written once and never changed afterwards).
I have 740,000 chunk files (in my most recent revisions). 4 out of 740,000 is a pretty low number, but it is still worrying, because the only way to notice that something is wrong is to re-download everything.
Sadly, I didn’t keep the backup logs from when those 4 chunks were created (I only kept 1 month of logs… I’ve changed that to keep all logs now.)
I will try to run a `check -chunks` for all revisions, not just the latest. Maybe I’ll be “lucky” and find a corrupted chunk created in the last month… But it took more than 36 hours to complete the `check -chunks` of just the latest revisions (with 20 threads, totalling 20-25 MB/s); I’m not sure how long it will take to check all revisions.
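(Back-of-the-envelope, assuming essentially all 5.6 TB of chunks has to be downloaded at the same 20-25 MB/s: 5,600,000 MB / ~22 MB/s ≈ 250,000 seconds, i.e. somewhere around 60-80 hours, or roughly 3 days of continuous downloading.)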
Ideas or pointers on how this could be debugged further?
Thanks.