Azure Storage - Network Errors Corrupt the Restore Process

Please describe what you are doing to trigger the bug:

This bug is readily reproducible, although it confused me at first. When I am restoring my large backup (80-100 TiB), any network issue (e.g. a read timeout) corrupts the restore process. The corruption manifests as subsequent chunks starting to error out, until the process eventually crashes after ~1 hour.

Please describe what you expect to happen (but doesn’t):

When a network error occurs, my expectation is that any interrupted / in-flight work is purged and restarted without error, ideally with some kind of exponential backoff. At a minimum, the process shouldn't be corrupted.

Please describe what actually happens (the wrong behaviour):

The failure manifests in messages like “Failed to decrypt the chunk dcdf824930f8debed7e87dc828186eec1b789458e0c32006720692f0dfb9c02a: cipher: message authentication failed; retrying” or “Failed to decrypt the chunk ee94639e9d7048ef9d09c3ad5d440710f1bd4482132acfc317ea983ec7096d78: The storage doesn’t seem to be encrypted”.

The issue can be worked around by killing the process and restarting it whenever a network issue occurs.
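
For what it's worth, the workaround can be scripted as a wrapper that re-runs the restore with increasing delays between attempts. This is only a sketch of the kill-and-restart workaround, not a fix; the revision number, thread count and backoff values are placeholders, and the exact restore flags depend on your setup:

# Hypothetical wrapper: re-run the restore with exponential backoff until it exits cleanly.
delay=60
until duplicacy restore -r 119 -overwrite -threads 16; do
    echo "restore failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
done

The -overwrite flag is there because a re-run will hit files already restored by the previous attempt, which duplicacy refuses to overwrite by default.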

Here is an extended log as an example: Log Example - Duplicacy · GitHub

Version: 3.2.3 (254953)

How big is revision 119? If you have enough local storage, you can run duplicacy copy to copy all chunks that belong to this revision to a local storage and then run a restore against this local storage. This should be much faster than a direct restore.

@gchen Not exactly sure, but this is almost a full restore, so I would estimate ~80 TiB needs to be downloaded (gigabit fiber in place). The ZFS array has 323 TiB of space, so a copy is definitely an option. I was seeing ~20-30 MB/s (significantly below gigabit) on restore.

Any advice on parameters I should set or commands I should run to plan this?

Specifically: my Azure storage is password protected and has a key. Should I create a bit-identical copy (with the same key) or just make a non-encrypted local copy? Is one approach better? (If it's possible to use rsync-style tools to copy the blobs, that would probably be best.)

I think -bit-identical should be used. With this option you could use rsync/rclone to copy chunks, but you would need to figure out which chunks to copy. duplicacy copy is much simpler and supports multithreaded downloading.

Here are the commands to run:

mkdir -p /path/to/local/storage
mkdir -p /path/to/restore/directory
cd /path/to/restore/directory
duplicacy init backup_id azure://storage
duplicacy add -copy default -bit-identical local backup_id /path/to/local/storage
duplicacy copy -from default -to local -r 119 -download-threads 40
duplicacy restore -r 119 -storage local
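
Optionally, you could verify the local copy before kicking off the restore. Assuming the check command's -storage, -r and -chunks options apply here as usual, something like:

duplicacy check -storage local -r 119 -chunks

would confirm that every chunk referenced by revision 119 made it across intact.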

I did a bit of testing and found that I needed to add -threads 16 to get into 900 Mbps territory. Based on the current speed, it looks like I will be able to test a restore in a little over a week.

I will follow up!

Incidentally, if you have a local copy of many of the source files (partial or old), you could ‘pre-seed’ a local backup storage, which you can then ‘top up’ from the remote storage.

All you’d have to do is make a copy-compatible storage - i.e. create a new (empty) local storage using the remote storage as a copy-compatible template. Then back up those source files to that local storage using temporary IDs; this all happens locally. As a result, you’d have a bunch of content-defined chunks which a final copy from the remote can fill in, skipping chunks you already have.

Obviously, this requires you to have at least some of the original files to make it worthwhile; depending on how much you have, it could save quite a bit of time.
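
As a rough sketch of those steps, assuming the copy-compatible local storage from the earlier commands already exists, and with temp_id and the paths as placeholders:

# Back up the old/partial source files into the local storage under a temporary ID.
cd /path/to/old/source/files
duplicacy init -e temp_id /path/to/local/storage
duplicacy backup -threads 4
# Then top up from the remote; chunks already present in the local storage are skipped.
cd /path/to/restore/directory
duplicacy copy -from default -to local -r 119 -download-threads 40

(The -e flag assumes the local storage inherits the remote's encryption; init should prompt for the storage password.)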

@Droolio and @gchen So the backup is now local and I am observing disappointing restore speeds. To elaborate:

  • Mostly empty, large ZFS array (virgin), so no fragmentation; 1M block size.
  • Robust ARC / ECC RAM (>500 GB).
  • I am seeing the same behavior at 64 threads and at 128.

Behavior:

  • The restore is flying along at 450-550 MB/s, then it will suddenly pause/hang for about 30-60s.
  • No IOWAIT (ZFS reports idle, with very little queued activity).
  • The pause is preceded by a small (~1-2k) surge in write operations for 1-2s, which I see in the queue.
  • During the pause, the CPU is practically idle except for one core, which is maxed out.
  • Average restore speed is 150 MB/s with the pauses. This array can easily handle 600-750 MB/s of sustained writes.
  • Reducing the thread count to 16 does not eliminate the pauses, and results in a considerable reduction in throughput.

Any guesses on what could be happening? It seems like some kind of single-threaded program behavior is causing the restore process to pause. GC?
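
One way I could test the GC theory, using the generic Go runtime GC trace (not a duplicacy feature; duplicacy is a Go program) and with the restore flags as placeholders:

# GODEBUG=gctrace=1 makes the Go runtime print a line to stderr for every GC cycle.
GODEBUG=gctrace=1 duplicacy restore -r 119 -storage local -threads 16 2> gctrace.log

If GC were responsible, the pause figures in those lines should line up with the stalls; Go's stop-the-world pauses are normally well under a millisecond, so this would more likely rule GC out than confirm it.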

64 threads is definitely too many. The restore is basically single-threaded in the sense that only one thread writes to the disk, restoring files one by one, while multiple download threads retrieve chunks from the storage. Try 4 threads to see if that helps.

You can also run multiple restore jobs, one for each subdirectory, if that is possible.
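
A rough sketch of the per-subdirectory approach, assuming restore accepts include patterns as trailing arguments (worth double-checking the pattern syntax first) and with subdir1/subdir2 standing in for your actual top-level directories:

# One restore job per top-level directory, each restricted by an include pattern.
cd /path/to/restore/directory
duplicacy restore -r 119 -storage local -threads 4 -- '+subdir1/*'
duplicacy restore -r 119 -storage local -threads 4 -- '+subdir2/*'

These could also be launched concurrently from separate terminals to get more than one writer going, though whether two restores sharing the same repository cache behave well together is something to verify first.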