Restore Timeouts with Azure

spruit · 13 July 2019 16:02

I may have accidentally deleted a large folder the other day (10+GB in a 1TB backup set), and I’m attempting to restore it. The storage is Azure, and I’m consistently getting timeouts during the restore. This has caused no end of pain, as each time I retry the restore it has to start downloading chunks all over again. I’ve tried the default number of threads, as well as 1, 2, 8, and 16. It doesn’t seem to matter, but I do see latency spikes on the Azure storage account.

Is there some way to get better retries? This is effectively making the restore impossible (I’ve successfully restored individual folders within the large folder, but that seems like a hack at best when I need the full folder back).

Downloaded chunk 1588 size 5043711, 16.47MB/s 00:58:57 12.4% 
Downloaded chunk 1589 size 13436099, 16.46MB/s 00:58:58 12.4%
Downloaded chunk 1590 size 1204559, 16.47MB/s 00:58:57 12.4%
Downloaded chunk 1592 size 3409888, 16.47MB/s 00:58:55 12.4% 
Failed to download the chunk 7924d6832c030d69357696b993b5ec4e1966267cbd6c16cdbee00bd5b3c8096d: read tcp 172.17.0.3:34068->IP:443: read: connection timed out

spruit · 13 July 2019 16:38

I see that when I don’t use the -stats option, it picks up where it left off on each run (yay for not downloading the same files again). However, when you have 1M+ files in a folder, having to keep rerunning the command is not a great user experience.

My only guess is that, due to the size of some of the files (small) and how many there are (lots), some sort of storage limit is being breached, causing an API cooldown that results in a connection timeout.
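Since a rerun without -stats resumes from the chunks already downloaded, the manual rerunning described above could be automated with a small wrapper. This is only a minimal sketch, not anything Duplicacy ships: `run_until_success`, its parameters, and the backoff values are hypothetical, and it assumes the command exits with a non-zero status on failure.

```python
import subprocess
import time


def run_until_success(cmd, max_attempts=20, base_delay=5.0, max_delay=300.0,
                      run=subprocess.run, sleep=time.sleep):
    """Re-run cmd until it exits with status 0 or max_attempts is reached.

    Returns the number of attempts used; raises RuntimeError if every
    attempt fails. The wait between attempts doubles each time, capped
    at max_delay seconds.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise RuntimeError(f"command still failing after {max_attempts} attempts")


# Hypothetical usage (the revision number and thread count are placeholders):
# run_until_success(["duplicacy", "restore", "-r", "1", "-threads", "4"])
```

The growing delay is there on the guess that the failures come from a temporary cooldown on the storage side; if the cause is something else, a fixed short delay would do just as well.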

gchen · 14 July 2019 00:58

Can you try rate limiting the download a bit using the -limit-rate option? I guess if it was Azure that was rate limiting the download it would have returned some HTTP status code like 429 rather than messing up the connections. It may be something else, like a network issue.

spruit · 15 July 2019 20:16

I’ll play with it a bit tonight to see if that helps at all. I have a similar issue at times with uploads, but as they don’t push nearly as much data, it doesn’t happen as often.

It could also be that the SDK you forked is now over a year old, and the current version may have improvements to retries too (there were similar issues in the Java version, but I never found a related issue in the Go SDK). The latest can be found at GitHub - Azure/azure-storage-blob-go: Microsoft Azure Blob Storage Library for Go.

spruit · 16 August 2019 04:10

Ok, so I’ve finally had some time to work on this. However, it’s not related to restores (finally just kept at it until it finished), but instead for copies as I’m thinking of bringing on another cloud storage vendor.

For some background, I have ~1Gbps symmetric internet. I am backing up to 1 Azure storage account with a single container (root folder). In that container I have ~970GB of data in ~250 revisions and ~222500 chunks. My backup runs hourly, with a daily check and a weekly prune. This has been working consistently for a while (maybe 1-3 failures due to TCP resets throughout a week), and I typically update ~50-100MB/hour. To test copying ability I created a new container in the same storage account - NOT bit-identical. Azure specifies a max of 50MB/sec bandwidth for a storage account (theoretically my network could saturate that). I’m also running the Web UI in a docker image (saspus/duplicacy-web).

I attempted to run the initial copy, and it kept hitting TCP resets at various points while copying all the revisions. To get around this and bring the two containers in sync, I spun up an Ubuntu VM in Azure and did the copy via the CLI. It completed successfully with no TCP resets on the first run - yay.

Now, I want to keep my backup and copy sync’ed. On my hourly schedule I have the regular backup that runs and then the copy that runs (NOT in parallel). I have had very little luck getting the copy to complete successfully without having a TCP reset that isn’t retried. In the image below you can see the initial copy fail on the 4th row along with all the hourly failures.

As per the recommendation, I’ve been playing with the -upload-limit-rate and -download-limit-rate settings. After the first few hourly failures I tried 20000 for each (20MB/s), which theoretically would’ve been less than the 50MB/sec max. After that failed overnight, I moved it down to 10000 for each, then 1000, then 500, then 100, and finally I’ve set it to 5, and I am still receiving the same errors. During all of this I have not added the -threads option. It does not appear to be related to the rate.
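As a rough sanity check on those limits, the sketch below estimates how long 100MB (the upper end of the hourly change mentioned earlier) would take at each setting. It assumes the limit values are in KB/s with 1MB = 1000KB, which matches reading 20000 as 20MB/s; `transfer_seconds` is a hypothetical helper, not part of Duplicacy.

```python
def transfer_seconds(size_mb, rate_kb_per_s):
    """Seconds needed to move size_mb megabytes at rate_kb_per_s (1 MB = 1000 KB)."""
    return size_mb * 1000 / rate_kb_per_s


# The limits tried above, from highest to lowest.
for rate in (20000, 10000, 1000, 500, 100, 5):
    secs = transfer_seconds(100, rate)
    print(f"{rate:>6} KB/s -> {secs:>8.0f} s ({secs / 3600:.2f} h)")
```

At a limit of 5, 100MB would need 20000 seconds (about 5.6 hours), so a fully throttled copy could not finish inside an hourly schedule anyway. The real transfer is likely smaller, since chunks that already exist at the destination are skipped.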

As mentioned the Java SDK seemed to have a similar issue with retries that was fixed earlier this year - https://github.com/Azure/azure-storage-java/issues/363#issuecomment-480969209. It’s possible that moving to a new version of the SDK may help with these due to both sync and async IO improvements.

spruit · 16 August 2019 04:21

It appears to happen when it actually starts moving chunks, after it’s found which revisions it has already copied. This is the last failure, with the options “-upload-limit-rate 5 -download-limit-rate 5”:

2019-08-15 21:05:26.598 INFO SNAPSHOT_EXIST Snapshot nas at revision 2417 already exists at the destination storage
2019-08-15 21:05:26.609 INFO SNAPSHOT_EXIST Snapshot nas at revision 2418 already exists at the destination storage
2019-08-15 21:05:26.620 INFO SNAPSHOT_EXIST Snapshot nas at revision 2419 already exists at the destination storage
2019-08-15 21:05:26.632 INFO SNAPSHOT_EXIST Snapshot nas at revision 2420 already exists at the destination storage
2019-08-15 21:19:14.399 ERROR DOWNLOAD_CHUNK Failed to download the chunk 4a6dc1fa38f9a9a1d6c95fe1f3894cc1dc5ac23ad4c8a05ec913ddc0e8bfae8d: read tcp IP:Port->AzureIP:443: read: connection reset by peer

Droolio · 16 August 2019 14:06

Are you always doing a copy after a backup? I wonder if you’re backing up at full whack (without the -limit-rate option) and then going straight into a copy, and it’s throttling you because of that?

You may have to wait a little while for them to back off any limits they’re applying to your account due to exceeding it in the last however many hours? Tis a wild stab.

Might be counter-intuitive but have you tried specifying -threads 4 or something small at first?

spruit · 16 August 2019 14:17

So last night before I went to bed I set the copy job at -upload-limit-rate 20000 -download-limit-rate 20000 -threads 1 while leaving everything else exactly the same and it completed successfully for the 8 or so times it ran and took about the same amount of time to run as the failed jobs (probably because the vast majority of chunks are skipped anyways).

I’m going to remove the upload and download limits today while I’m at work to see if it’s just the threads…and the connections associated with those threads.

I had assumed that 1 thread was the default as it’s not explicitly mentioned anywhere in the help docs, but that doesn’t seem the case. Can those be updated with what the defaults are for each command parameter? It’d also be good to know how many connections per thread are opened to storage providers. Is it a 1:1 or something else?

spruit · 17 August 2019 03:33

After removing the upload and download limits, but keeping -threads 1, I only had 1 failure in over 10 runs. Seems like it’s related to the number of connections/threads.
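To illustrate why a thread setting would cap the number of simultaneous connections, here is a minimal sketch of a bounded worker pool. It assumes one connection per worker, which is exactly the open question above and not something confirmed for Duplicacy; `copy_chunks` and `ConcurrencyProbe` are hypothetical names used only for this illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def copy_chunks(chunk_ids, transfer, threads=1):
    """Call transfer(chunk_id) for every chunk with at most `threads` in flight."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(transfer, chunk_ids))


class ConcurrencyProbe:
    """Stand-in transfer function that records the peak number of concurrent calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, chunk_id):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.01)  # simulate the time spent on one chunk
        with self._lock:
            self._active -= 1
        return chunk_id


probe = ConcurrencyProbe()
copy_chunks(range(20), probe, threads=4)
print("peak concurrent transfers:", probe.peak)
```

Under the 1:1 assumption, the peak never exceeds the thread count, so lowering it directly lowers how many connections the storage account sees at once.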

Droolio · 17 August 2019 13:26

Fairly sure the default thread count for all storages is 1, so it was probably just a fluke or they removed the throttling by the time you retried.

jani · 1 March 2020 11:20

I’m seeing this as well. I’m trying to do an initial upload of my backups to Azure, but I cannot seem to finish it. Thanks for the -threads pointer, I’ll be trying it out.

It seems there are some limits on Azure indeed. I found this from Google:

Initially I was uploading blocks of 2MB in size. Doing about ten in parallel. Uploading that took little bit under a two minutes. Nothing interesting. At least I thought. By a coincidence I changed the block to 512KB and used only four connections. As it turned out that solved my problem with remote host (Azure) forcibly closing my connection.

https://www.tabsoverspaces.com/233437-investigating-the-an-existing-connection-was-forcibly-closed-by-the-remote-host-on-azure-blob-storage

EDIT: Oh, I forgot the error I’m getting:

Failed to upload the chunk 9711c...: Put https://x.blob.core.windows.net/duplicacy/chunks/97/11c...: net/http: HTTP/1.x transport connection broken: write tcp 192.168.1.2:54521->20.38.103.68:443: wsasend: An existing connection was forcibly closed by the remote host.

jani · 4 March 2020 18:29

Hmm, I think I solved this by upgrading my blob storage account to general purpose v2 account. At least the backup now went through without any -threads or -limit-rate flags.

gchen · 11 November 2020 15:51

A post was split to a new topic: Restore Timeouts with Azure or Onedrive