“check” command with the “-chunks” option fails

The “check” command with the “-chunks” option seems to be very flaky. More often than not it fails for me. Usually the solution seems to be to just run it again and wait for it to fall over in a different place (after it skips the chunks that are already verified).

Typically I see errors in the log like this:

2022-07-18 23:04:39.156 INFO VERIFY_PROGRESS Verified chunk b56005a73e5cdda0c092be7448db5531cbc38d6c34bcfc0091c21d5e61ebbacf (1610/6200), 2.52MB/s 02:37:20 26.0%
2022-07-18 23:04:39.867 ERROR DOWNLOAD_CHUNK Chunk 042c8cb6d2b40eb8bd3b63f29f5cc5da7450ef3bcc9bac62a10b4900570d0089 can't be found
2022-07-18 23:04:39.876 INFO SNAPSHOT_VERIFY Added 1610 chunks to the list of verified chunks
Chunk 042c8cb6d2b40eb8bd3b63f29f5cc5da7450ef3bcc9bac62a10b4900570d0089 can't be found

However, when I check the storage (on OneDrive in this case), the chunk is there. I’m guessing it’s caused by a communications glitch or a bug in the protocol at either end. It seems intermittent (therefore hard to reproduce) but if I spot a pattern I’ll post back here.

Regardless of the cause, it’s a shame that the “check” utility doesn’t recover, and that it doesn’t log a more accurate error message.

What really worries me is that I now have no confidence that a restore operation will actually work.

A few things here. All those remotes (Google Drive, OneDrive) are eventually consistent, meaning that if you upload a file and then immediately check for it, the file may not be there yet, but it will appear eventually. If that’s not what you’re hitting, another possibility is that OneDrive is either flaky (in which case it’s not clear what to do: should every failed request be retried?) or it’s doing something deliberately unhelpful to protect against abuse.

Does it also happen if you use just 1 thread? Generally OneDrive tolerates 2, but with more than that it will freak out.
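
(For reference, a single-threaded verification would look something like the line below; the -threads option should be accepted by reasonably recent CLI releases, but check duplicacy check -h on your version.)

duplicacy check -chunks -threads 1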

Interesting, thanks for the info, appreciated. How long is “eventually” - hours, days? My gut feel is that’s not the issue because I think these are old chunks (the specific one above was from January).

Protection against abuse feels more likely, because the problem often seems to happen after about an hour or 1500 chunks.

Using multiple threads definitely causes all sorts of issues and error messages. The example above was using two just to speed up the diagnostics but normally I just use one and I’ll return to that.


It definitely fails if the computer is allowed to sleep.

The error message “connection was forcibly closed by the remote host” in this case is untrue and misleading.

2022-07-19 04:23:37.668 ERROR DOWNLOAD_CHUNK Failed to download the chunk 7259444a0d0af7a3df8c2db8d05c4b21d52d9e0dd90fb3a01c2f876540142487: read tcp 192.168.1.219:57989->13.107.42.12:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022-07-19 04:23:37.673 INFO SNAPSHOT_VERIFY Added 447 chunks to the list of verified chunks
Failed to download the chunk 7259444a0d0af7a3df8c2db8d05c4b21d52d9e0dd90fb3a01c2f876540142487: read tcp 192.168.1.219:57989->13.107.42.12:443: wsarecv: An existing connection was forcibly closed by the remote host.

This is very interesting. Actually, it’s likely a correct error message from Duplicacy’s perspective: it carries on as if nothing happened, unaware that the world around it (itself included) stopped and then resumed, and suddenly the server no longer wants to talk to it and closes the connection (because the session expired in an instant, or the tokens are now ancient). Duplicacy cannot know that; from its point of view, time stood still.

Maybe it could subscribe to system events, listen for the wake notification, and re-establish all connections when one arrives, but I guess it would be easier to just retry a few times on these forceful connection closures.
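
Something along these lines would be the idea (a minimal sketch in Go, not Duplicacy’s actual code; withRetry, the attempt count and the delays are all made up for illustration):

package main

import (
    "errors"
    "fmt"
    "net"
    "time"
)

// withRetry repeats op up to 'attempts' times, backing off a little longer each
// time, but only when the failure looks like a transient network error such as
// a connection reset after the machine wakes from sleep.
func withRetry(op func() error, attempts int) error {
    var err error
    for i := 1; i <= attempts; i++ {
        if err = op(); err == nil {
            return nil
        }
        var opErr *net.OpError
        if !errors.As(err, &opErr) || i == attempts {
            return err // not a network error, or out of attempts
        }
        wait := time.Duration(i) * 5 * time.Second // arbitrary backoff
        fmt.Printf("attempt %d failed (%v), retrying in %v\n", i, err, wait)
        time.Sleep(wait)
    }
    return err
}

func main() {
    // Stand-in for the real chunk download, failing with a typical reset error.
    err := withRetry(func() error {
        return &net.OpError{Op: "read", Net: "tcp", Err: errors.New("connection reset")}
    }, 3)
    fmt.Println("final result:", err)
}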

I’m not really a Windows programmer but I would guess your statement “Duplicacy cannot know that” is probably untrue.

Anyway, I’m having some success with reproducing the issue - it seems to be 100% reproducible if there are more than about 1,600 unverified chunks in a storage and the storage is located on OneDrive or Wasabi (regardless of age or number of threads).

If my hunch is correct, that means that large restore operations will probably fail repeatedly as well before they finally succeed, which is unfortunate to say the least. I will test that next, but it will take several days.

Update: It’s going to take about a month at the current rate. Why is it so slow? If restoring a backup takes as long as checking it, cloud backups will be completely useless to me.

This sounds like a throttling issue to me. I don’t know how :d: reacts to it, but after hammering OneDrive with rclone you will see a number of 429 errors (too many requests), and the only way to deal with them is to slow down. rclone does this automatically, but the slowdowns can stretch into minutes. If I do a massive OneDrive operation with multiple threads, things go really fast in the beginning, but after a while it stalls due to throttling and can take a while to continue, as it becomes all stop and go.

If :d: doesn’t handle throttling gracefully, I can imagine you might see something like you observe.
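
By “handle throttling gracefully” I mean something like waiting out the Retry-After period on a 429 response before trying again. A rough sketch in Go (generic HTTP, not :d:’s or rclone’s actual code; getWithThrottle and the 30-second fallback are made up, and only the seconds form of Retry-After is handled):

package main

import (
    "fmt"
    "net/http"
    "strconv"
    "time"
)

// getWithThrottle issues a GET and, whenever the server answers
// 429 Too Many Requests, sleeps for the advertised Retry-After
// period (or a fallback) before trying again.
func getWithThrottle(client *http.Client, url string, maxRetries int) (*http.Response, error) {
    for i := 0; ; i++ {
        resp, err := client.Get(url)
        if err != nil {
            return nil, err
        }
        if resp.StatusCode != http.StatusTooManyRequests || i >= maxRetries {
            return resp, nil
        }
        resp.Body.Close()
        wait := 30 * time.Second // fallback if no Retry-After header
        if secs, convErr := strconv.Atoi(resp.Header.Get("Retry-After")); convErr == nil {
            wait = time.Duration(secs) * time.Second
        }
        fmt.Printf("throttled (429), backing off for %v\n", wait)
        time.Sleep(wait)
    }
}

func main() {
    // example.com won't actually throttle; the URL is just a placeholder.
    resp, err := getWithThrottle(http.DefaultClient, "https://example.com/", 5)
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}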

It may be coincidence, but it seems to always time out after about an hour rather than after a specific number of chunks. If there is more than one storage in the scheduled task list it will move on to the next one and the timer apparently restarts.

Sometimes the status shows “Failed” instead of “Missing chunks” in the Schedule window but I can see no significant difference between these two cases in the logs or the settings.

Have you tried having Duplicacy run a benchmark to test your connection to OneDrive? Sometimes where you are on the internet relative to the particular data center you’re connected to makes a huge difference.
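
Something like this, for example (check duplicacy benchmark -h for the exact options your CLI version supports, such as chunk sizes and thread counts):

duplicacy benchmark -storage <storage name>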

*Drive-type storages are not optimized for the bulk storage operations that duplicacy requires, either on purpose, to discourage abuse, or accidentally, as a result of compromises made to support different use cases, such as collaboration and file sharing rather than general-purpose bulk storage like S3. OneDrive is in fact one of the worst in terms of tolerating such abuse, along with Dropbox.

In my opinion duplicacy should drop support for these services, as they don’t work for reasonably large backups (I know some people, myself included, have had better success with Google Drive, but that doesn’t really change the reasoning). It will likely never happen, though, because the feature looks good on paper and even works OK for small datasets.

That does sound like you’re hitting the expiration on the API access token. Take a look at the contents of the JSON file you originally downloaded for OneDrive. There’s an “expiry” field (in UTC); the token is good for one hour from the time it was issued.

Each access token includes an embedded “refresh” token that grants permission to download a new token (also good for one hour from issuance). Perhaps the refresh is failing, resulting in Duplicacy being kicked off OneDrive.
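
If it helps to check whether the token on disk has already lapsed, here’s a small sketch in Go (assumptions: the “expiry” and “refresh_token” field names are the ones discussed in this thread, the file name is hypothetical, and the real token file contains more fields than these):

package main

import (
    "encoding/json"
    "fmt"
    "os"
    "time"
)

// tokenFile models just the two fields of the OneDrive token file
// discussed here; the real file has more.
type tokenFile struct {
    Expiry       time.Time `json:"expiry"`
    RefreshToken string    `json:"refresh_token"`
}

func main() {
    data, err := os.ReadFile("one-token.json") // hypothetical path
    if err != nil {
        fmt.Println("cannot read token file:", err)
        return
    }
    var tok tokenFile
    if err := json.Unmarshal(data, &tok); err != nil {
        fmt.Println("cannot parse token file:", err)
        return
    }
    fmt.Println("expiry (UTC):", tok.Expiry.UTC())
    if time.Now().After(tok.Expiry) {
        fmt.Println("access token has expired; a refresh (or a fresh token) is needed")
    } else {
        fmt.Println("access token still valid for", time.Until(tok.Expiry).Round(time.Second))
    }
}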

What you say is entirely plausible, but isn’t the whole point of Duplicacy to enable backups to such services?

I’ve had worse problems with Google Drive in the past - that was my original reason for trying Duplicacy, in fact. UpdraftPlus and Google’s own client are both very unreliable when trying to backup a website or a PC to there - backups seem to succeed but restoring often fails. I’ve had similar difficulty with Amazon S3 - I had to pay for “accelerated access” to solve that.

The expiry field in my access token is back in March:

“expiry”:“2022-03-25T22:05:13.640593904Z”

It does sound as if the refresh is failing. Should I report that as a bug?

While Duplicacy, Duplicati, Arq and other apps enable backing up to various cloud storage providers – whether a given cloud storage service is suitable for a particular backup use case is the question.

OneDrive, Google Drive, Dropbox, Box and other similar file sharing cloud storage services were designed for intermittent uploads rather than bulk uploads of thousands of files at a time.

Depending on your backup needs and budget, check out rsync.net. While not the least expensive per gigabyte by a long shot, there’s no bandwidth throttling, no ingress/egress charges, and the backend is simple SFTP onto ZFS. It’s been a while since I’ve used them, but the service was great (one of the features I really liked was the option to create multiple user accounts with individual passwords and file permissions).

It makes sense that the original access token you downloaded would have an old expiry date. The refresh_token embedded inside allows a new access token to be downloaded; the new access token then contains a new refresh_token of its own, allowing for a continuous chain of refreshes.

I don’t know if your particular situation is due to a bug in Duplicacy. The next step I would take is to download a fresh access token from https://duplicacy.com/one_start to see if that helps.

I’m not sure why a fresh token would make a difference, but I’m trying it anyway…

First, with a fresh access token you’re assured that Duplicacy will be granted access to your OneDrive account (since we already know that the token file on your computer expired back in March).

Second, the fresh token will include an equally fresh “refresh_token” that’s required for downloading a new token after 60 minutes (according to Duplicacy’s source code, it connects to https://duplicacy.com/one_refresh to download new tokens).

Third, it helps rule out a problem on the remote storage end. Cloud service providers often use load balancing to distribute users across multiple servers and/or data centers (there’s a practical limit to how big a single server can be).

Unless you’re paying big bucks for service, your data isn’t going to be replicated to every data center Microsoft operates. Microsoft says OneDrive data “[…] is mirrored into at least two different Azure regions, which are at least several hundred miles away from each other, allowing us to mitigate the impact of a natural disaster or loss within a region.” In the United States, Microsoft’s current data centers are Boydton, Cheyenne, Chicago, Des Moines, Quincy, San Antonio, Santa Clara and San Jose.

Microsoft’s load balancer is going to route Duplicacy to a data center that’s holding a copy of your OneDrive account. The choice of data center might be some combination of availability, proximity, network path, system load, etc. When Duplicacy uploads chunk #2, Microsoft’s load balancer is likely going to gravitate toward the same data center chunk #1 was sent to, because there’s already a running OneDrive server process ready to receive more data from you, rather than routing to a different data center where a new server process would have to be spawned on an available node.

Because an access token is a pass into your OneDrive account, a fresh token is effectively a different user (there could be multiple applications accessing your OneDrive account). Microsoft’s load balancer will treat it like any new login and determine which data center is most appropriate. Even if the new login ends up at the same data center, the odds of it also ending up on the same server node are very small (large data centers have huge server farms).

The problem you’ve been having could simply be due to a flaky server, or some kink in a complex chain of connections. Using a fresh token is almost guaranteed to get you onto a different server node and/or data center (which in turn will have a different network path).

(I know nothing about Microsoft’s internal data center operations. The above is just speculation based on personal experience as a system administrator in a data center.)

First, thanks for all that information. I wasn’t aware that tokens were refreshed and I think that was the crux of the “missing chunks” problem.

Second, it appears to be working - the check is still running more than an hour later and the refresh timestamp in the token has incremented by an hour. So, kudos for correctly spotting the problem and many thanks for your help.

(I was surprised that the expiry time still appeared to be in the past until I remembered daylight saving time. My excuse is I live within walking distance of the Greenwich meridian.)
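
For anyone else caught out by this: the timestamp in the token file is UTC, so during British Summer Time a token that is still perfectly valid can look an hour stale. A tiny Go sketch with a made-up timestamp:

package main

import (
    "fmt"
    "time"
)

func main() {
    // Hypothetical "expiry" value as it would appear in the token file (UTC).
    expiry, err := time.Parse(time.RFC3339, "2022-07-21T17:05:13Z")
    if err != nil {
        panic(err)
    }
    fmt.Println("expiry as stored (UTC):", expiry)
    fmt.Println("expiry in local time:  ", expiry.Local())
}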

I’m nevertheless still a bit surprised that renewing the token has made a difference, but I think there may be a simpler explanation. I put the new token in a more convenient location (the root of my Windows user account) whereas previously I think it was in a location that required admin access (kind of deliberately). Having said that, I couldn’t find the location in the config file so I’m not 100% sure where the original token was being picked up from.

I also have “controlled folder access” enabled in Windows 11 that might have had an effect. Now that I know the token has to be writeable I’ll keep an eye out for notifications about that.

Still getting comms failures, but less frequently now.

2022-07-21 18:06:25.455 ERROR DOWNLOAD_CHUNK Failed to download the chunk e49a039396dced71a8567acbe7c41e6a6a41958c0c6833589607da542949350d: read tcp 192.168.1.219:55421->13.107.42.12:443: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-07-21 18:06:25.483 INFO SNAPSHOT_VERIFY Added 2311 chunks to the list of verified chunks
Failed to download the chunk e49a039396dced71a8567acbe7c41e6a6a41958c0c6833589607da542949350d: read tcp 192.168.1.219:55421->13.107.42.12:443: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

You should definitely run a Duplicacy benchmark to test the connection to OneDrive, because you’re getting network timeouts.

It’d also be helpful to run some tests with iperf.

Example network benchmark test:

iperf3 --reverse --time 60 --interval 5 --client <test_server>

Example network jitter test:

iperf3 --reverse --time 60 --interval 5 --client <test_server> --udp --bitrate 10M

Replace <test_server> with the public test server closest to your location (see the iPerf “Public iPerf3 servers” list) and adjust the bitrate for the jitter test as needed (e.g., 10M = 10 Mbps).

Sometimes a network connection will seem perfectly fine when web browsing and streaming audio/video because losing a network packet here and there isn’t noticeable. But the same thing would be unacceptable for a backup program which must be much more sensitive to network issues.