Retry on 400 errors?

Right now, :d: fails the job with “Unexpected response” on receiving HTTP 400. However, for OneDrive (and possibly GDrive) these are not entirely unexpected: the API does return 400 once in a while for no apparent reason. I believe :d: should apply the usual retry logic on 400; I see no reason why it shouldn’t treat 400 the same as any other 4xx/5xx error, apart from 401, 404, and 409, which are handled separately.

The issue is mentioned in https://github.com/gilbertchen/duplicacy/issues/611.

The fix might be straightforward, too. Right now, the ODB client runs its retry logic only after this check:

} else if response.StatusCode > 401 && response.StatusCode != 404 {

I think it should just be

} else if response.StatusCode >= 400 && response.StatusCode != 404 {

(The >= 400 check is not strictly necessary, but is kept for clarity.)
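For illustration, the proposed classification could be factored into a small predicate. This is only a sketch: the function name is hypothetical, duplicacy's actual client keeps this as an inline if-chain, and the 401/404/409 exclusions follow the description above.

```go
package main

import "fmt"

// shouldRetry sketches the proposed check: retry on any 4xx/5xx status
// except 401, 404, and 409, which are handled separately. The name is
// hypothetical; the real client uses an inline condition, not a helper.
func shouldRetry(status int) bool {
	switch status {
	case 401, 404, 409:
		return false
	}
	return status >= 400
}

func main() {
	for _, code := range []int{400, 401, 404, 409, 429, 500} {
		fmt.Printf("HTTP %d -> retry: %v\n", code, shouldRetry(code))
	}
}
```

Under the current `> 401` check, 400 falls through to the failure path; with `>= 400`, it joins 429, 500, and friends in the retry path.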

Thoughts?


After skimming over that particular section of code, the only wrinkle I can think of at the moment is if Microsoft for some reason reduced the maximum payload per POST request, and/or a :d: user runs duplicacy init -chunk-size 100M, which exceeds OneDrive’s current cap of 60 MiB per API request. If :d: doesn’t abort on the 400 error, it’d loop forever on the same chunk, with randomly generated delays between retries.

It won’t loop forever; all retries give up after 12 attempts, no matter what the reason is.

Thanks, you’re right, I hadn’t looked far enough outside of that particular if-then block.

After mulling over various scenarios, I’m not sure whether I’d rather know immediately that there was a 400 HTTP status, or have a potential problem go unnoticed for a while because it recovers before reaching the maximum of 12 retries.

Imagine a scenario where a 400 error occurs, :d: backs off for a random interval, retries and delays again, and again, and again, before finally succeeding on the 10th try. Then the next chunk to be uploaded suffers a similar issue, and so on for thousands of chunks, resulting in a completed backup that runs so long it overlaps the next scheduled backup.

It’s likely to be rare, but a couple of years ago I ran into a similar issue that caused an app I was working on to exceed its daily API request limit due to repeated retries that weren’t immediately noticed.

You’re making up a hypothetical scenario. Why should 400 be treated differently from any other 4xx/5xx error? That’s already how it works for other errors, including 429, which is the error you actually get when throttled. You can certainly see this behavior with 429; I’ve never seen it with 400.

I am talking about real use cases (and not just mine, as can be seen in the previous discussion and on GitHub): you have a long-running backup that fails due to a spurious 400 error and then needs to be restarted when and if you notice it, or the next time the job starts if it is on a schedule (and initial uploads usually are not).

Also, scheduled jobs won’t overlap; I asked about that before. If a job is already running, the scheduler won’t fire it again until it is done.

I’m sorry if my post sounded like I was referring to having seen a throttling issue coupled with a 400 HTTP status code; I haven’t. I was just referring to a similar issue I’d encountered in which status codes caused an app to run longer than intended.

Speaking of which, early last week I was on the receiving end of another similar issue that’s much closer to our discussion. A user had an app that was downloading images from my employer’s web service. The app was submitting malformed requests through our API and ignoring the 400 HTTP status code that was returned. The malformed requests were coming in at a rate of more than 1.2 million a day, with the volume increasing slowly day after day. Although we have load balancers and web server clusters, it’s still wasted resources (not to mention bloated log files). Of course, I added a firewall rule to at least temporarily block the requests, but firewall appliances don’t have infinite bandwidth and CPU cycles.

That’s true for the web edition, but not for the CLI edition. I’m not trying to argue with you or against the suggested change; I think it’s worth considering.

Some of the responsibility is on the cloud storage service providers if they’re returning a 400 response code when it’s not the client that’s in error.

Noted, I’d seen the posts on this forum and GitHub.

Any luck getting this fixed? I’m running the latest version of Duplicacy Web GUI and am trying to back up 5TB to OneDrive for business. It crashes every 200GB or so when I get the 400 Invalid request error.

When I restart the job, it runs for a few hours, then I again get an invalid request error.

How are you all working around this issue?

I submitted a PR with a fix a while ago; it’s up to @gchen to incorporate it (or not) into the mainline.


Thanks for making the fix! Have you tested it? Are you still getting any errors?

@gchen - can you please incorporate this fix? I’m trying to upload 5TB and the app keeps crashing every few hours. Hopefully this fix will solve all my OneDrive woes!

If it hadn’t worked, I wouldn’t have submitted it :slight_smile: So yes, it worked for me; it was good enough for a 20TB+ upload.
