Does downloading chunks need to GetFileInfo() on every chunk?

It looks like when downloading chunks, FindChunk() is called, which in turn calls GetFileInfo() (see duplicacy_chunkoperator.go on the master branch of gilbertchen/duplicacy on GitHub).

From my reading of the code, in this read path FindChunk() is used to compute the chunk's file path and to determine whether the file exists (via GetFileInfo()).

So my question is: in this code path, is GetFileInfo() necessary? Could the code be optimized to eliminate this call? It seems the DownloadFile() call would already tell you whether the file exists, via a 404 or some other error. Perhaps, instead of calling GetFileInfo() first, DownloadFile() could simply be called optimistically.
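To make the idea concrete, here is a minimal Go sketch of what an optimistic download might look like. None of these names (Storage, ErrChunkNotFound, chunkPath, downloadChunkOptimistically) come from duplicacy; they are simplified stand-ins for illustration only:

```go
package sketch

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel; each backend would have to map its own
// "not found" error (e.g. HTTP 404) onto this for the idea to work.
var ErrChunkNotFound = errors.New("chunk not found")

// Storage is a simplified stand-in for the real storage interface.
type Storage interface {
	DownloadFile(path string) ([]byte, error)
}

// chunkPath stands in for the path computation done by FindChunk();
// it only derives a path and never touches the network.
func chunkPath(chunkID string) string {
	return "chunks/" + chunkID[:2] + "/" + chunkID[2:]
}

// downloadChunkOptimistically skips the GetFileInfo() existence check
// and calls DownloadFile() directly, treating "not found" as a missing chunk.
func downloadChunkOptimistically(storage Storage, chunkID string) ([]byte, error) {
	data, err := storage.DownloadFile(chunkPath(chunkID))
	if err != nil {
		if errors.Is(err, ErrChunkNotFound) {
			return nil, fmt.Errorf("chunk %s does not exist in the storage", chunkID)
		}
		return nil, err // any other transport or storage error
	}
	return data, nil
}
```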

The reason I ask is that GetFileInfo() makes an API call, which could be billable. In my case I'm using Dropbox, which doesn't charge money for API requests but does throttle based on the number of API requests per time period. Removing this call would halve the number of API calls per chunk and could roughly double download throughput.

This sounds unlikely: retrieving file info should take significantly less time than downloading an average 4 MB chunk, unless you have a connection with both very high latency and very high bandwidth.

Having said that, doing what you’re asking is possible, but it’s probably messier than you anticipate. Detecting the “file not found” condition is storage-specific and is encapsulated within GetFileInfo() for each storage backend. DownloadFile() as it stands won’t tell you whether the file exists; it returns a storage-specific error that simply bubbles up, and the caller of DownloadFile() has no good way to determine whether it was caused by the file not being there or by something else. That’s not to mention that the download path would also need to take over the logic for locating chunks in multi-level nesting setups, which currently lives inside FindChunk().
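For what it’s worth, the per-backend piece would look something like the sketch below: each storage would need its own translation from its SDK’s error into a shared sentinel before the caller of DownloadFile() could use errors.Is() to detect a missing chunk. Everything here (httpStatusError, translateNotFound, ErrChunkNotFound) is a hypothetical illustration, not existing duplicacy code:

```go
package sketch

import (
	"errors"
	"fmt"
	"net/http"
)

// Shared sentinel that callers could check with errors.Is().
var ErrChunkNotFound = errors.New("chunk not found")

// httpStatusError is a placeholder for whatever error type a particular
// backend's SDK actually returns; every SDK has its own shape.
type httpStatusError struct {
	StatusCode int
	Message    string
}

func (e *httpStatusError) Error() string { return e.Message }

// translateNotFound is the storage-specific part: it knows how *this*
// backend signals a missing object (here, HTTP 404) and wraps it in the
// shared sentinel while preserving the original error.
func translateNotFound(err error) error {
	var httpErr *httpStatusError
	if errors.As(err, &httpErr) && httpErr.StatusCode == http.StatusNotFound {
		return fmt.Errorf("%w: %v", ErrChunkNotFound, err)
	}
	return err
}
```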

All in all, this probably won’t be at the top of anyone’s priority list to fix. If you care enough, your best bet is to fix it in your own fork and submit a PR; eventually it may get merged into the mainline.

Lookup-before-upload is the central idea that made Lock-Free Deduplication possible and this is unlikely to change.
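For context, the upload path’s lookup-before-upload works roughly like the outline below (a simplified, hypothetical interface, not the actual duplicacy Storage API): the existence check is what lets concurrent backups skip chunks another client already uploaded, without any locking.

```go
package sketch

// Storage is a hypothetical, simplified stand-in for the real interface.
type Storage interface {
	GetFileInfo(path string) (exists bool, size int64, err error)
	UploadFile(path string, content []byte) error
}

// uploadChunkIfMissing looks the chunk up first and only uploads it when
// absent; dropping this lookup would break deduplication, because every
// backup would blindly re-upload chunks that are already in the storage.
func uploadChunkIfMissing(storage Storage, path string, content []byte) error {
	exists, _, err := storage.GetFileInfo(path)
	if err != nil {
		return err
	}
	if exists {
		return nil // already present: deduplicated, nothing to upload
	}
	return storage.UploadFile(path, content)
}
```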

@sevimo It doubles throughput for Dropbox because Dropbox rate-limits based on the number of API calls, not on the bytes returned. A get_metadata call that returns 100 bytes is just as expensive as a download call that returns 4 MB, since both contribute equally to the rate-limit counter.

Yeah I understand that it would be a significant amount of work and may not even work for all storage types. Just wondering if it would even be possible or worth exploring.

@gchen This question is specifically for the download path, not the upload path. Would that be theoretically possible?

I understand how API rate limiting works; it’s the same setup for ODB or GDrive. My point is that because actual download/upload API calls take some time to complete, their rate won’t be as high (even at double the rate) and might not trip the per-unit-time limits. It’s a bigger issue for the check/prune commands, where most of the API calls are trivial (locate/delete a file), so it’s easy to hit API throttling. I know that in my case I can saturate bandwidth on upload/download without throttling (unless I go with a stupid number of threads), but check/prune can hit these limits easily with just a few threads.
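To put some rough numbers on that: assume 4 MB chunks, about 100 Mbps of effective throughput per connection, and roughly 0.1 s per trivial metadata call (all three numbers made up for illustration). A download thread then issues only ~3 API calls per second while a check/prune thread can issue ~10, which is why the trivial-call commands hit the throttle first:

```go
package main

import "fmt"

// Back-of-the-envelope illustration only; all three numbers below are
// assumptions, not measurements of any particular storage backend.
func main() {
	const chunkBits = 4 * 8 * 1e6    // 4 MB average chunk, in bits
	const linkBitsPerSec = 100 * 1e6 // assumed effective per-connection throughput
	const metadataCallSeconds = 0.1  // assumed latency of a trivial locate/delete call

	downloadSeconds := chunkBits / linkBitsPerSec // ~0.32 s per chunk download
	fmt.Printf("download thread:    ~%.1f API calls/sec\n", 1/downloadSeconds)
	fmt.Printf("check/prune thread: ~%.1f API calls/sec\n", 1/metadataCallSeconds)
}
```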

If Dropbox API throttling is more severe, it might be a bigger problem than I am used to.

Dropbox throttling is pretty severe. It takes almost no time to download 4 MB on a gigabit pipe (32 Mbit / 1000 Mbps = 0.032 seconds), which is probably comparable to the latency of a get_metadata call. I find that my chunk-checking command spends about half of its time backing off from HTTP 429s. Sure, it may not be exactly double the throughput, but it would be a huge improvement, probably close to 2x.
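As a rough model of why it comes out near 2x when throttling dominates: once the binding constraint is the allowed number of calls per time window rather than bandwidth, chunk throughput is just the call budget divided by the calls spent per chunk. The budget number below is one I’m making up purely for illustration, not Dropbox’s actual limit:

```go
package main

import "fmt"

func main() {
	const callBudgetPerMinute = 300.0 // hypothetical rate-limit budget, not Dropbox's real number

	withLookup := callBudgetPerMinute / 2    // get_metadata + download per chunk
	withoutLookup := callBudgetPerMinute / 1 // download only per chunk

	fmt.Printf("chunks/min with GetFileInfo():    %.0f\n", withLookup)
	fmt.Printf("chunks/min without GetFileInfo(): %.0f\n", withoutLookup)
	// When backoff from 429s dominates, dropping the extra call per chunk
	// roughly doubles the number of chunks that fit into the same budget.
}
```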