Any tips to boost google drive performance as much as possible?

Hello Duplicacy community, I have searched for tips on improving backup/restore performance to Google Drive & not found much, so I'm making this thread to hopefully get some tips from other seasoned users who have managed it.

I use the Duplicacy web version on a Windows NAS to make local & cloud encrypted backups of my local NAS data. My cloud of choice is Google Workspace, because I happen to have a lot of space available there. Even though it's often argued that it's not the best cloud destination, I would like to make the best of it if possible.

For other, separate usage (data that doesn't need versioning/deduplication and so on) I use rclone to push data to/from this Google Workspace and easily saturate my gigabit connection, so I know this is not a bandwidth problem towards Google. I have played with the 'threads' option and usually get Duplicacy to transfer at most 15MB/s (while rclone does 100+MB/s without a problem).

I thought this might have something to do with my chunk size? Duplicacy’s default is 4MiB as far as I know & sending loads of small files is slower than sending bigger files. Would it be better if I boosted this to a higher number? Would it make a difference?

Does it make any difference if the backup is encrypted or not?

Other than that I only know of -threads that I can use to influence the throughput. If anyone has some ideas to share with me I’d be happy to hear them.

Another thing I wanted to ask about is 'copy-compatible' - is there any disadvantage whatsoever to having my cloud backup 'copy-compatible'? Once I've found the ideal settings for my cloud backup, I figure I'll just tick 'copy-compatible' & use the exact same settings for the local backup. Then it should always work interchangeably to copy it to other local & cloud destinations, right? I don't see much benefit to using different encryption keys etc. for the different backup locations.

I found a topic here where a user apparently changed the chunk size to 14MB and then achieved 100MB/s speeds.

I guess I have some testing to do.

This strongly depends on the type of data you are backing up. For static data this doesn't matter - you upload it once and that's it. For static data that only gets added to in large chunks (e.g. media), a huge chunk size will help with practically no downsides. For data that experiences small, distributed changes (text documents, code bases, databases) there is a point where the throughput gains from large chunks are negated by the massive amount of extra data that needs to be uploaded and stored because the chunks are too large. Whether that matters of course also depends on how much of that type of data you have and how much of it changes. If it's too much, this can matter even if the storage itself is unlimited, like Google Drive.

From my limited personal experience backing up my entire iCloud folder (about 1TB) to Storj with an average 32MB chunk size, I did not notice any excessive space usage over the last year. I don't have a 4MB-chunk-size history of the same data to compare against, but the size on the storage looks reasonable.

I'm pretty sure it must be a power of two.

Yes, it’s possible that user was talking about number of threads and not chunk size, not sure.

Anyway, so that’s worth a shot for me to try.

I have been experimenting & taught myself to use the CLI, where I found the 'benchmark' feature, which was very helpful. I have been testing against Google Drive with many different settings, tinkering with chunk size & threads; here are some preliminary findings:

The benchmark splits a 256.00M file into chunks by default. I've monitored my NAS & no resources appear to be maxed out.
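
For reference, the kind of invocation I've been running looks roughly like this (the storage name "gdrive" is just what I named mine; sizes are in MB):

```sh
# Run from the repository directory; -file-size defaults to 256 (MB),
# which is where the 256.00M figure above comes from.
duplicacy benchmark -storage gdrive \
  -chunk-size 16 \
  -upload-threads 4 \
  -download-threads 16
```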

In the benchmarks I consistently get poorer upload speeds than download speeds; why this is, I don't know. Upload seems to cap out at 4-5 threads: increasing it further doesn't provide more speed, sometimes even less. Chunk-size has little to no influence on this, whether at the default value (4M) or higher.

For upload, the best speeds I've hit are around 10-15MB/s, sometimes 20MB/s, but it's inconsistent.

For download, a threads value of 12-16 seems to be good; higher values provide diminishing returns & not much added throughput. 35-38MB/s is about as good as it gets (pretty good and acceptable to me).

chunk-size:
I know setting this depends on your own library's file sizes, so it can be useful to first analyze what kind of file sizes you're dealing with. I'm not going to use a fixed chunk size, as I have many smaller files.

- 1M (just to test): 10-12MB/s download
- 4M (the default): an acceptable 20MB/s download
- 8M: 25MB/s
- 16M: 35-38MB/s download
- Higher chunk-sizes: slightly better, but diminishing returns.

I have decided to land on the chunk-size 32M - then the smallest chunks will be 8M and the largest will be 128M (funnily enough, I see 128M is the chunk value used in my rclone setup too)

In summary:
YMMV, this is just what I concluded from my benchmarking:

For upload - threads 4-5, more doesn’t help.
For download - threads 12-16, diminishing returns from more.
chunk-size 32M (min. 8M, max 128M)
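
In CLI terms that corresponds to initializing the storage with something like the following (the snapshot ID and gcd:// path are just placeholders for illustration):

```sh
# -min and -max default to chunk-size/4 and chunk-size*4, so 8M/128M follow
# automatically from -c 32M; they are spelled out here for clarity.
duplicacy init -e -c 32M -min 8M -max 128M my-nas-backup gcd://Backups/duplicacy
```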

I am using the shared google project .json

Does anybody know if there is any benefit to upload using your own project .json? I am sure the queries per user quota is reasonable on the shared project.

I think the default nowadays is 2,400 queries per minute per user per Google Cloud project - not sure what the parameters are for the Duplicacy shared project.

But if that was the issue, I imagine it would rate limit the downloads as well, which it does not…

Run the same benchmark against some object storage, like AWS S3. This would help determine whether this is a Google Drive limitation.

10-15MB/s, however, is not bad. You can only upload 750GB/day to Google Drive anyway, so there is little reason to go faster than 8.5 MB/s.

Yes, I agree. I'm actually more interested in achieving as good a restore/download speed as I can; it's much more important to be able to get the restored data back in a reasonable time than to speed up this initial upload of my data.

@saspus do you know if support for using your own custom project .json to initialize a Google Drive storage is built into Duplicacy now? I read in some older forum threads that it had to be compiled into the binary etc.

Does it work out of the box now, or does the .json need some modifying anyway?

edit: As an experiment, I have tried rclone copying the generated chunks (a sample of many GBs) from gdrive to the file system & vice versa. I consistently achieve 100+MB/s without a problem, both uploading & downloading.

At least this tells me that if I want to move or copy my Duplicacy storage somewhere else once it’s done uploading to gdrive, I can do this very fast with rclone.
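
Something along these lines, with my own remote and paths swapped in (the names here are just examples):

```sh
# Pull the entire Duplicacy storage folder down from the gdrive remote in parallel.
rclone copy gdrive:Backups/duplicacy /mnt/backup/duplicacy --transfers 16 -P
```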

But I have read on the forums that if I do a Duplicacy copy instead, the chunks get checked as well, and that if I want to be able to copy the storage it must be bit-identical.

I'm pretty sure that change has been merged. 90% sure. You can try (I stopped using Google Workspace a long time ago so I'm a bit hazy on what worked). This is my post from the past, where I was configuring Duplicacy with a service account: Duplicacy backup to Google Drive with Service Account | Trinkets, Odds, and Ends

For a service account, the .json still needs modification per the above.

Yes. To be able to copy with third-party tools, the target storage needs to be initialized as bit-identical to the source storage.
To copy with Duplicacy (which provides finer granularity) it needs to be initialized as copy-compatible with the source storage.
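
With the CLI, that is roughly the difference between these two forms of add (storage names, snapshot ID and path are placeholders):

```sh
# Copy-compatible only: duplicacy's copy command can move snapshots between them.
duplicacy add -e -copy gdrive local my-nas-backup /mnt/backup/duplicacy

# Copy-compatible and bit-identical: chunk file names and contents match the
# source, so third-party tools (rclone/rsync) can also sync the storages directly.
duplicacy add -e -copy gdrive -bit-identical local my-nas-backup /mnt/backup/duplicacy
```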

I would not worry about the speed of restore. In a catastrophic failure scenario you likely don't need all ten terabytes of data at once. You can restore a small working set first and the rest in the background. The expected scenario is that you never need to restore, so optimizing for something that is expected to never happen seems counterproductive.

Thanks for the info, I might try with a custom .json later. I don't really need to use a service account, I can use the full account. But if that doesn't work I'll follow the guide carefully.

I want to ask about this bit-identical stuff. If I copied the finished storage directly from gdrive to a local disc with rclone and THEN started using duplicacy for future copying to the local disc, would that work? Would that be the same as a copy-compatible bit-identical storage?

Or do I need to follow some steps to ‘prepare’ the local disc storage to be both bit-identical & copy-compatible ?

Yes.

To be clear, the -bit-identical flag’s main purpose is for when you want to regularly ‘sync’ chunks between two storages, using a third party tool. If you’re only using Duplicacy, it’s fine to have a copy-compatible, non--bit-identical storage.

By manually copying an entire storage, you're effectively making a bit-identical storage, by virtue of the fact that the config file has the same keys.

On the subject of this post, keep in mind that one reason for the throughput limits is that Duplicacy is effectively re-packing each and every chunk - i.e. decrypting, decompressing, verifying the hash, recompressing, re-encrypting. Even with multiple threads, you're basically bottlenecked by CPU more than bandwidth.

Incidentally, this is why it's a good idea to practice 3-2-1, so you have at least two storages - one local, one remote - where one can 'repair' the other. Whether you choose -bit-identical or not doesn't really matter.

But just to illustrate a nice feature of the way chunking works… if you planned to make a local copy, you can 'pre-seed' a local storage by making it copy-compatible and doing a backup (with temporary IDs) to populate chunks. Then you can use Duplicacy to copy from cloud to local to fill in the gaps for historic snapshots (or Rclone/rsync in the case of -bit-identical). Handy if you're bandwidth-impaired, although in your case it'd probably be faster to Rclone a local copy first, then run check -chunks to be sure all's good.
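
A rough sketch of that pre-seed flow, assuming the local storage has already been added as copy-compatible (all names here are placeholders):

```sh
# Back up the current data straight to the local storage (added earlier under a
# temporary snapshot ID) so that most chunks already exist locally...
duplicacy backup -storage local

# ...then let duplicacy copy fill in only the missing chunks and the historic
# snapshots from the cloud storage.
duplicacy copy -from gdrive -to local -threads 8
```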

Ok, sounds to me like there aren't really any downsides to bit-identical, but it's not really necessary. If I ever want to pull down the storage from gdrive or transfer it somewhere else, I'll just make a new storage at that time (so it's automatically bit-identical & copy-compatible anyway). And I think I'd prefer that Duplicacy takes care of the local storage rather than rclone, as I want to make use of the extra functionality.

I think I’m going to go with:
→ backup data to gdrive
→ copy gdrive storage copy-compatible to the local disc

Good idea? Then it always uses the reliable storage of gdrive as a reference.

How shall I go about doing this, will the GUI sort it out? If I make a new storage in the GUI on the local disc that's copy-compatible with my gdrive storage, does it automatically set the chunk-size to 32M like I've configured for my gdrive? If I use the exact same encryption password for the storage, is that the same as bit-identical, or will it generate some other random keys?

Thanks for answering my questions y’all. :smiley:

It's less efficient (you have to upload then download), but it'll work. In fact, I mainly do the opposite (backup to local → copy local to remote), although very recently I did it the way you describe for a subset of data I have backed up on an external drive. So both are fine.

Yes, you can do that. All copy-compatible storages have the same chunk sizes by definition (that's what makes them copy-compatible).

Or you can just copy the config file across and initialise on that; that makes it copy-compatible and bit-identical.

If you do the latter, the password will be the same (you can always change it later with the password command), or you can specify a different password when adding the new storage in the GUI (or with the add -copy -bit-identical command).
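
For example, the config-copy route could look something like this (remote names and paths are placeholders, and I'd double-check the docs, but adding the storage afterwards should detect and reuse the existing config):

```sh
# Copy only the storage's config file from the cloud storage into the new,
# otherwise empty local storage directory.
rclone copy gdrive:Backups/duplicacy/config /mnt/backup/duplicacy/

# Adding the storage then picks up the existing config (same keys and chunk
# parameters), making it copy-compatible and bit-identical; expect to be
# prompted for the existing storage password.
duplicacy add -e local my-nas-backup /mnt/backup/duplicacy
```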

You can’t really go wrong with either method.

Fair point. Actually, since I want my live data to live in Google Drive from now on, it's going to be:
→ Download (hopefully from Google File Stream or an Rclone mount) → upload to a separate gdrive → download/copy to local storage.

I agree it would be more efficient to back up locally first & then copy to gdrive.

With my own project .json credentials, I decided to make some quick tests for upload, using thread counts around the capped values I had discovered earlier for download. These were just backup tests; I might look at using the benchmark tool again.

Average 32MB chunks:

- 4-8 upload threads: 14MB/s
- 16 upload threads: 25MB/s
- 20 upload threads: 31MB/s
- 24 upload threads: 14MB/s (rate limiting, I suppose…)
- 30 upload threads: FAILED, with this message in the log:

http2: server sent GOAWAY and closed the connection; LastStreamID=, ErrCode=ENHANCE_YOUR_CALM, debug=“dos_requested_goaway”
:sweat_smile:

Went back to 20 threads, probably rate limited again because I get 15MB/s.

I will report back after more testing, but it seems that whenever it is rate limiting now with my own project token (it could also have something to do with the particular chunks of that upload), it drops to a rate of 13-14MB/s. Completely acceptable to me.

The rate limiting always seems to happen just as the upload begins, so that's the point at which I'll see whether I can manage the max throughput of 20 threads or whether it will get rate limited.

EDIT:
8 threads: 26 MB/s.

Seems a fine value to settle on, since I might have 2 backups running to this gdrive at the same time.
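
So the per-job invocation ends up being roughly this (the storage name is just what I called mine):

```sh
duplicacy backup -storage gdrive -threads 8 -stats
```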

So with my own project token & on a more powerful server, I tried different parameters. The highest upload I've achieved has still been about 30MB/s; that seems to be a limit, at least in my experience, no matter what chunk/thread combo I use.

But I have been able to download at over 80MB/s with 16 threads and 32MB average chunks.

In conclusion, the values I found earlier still work well & I am sticking to those.