Best thread count for SFTP?

Hi, I want to use two SFTP locations for my backups (Hetzner Storage Box and a similar offering by NetDynamics). What is the optimal thread count for this kind of storage?


It depends on many factors; there is no universal answer. Start with 1 and keep increasing until you no longer see improvement.

I thought there was a sort of rule of thumb. I have set the thread count to 4 for the initial backup. Thanks!

It all depends on the implementation of the target SFTP server, power of your local CPU, target CPU, storage type, etc. Too many variables.

For example, if the target SFTP server is running on a low-power ARM processor with a limited amount of RAM and a storage system based on HDDs without cache – anything more than 1 thread will be counterproductive, as the seek time will decimate the throughput.

If the target is a commercial datacenter – maybe 100 threads will work great too? Or maybe you’ll be hitting CPU encryption/decryption performance caps. Or network throughput? Or maybe switching to S3 will provide better throughput?

Instead of analyzing all that – finding the best parameters empirically is quicker and much more reliable.
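The “increase until it stops helping” approach can be sketched as a tiny search loop. This is an illustrative sketch only: `measure` is a hypothetical callback that runs your backup tool at a given thread count and reports throughput; it is not part of any real CLI.

```python
def tune_threads(measure, max_threads=64, min_gain=1.05):
    """Double the thread count until throughput stops improving by >5%.

    measure(threads) -> throughput in bytes/sec (the caller is expected
    to time a real backup run; here it is just a stand-in function).
    """
    best_threads, best_rate = 1, 0.0
    threads = 1
    while threads <= max_threads:
        rate = measure(threads)
        if rate < best_rate * min_gain:  # less than 5% gain: stop
            break
        best_threads, best_rate = threads, rate
        threads *= 2
    return best_threads

# Example with a fake workload whose throughput plateaus at 8 threads:
fake = lambda t: min(t, 8) * 10_000_000
print(tune_threads(fake))  # 8
```

In practice `measure` would time a fixed-size test backup and divide bytes transferred by elapsed seconds.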

Thanks. I think I will use S3-compatible storage instead. I am testing with iDrive e2 and Backblaze B2, and backups and restores are a lot faster.

I would not use iDrive as a backup destination. Search this forum for details.

Backblaze, Wasabi, and Storj are fine. (Out of those three I recommend Storj, but do increase the chunk size to 32-64.)

Thanks. I signed up for Storj, but how do I change the chunk size? Is there a complete reference for all the CLI arguments somewhere?

Sorry, I meant the web GUI, not the CLI. What difference does it make if I leave the chunk size at the default?

Cost of storage will be higher and performance lower.

Storj charges a per-segment fee of $0.0000088 per month. A segment is a blob of up to 64 MiB. If a file is smaller, it still takes up one segment.

Each file transfer has a fixed per-request overhead, so you want to prefer larger chunks. This one is not specific to Storj.

So, you want to be as close to 64 as possible, to reduce the number of chunks. For example, you could set the min chunk size to 16 and the max to 64, or set a fixed chunk size of 64. It depends on what kind of data you are backing up: for example, media and virtual machines will definitely benefit from a fixed 64 MiB chunk size.
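To see why the per-segment fee rewards larger chunks, here is some back-of-the-envelope arithmetic. Illustrative only: it assumes every chunk occupies its own segment and uses the $0.0000088/segment-month figure quoted above.

```python
# Monthly Storj segment fees for 1 TiB of backup data, comparing the
# default 4 MiB average chunk size with fixed 64 MiB chunks.
SEGMENT_FEE = 0.0000088  # USD per segment per month
TIB = 1024 ** 4
MIB = 1024 ** 2

def monthly_segment_cost(total_bytes, chunk_bytes):
    # Each chunk is <= 64 MiB, so it occupies exactly one segment.
    segments = total_bytes // chunk_bytes
    return segments * SEGMENT_FEE

cost_4m = monthly_segment_cost(TIB, 4 * MIB)    # default chunk size
cost_64m = monthly_segment_cost(TIB, 64 * MIB)  # fixed 64 MiB chunks

print(f"4 MiB chunks:  ${cost_4m:.2f}/month in segment fees")
print(f"64 MiB chunks: ${cost_64m:.2f}/month in segment fees")
```

So for the same terabyte, default-sized chunks cost roughly 16x more in segment fees than 64 MiB chunks.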

While you do want to get close to 64, you also don’t want to end up with an excessive number of large chunks unnecessarily if your data is high-turnover small files.

Chunk size is a power of two, so you don’t really have many options: 16, 32, or 64.

I don’t know if you can init storage with a non-default chunk size in the web UI.

I would init it with the CLI and then add that storage to the web UI.

Another consideration with Storj: if your router and modem are not capable enough, using the native Storj integration may knock your internet down :). In this case you can switch to using their S3 gateway. This is also useful if you have limited upstream bandwidth: native integration can provide significantly higher performance, but at the expense of 2.7x more upload traffic and a massive number of connections that your network equipment may not be able to handle.

For example, my home Xfinity cable modem dies if I upload using 20 threads with the native endpoint. On a fiber connection in the office using Cisco equipment it saturates my gigabit uplink.
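A quick sketch of what the 2.7x expansion means for upstream traffic. The 2.7x figure is the approximate erasure-coding overhead mentioned above; the exact ratio varies.

```python
# Approximate bytes on the wire for a given backup size, native Storj
# integration (erasure-coded pieces uploaded from your machine) versus
# the S3 gateway (the gateway does the erasure coding for you).
EXPANSION = 2.7  # approximate native-upload expansion factor

def wire_traffic_gb(data_gb, native=True):
    return data_gb * EXPANSION if native else data_gb

backup_gb = 100
print(f"native : ~{wire_traffic_gb(backup_gb, native=True):.0f} GB uploaded")
print(f"gateway: ~{wire_traffic_gb(backup_gb, native=False):.0f} GB uploaded")
```

On a 400 Mbps uplink that difference is the gap between a backup finishing overnight and it competing with everything else on your connection for much longer.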

Is the chunk size defined in megabytes? So a chunk size of 64M means that even tiny files take 64M of storage or that many of those are packed inside a chunk of 64M?

While waiting for your reply on the chunk size I tried creating a backup out of curiosity and it failed with various errors like

ERROR UPLOAD_CHUNK Failed to upload the chunk 93075175c2661a8da5fa3134849b93a9df6b4360e1e266b8306ebf3d833de450: uplink: stream: open /var/folders/_g/sf970v_n2m12s5_ygycxnkrm0000gn/T/tee564610016: too many open files
ERROR UPLOAD_CHUNK Failed to upload the chunk 44a4f47b4256cee2712fd6390e6ec2cdbe9409bc2e31f943753f57f0f2099469: uplink: stream: ecclient: successful puts (0) less than or equal to repair threshold (35), ecclient: failed to dial (node:12qkZDRYcmxaWCcJsS9t6VqtcnSdMczWbDgr3cTpYC6SF3sscqy): piecestore: rpc: tcp connector failed: rpc: dial tcp socket: too many open files; ecclient: failed to dial (node:1p5gNuNz3h5JbUwm25UjkLgVFfXUivZpXirgHXZhNYbtgP2L7y): piecestore: rpc: tcp connector failed: rpc: dial tcp socket: too many open files; ecclient: failed to dial 

Lots of them, so the first impression isn’t great… On the other hand, I have used iDrive e2 since the beginning and, apart from a performance issue in the first few days which was fixed quickly, it has been rock solid for me. Both fast and reliable. I have been using it for both backups and media uploads for my apps. I can’t find a recent thread in this forum about problems with iDrive e2, only something from when it was launched, so perhaps those were initial problems with a very new service?

Do you mind suggesting the CLI command to initialise Storj correctly? I am reading various pages but I want to be sure I do it right.

Sure. Give me a few mins, I’ll get back to you on both questions.

Can you also recommend chunk size for Backblaze, since I will be using it as secondary for the COPY task?

I forgot to reply to this part. I have 1000/400 broadband at home, and 1000/1000 on the server. Which method is best in this case? Also, is it possible to change it later without restarting from scratch?

I have tried with a 32 average chunk size, 16 min and 64 max, and 20 threads, and it’s significantly slower than when it was backing up to iDrive with default chunk settings. And it always fails due to max open files, even though I have increased the limits in macOS by a lot.

Yeah, Duplicacy doesn’t natively work great with Storj. It’s a big miss in my opinion – I also use Arq Backup, and when selecting Storj in Arq it automatically sets the proper chunk size, with no need to worry about fiddling with anything.

In the meantime I created an account with Wasabi and am using that as primary and iDrive as secondary. I couldn’t get Storj to work despite many attempts.

This is not a bug, have a look at this:

When uploading we start sending 110 erasure-coded pieces per segment in parallel out to the world, but stop at 80 pieces. This has the same effect as above in eliminating slow nodes but attempting more connections than are required to reconstitute the file.

Then you have 20 parallel connections in Duplicacy, so you end up with up to 2200 sockets. This exceeds the default limits.
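The socket math is simple multiplication; a minimal sketch, assuming each in-flight segment upload opens up to 110 piece connections as described in the quote above:

```python
# Peak socket count with the native Storj backend: each upload thread
# starts up to 110 erasure-coded piece connections per segment.
PIECES_STARTED_PER_SEGMENT = 110

def peak_sockets(threads):
    return threads * PIECES_STARTED_PER_SEGMENT

print(peak_sockets(20))  # 2200 -- well above a typical 1024 default ulimit
```

That is why raising the open-file limit (or dropping the thread count, or switching to the S3 gateway) is needed before 20 native threads can work.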

Upload will work. The concern is longevity and data integrity. You can just google their name; tons of issues reported on various forums, and what’s more important – helpless support.

Look at this thread as well: Are you happy using iDrive e2 with Duplicacy - #8 by saspus

Sure, just add -chunk-size 64M to the init command. Or -min-chunk-size 16M -max-chunk-size 64M; that might work better.

Backblaze does not charge per segment, so the default 4 MiB is fine. But you will get slightly better performance with larger chunks; if you back up mostly media – crank it up!
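For a destination with no per-segment fee, the main effect of chunk size is the number of objects stored, and hence per-request round-trips. A rough illustration:

```python
# Number of chunks (objects) a backup produces at different average
# chunk sizes. Fewer chunks means fewer API requests per backup/restore.
GIB = 1024 ** 3
MIB = 1024 ** 2

def chunk_count(total_bytes, avg_chunk_bytes):
    return total_bytes // avg_chunk_bytes

data = 500 * GIB
print(chunk_count(data, 4 * MIB))   # default 4 MiB average
print(chunk_count(data, 64 * MIB))  # 64 MiB chunks: 16x fewer requests
```

The absolute per-request overhead is small, which is why the default is fine here; it only becomes worth tuning for large, mostly-static data like media.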

Yes, you can change how you connect to the bucket at any time. As long as Duplicacy has access to the files, it does not care about the exact transport. You can later copy all the data to a local disk if you want, and Duplicacy will read that datastore just fine.

That means something in your environment is a bottleneck.

Try with S3 instead:

  • Saves upstream bandwidth (2.7x)
  • 110 times fewer connections needed
  • If you are far from the gateway, performance may be worse; with native integration you get awesome performance from anywhere. But since you don’t in your environment (which is not uncommon for home/small business connections) – using S3 may still be better overall.

Perhaps it’s running out of something else. I haven’t tried on macOS, I’m running it on FreeBSD, and it’s pretty solid; the issue is with my modem, not the OS.

Ideally, Duplicacy would have different defaults per remote. But it does not.

Have you tried using the S3 gateway? That’s what I’m using from home, for the same reasons – my ISP-provided modem can’t handle the multiple connections of the native backend.

In your Storj account, create S3 credentials for your bucket; you will get an API key, secret, and endpoint. Then use the Amazon S3 backend described here: Storage Backends · gilbertchen/duplicacy Wiki · GitHub

BTW, Arq also uses Storj via S3, which is a sensible choice, given most customers are on residential connections.

Wasabi is fine, just be aware of limitations:

  • 3-month minimum retention
  • 1 TB minimum charge
  • You can egress at most an amount of data equal to the amount stored, and they can ban your account if you exceed that.

With iDrive – I would not even bother. Tons of issues, misaligned incentives; see the thread above.

Backblaze may be OK – but they have also allowed mishaps in the past that were unacceptable at their present stage of product maturity (like the API returning bad data).

There are not many reliable alternatives. You can pay for AWS, but Duplicacy does not support archival tiers; nearline storage is quite expensive, and if you want multi-region – multiply by the number of regions. And you pay for transfers between regions.

My general advice – stay away from small companies running on thin margins and cutting corners everywhere. They are OK for data sync, but not for long-term backup.

I like Storj because data integrity is at the foundation of its design – it can’t return bad data, and it’s decentralized by design. So I trust it. It also happens to be ridiculously cheap for some reason, for all the features offered.

Thanks @saspus, I will try again this time with the s3 gateway. Did I understand it correctly, from your previous messages in other threads, that I don’t need to enable encryption with Duplicacy when using it with Storj because it’s already encrypted?

Which region do I use with the s3 gateway?