Sorry, I meant the web GUI, not the CLI. What difference does it make if I leave the chunk size at the default?
Cost of storage will be higher and performance lower.
Storj charges a per-segment fee of $0.0000088. A segment is a blob of up to 64 MiB. If a file is smaller, it still takes up a whole segment.
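A rough back-of-the-envelope (my numbers, assuming each chunk maps to one segment):

    1 TiB in  4 MiB chunks ≈ 262,144 segments × $0.0000088 ≈ $2.31/month in segment fees
    1 TiB in 64 MiB chunks ≈  16,384 segments × $0.0000088 ≈ $0.14/month in segment fees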
Each file transfer also has a fixed per-transfer overhead, so you want to prefer larger chunks. This one is not specific to storj.
So, you want to be as close to 64 as possible, to reduce the number of chunks. For example, you could set min chunk size 16 and max 64, or set a fixed chunk size of 64. It depends on what kind of data you are backing up: for example, media and virtual machines will definitely benefit from a fixed 64 MiB chunk size.
You do want to get close to 64, but you also don't want to end up with an excessive number of large chunks unnecessarily if your data is high-turnover small files.
Chunk size is a power of two, so you don’t really have many options. 16, 32, or 64.
I don't know if you can init storage with a non-default chunk size in the web UI.
I would init it with the CLI and then add that storage to the web UI.
Another consideration with storj. If your router and modem are not capable enough, using native storj integration may knock your internet down :). In this case you can switch to using their S3 gateway. This is also useful if you have limited upstream bandwidth: native integration can provide significantly higher performance but at the expense of 2.7x more upload traffic and massive number of connections that your network equipment may not be able to handle.
For example, my home Xfinity cable modem dies if I upload using 20 threads with the native endpoint. On a fiber connection in the office using Cisco equipment it saturates my gigabit upstream.
Is the chunk size defined in megabytes? So a chunk size of 64M means that even tiny files take 64M of storage or that many of those are packed inside a chunk of 64M?
While waiting for your reply on the chunk size, I tried creating a backup out of curiosity and it failed with various errors like:
ERROR UPLOAD_CHUNK Failed to upload the chunk 93075175c2661a8da5fa3134849b93a9df6b4360e1e266b8306ebf3d833de450: uplink: stream: open /var/folders/_g/sf970v_n2m12s5_ygycxnkrm0000gn/T/tee564610016: too many open files
ERROR UPLOAD_CHUNK Failed to upload the chunk 44a4f47b4256cee2712fd6390e6ec2cdbe9409bc2e31f943753f57f0f2099469: uplink: stream: ecclient: successful puts (0) less than or equal to repair threshold (35), ecclient: failed to dial (node:12qkZDRYcmxaWCcJsS9t6VqtcnSdMczWbDgr3cTpYC6SF3sscqy): piecestore: rpc: tcp connector failed: rpc: dial tcp 185.101.25.78:28977: socket: too many open files; ecclient: failed to dial (node:1p5gNuNz3h5JbUwm25UjkLgVFfXUivZpXirgHXZhNYbtgP2L7y): piecestore: rpc: tcp connector failed: rpc: dial tcp 203.109.232.244:28967: socket: too many open files; ecclient: failed to dial
Lots of them, so first impression isn’t great… On the other hand I have used iDrive e2 since the beginning and apart from a performance issue the first few days which was fixed quickly, it has been rock solid for me. Both fast and reliable. I have been using it for both backups and media uploads for my apps. I can’t find a recent thread in this forum about problems with iDrive e2, only something from when it was launched, so perhaps those problems were initial problems with a very new service?
Do you mind suggesting the CLI command to initialise Storj correctly? I am reading various pages but I want to be sure I do it right.
Sure. Give me a few mins, I'll get back on both questions.
Can you also recommend chunk size for Backblaze, since I will be using it as secondary for the COPY task?
I forgot to reply to this part. I have 1000/400 broadband at home, and 1000/1000 on the server. Which method is best in this case? Also, is it possible to change it later without restarting from scratch?
I have tried with 32 avg chunk size, 16 min and 64 max, 20 threads, and it's significantly slower than when it was backing up to iDrive with default chunk settings. And it always fails due to max files open, even though I have increased the limits in macOS by a lot.
Yeah Duplicacy doesn’t natively work great with Storj. It’s a big miss in my opinion – I also use Arq backup and when selecting Storj in Arq it automatically sets the proper chunk size and no need to worry about fiddling with anything.
In the meantime I created an account with Wasabi and am using that as primary and iDrive as secondary. I couldn’t get Storj to work despite many attempts.
This is not a bug, have a look at this:
When uploading we start sending 110 erasure-coded pieces per segment in parallel out to the world, but stop at 80 pieces. This has the same effect as above in eliminating slow nodes but attempting more connections than are required to reconstitute the file.
Then you have 20 parallel connections in duplicacy, so you end up with up to 20 × 110 = 2,200 sockets. This exceeds default limits.
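You can see this for yourself; for illustration (the 256 default is typical for a macOS shell):

    ulimit -n         # per-process open file limit; often 256 on macOS
    ulimit -n 4096    # raise it for this shell, comfortably above 20 × 110 = 2,200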
Upload will work. The concern is longevity and data integrity. You can just google their name, tons of issues reported on various forums; and what’s more important – helpless support.
Look at this thread as well: Are you happy using iDrive e2 with Duplicacy - #8 by saspus
Sure, just add -chunk-size 64M to the init command. Or -min 16M -max 64M, that might work better.
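Something like this, as a sketch (the snapshot id and storage URL are placeholders for your own; see the Storage Backends wiki for the exact storj:// syntax):

    duplicacy init -e -chunk-size 64M my-backups <storj storage url>
    # or, with variable chunk sizes:
    duplicacy init -e -chunk-size 32M -min 16M -max 64M my-backups <storj storage url>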
Backblaze does not charge per-segment, so the default 4 MiB is fine. But you will get slightly better performance with larger chunks; if you back up mostly media – crank it up!
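For example, a sketch (the bucket name is a placeholder; b2:// is the Backblaze backend from the same wiki page, and 16M is just an arbitrary "larger than default" value):

    duplicacy init -e -chunk-size 16M media-backups b2://my-bucket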
Yes, you can change how you connect to the bucket at any time. As long as duplicacy has access to the files it does not care about the exact transport. You can later copy all the data to a local disk if you want and duplicacy will read that datastore just fine.
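As an illustration of that last point (a sketch, not an official procedure; rclone and all the names are stand-ins for whatever tool and paths you use):

    # 1. mirror the bucket to a local disk with any file tool
    rclone sync storj-remote:my-bucket /backups/storj-mirror
    # 2. point duplicacy at the local copy – it reads the same datastore
    duplicacy add mirror my-backups /backups/storj-mirror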
That means something in your environment is a bottleneck.
Try with S3 instead.
Pros:
- Save upstream bandwidth 2.7x
- 110 times fewer connections needed
Cons:
- if you are far from the gateway, performance may be worse. With native integration you get awesome performance from anywhere; but since you don't in your environment (which is not uncommon for home/small business connections), using S3 may still be better overall.
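To put the 2.7x into perspective, with made-up round numbers:

    native:      100 GB of new chunks → ~270 GB upstream (erasure-coded pieces)
    s3 gateway:  100 GB of new chunks → ~100 GB upstream (expansion happens at the gateway)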
Perhaps it’s running out of something else. I haven’t tried on macOS, I’m running it on FreeBSD, and it’s pretty solid; the issue is with my modem, not the OS.
Ideally, duplicacy would have different defaults per remote. But it does not.
Have you tried using the s3 gateway? That's what I'm using from home, for the same reasons – my ISP-provided modem can't handle the multiple connections of the native backend.
In your storj account, create S3 credentials for your bucket; you will get an access key, a secret key, and the endpoint https://gateway.storjshare.io. Then use the Amazon S3 backend as described here: Storage Backends · gilbertchen/duplicacy Wiki · GitHub
BTW Arq also uses storj via S3, which is a sensible choice, given most customers are on residential connections.
Wasabi is fine, just be aware of limitations:
- 3-month minimum retention
- 1 TB minimum charge
- You can egress at most an amount of data equal to the amount stored, and they can ban your account if you exceed that.
With iDrive – I would not even bother. Tons of issues, misaligned incentives; see the thread above.
Backblaze may be OK – but they have also allowed mishaps in the past that were unacceptable at their stage of product maturity (like the API returning bad data).
There are not many reliable alternatives. You can pay for AWS, but duplicacy does not support archival tiers; nearline storage is quite expensive, and if you want multi-region, multiply by the number of regions. And you pay for transfers between regions.
My general advice: stay away from small companies running on thin margins and cutting corners everywhere. They are OK for data sync, but not for long-term backup.
I like storj because data integrity is foundational to its design – it can't return bad data, and it's decentralized by design. So I trust it. It also happens to be ridiculously cheap for some reason, for all the features offered.
Thanks @saspus, I will try again this time with the s3 gateway. Did I understand it correctly, from your previous messages in other threads, that I don’t need to enable encryption with Duplicacy when using it with Storj because it’s already encrypted?
Which region do I use with the s3 gateway?
You don’t have to enable encryption in duplicacy, if you use storj native integration — it’s end to end encrypted.
When using the gateway, however, the gateway holds the encryption keys by necessity, so it is no longer end-to-end encrypted. I would continue using duplicacy encryption for consistency — maybe you'll want to move the duplicacy datastore elsewhere in the future.
There is no concept of regions with storj, but duplicacy wants one. You can specify any value; e.g., I tend to put "us-east-1". It gets ignored anyway.
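So a full init against the gateway might look like this (a sketch; the bucket, path, and snapshot id are placeholders, and the s3:// URL format is per the Storage Backends wiki linked above):

    duplicacy init -e my-backups s3://us-east-1@gateway.storjshare.io/my-bucket/duplicacy
    # duplicacy will prompt for the S3 access key and secret from your storj account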
It worked with “global” as region. It’s now backing up without errors, let’s see how it goes. Thanks for all the info so far!
BTW I set thread count to 20 and it’s backing up at 18 MB/s. How can I speed it up? What thread count should I use? I have 400 Mbps upload and with both iDrive and Wasabi it was backing up at almost full speed, but not with Storj.
Did you increase chunk size or is it the default?
Storj will have more latency, as each transfer needs to be uploaded to the gateway, then split into shards, erasure coded, and distributed to geographically uncorrelated nodes, and only then is confirmation returned (i.e., the gateway does everything the native integration would have done locally). Conventional providers don't have that overhead — they can just synchronously accept the whole file and write it to a single datacenter.
You can keep increasing thread count until performance no longer increases.
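For example (the numbers are arbitrary starting points):

    duplicacy backup -stats -threads 20   # baseline
    duplicacy backup -stats -threads 40   # double it; stop when throughput plateaus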
It's an interesting trade-off: the native integration is either very fast or very horrible, and the s3 gateway is somewhere in the middle.
I’ve actually experimented with running my own gateway on a cloud instance — I got excellent results, but the cloud instance needed to be quite beefy. I’ll try to find that thread.
I set -min-chunk-size 16 -max-chunk-size 64M - is that what you said?
I got excellent results, but the cloud instance needed to be quite beefy. I’ll try to find that thread.
I am using a Mac mini M2 Pro which is quite powerful, so it’s not a problem with my computer.
Found it. Completely new: where to start - #61 by saspus
I understand; what I mean is that when you use the s3 gateway all that work is done on the gateway, not on your machine. I guess if their capacity is limited, maybe that contributes to lower performance. To rule that out I tried running the native integration on an Amazon instance directly (doing what the gateway would accomplish).
Plus, latency if you are far from the gateway negates the benefits of the distributed nature of the network.
I'm still not sure what went wrong with your local integration — you have quite a beefy internet connection and hardware. I have a Mac M2, but my upstream is only 20 Mbps at home, so I might not be able to replicate your result. But I'll try tomorrow nevertheless. The running out of handles bit should not have happened; but my zshrc is quite extensive, so I might have configured something and forgotten. I'll check.
Revisiting the native one: what did you set the ulimit -n value to? Try something large, like 65536.
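For example, in the same shell that then launches the backup (a sketch; I haven't verified whether macOS also needs its hard cap raised):

    ulimit -n 65536
    duplicacy backup -stats -threads 20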