Completely new: where to start

Yes, that’s about right. The GUI gets simple jobs done and is probably enough for most people, but if you want finer control, the CLI is actually easier, because you don’t have to fight the middleman and figure out how to tell the GUI what you want to tell the CLI. In addition, the CLI is free for personal use, so there is that.

Yes, sort of. Rclone is a sync tool (no versioning, beyond the rudimentary “copy changed files to another folder”); Duplicacy is a proper versioned, point-in-time backup with configurable version history. You need backup, so use Duplicacy. But since most of your files are media, I suggested rclone as an alternative, since you don’t need versioning for those files.

But if you want to keep it simple – use Duplicacy for everything, including media files. It will work just fine.

See, this is one of the things that the CLI supports but that is not exposed in the GUI. Duplicacy has a duplicacy benchmark command that can benchmark the storage.
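For example, from inside the folder where the repository was initialized, something like this should do (storage_name is whatever name you gave the storage at init time):

    duplicacy benchmark -storage storage_name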

Unfortunately, this may not tell you the full picture. If your storage is empty, benchmark will upload a few test files and check performance, and it will be fast; but once your storage fills up and the target contains hundreds of thousands of files, enumerating those files can be very slow on OneDrive, as it was designed for a different use case. Increasing the average chunk size can help here too, by reducing the number of chunks required, at the expense of slightly worse deduplication. But it will be a net benefit in the end.

I dislike Wasabi for their predatory pricing. They claim it’s “hot storage” at simple pricing, but in reality there are gotchas:

  • You will pay for a minimum of 1TB regardless of how much you use.
  • If you delete a file before storing it for 3 months, you will pay a fee equal to the cost of storing it for 3 months.
  • And if you download more than you store, they can ban your account. Not a nice prospect.

It benefits from having started earlier.

Don’t worry about crypto tokens. Storj runs a network of independent node operators located all over the world, who contribute underutilized resources on their servers and get compensated for doing so. Figuring out tax codes and international payments is an insurmountable task, so they’ve created a utility token on the Ethereum network to facilitate the payments: they calculate payment in USD, convert to tokens, and pay operators. Operators can convert tokens to their local currency and deal with local taxes themselves.

As a customer, you can pay for storage with a plain old credit card. But they also accept their own tokens, because it would have been silly if they did not :slight_smile:

Can’t agree more. I discovered Storj a few years ago and am using it for everything now; it’s awesome. It’s way underpriced for what it is. Geo-redundant storage that behaves like a world-wide CDN at $4/TB!? Come on! If you want geographic redundancy with B2 or Amazon – you pretty much have to pay double.

Good point. I was a big Microsoft services aficionado in the past, but they have kept going downhill since the Office 365 rebranding, so I’m no longer using or recommending them. Google Docs/Sheets and Apple Pages/Numbers more than fulfill my needs (and I was quite a power user of Excel), so paying for an Office subscription when you can have the same or better tools for free seems counterproductive.

Right. Their service talks their own protocol, but they also run separate servers that “bridge S3 to their own protocol”, so customers whose software already supports the (de facto standard) S3 protocol can easily switch to Storj with their existing software.

The web UI is really only useful for small tasks; the storage is intended to be used via the storage API, by other apps like Duplicacy, rclone, etc.

At each backup, as a command option. If you put it into the “global” options, Duplicacy won’t like it, if I remember correctly.

Awesome!

Very good question.

Some filesystems have built-in checksumming support and can verify whether a file got corrupted, and even restore it from redundancy information if that is available. Systems that run such filesystems often run a periodic task that reads every file, compares checksums, and repairs any rot. This is usually a feature of NAS appliances. General-purpose home machines, including Windows, macOS, etc., don’t have that capability, and you are absolutely right: if you keep an old photo on an HDD, in 7-10 years it will likely rot. On an SSD – even faster, especially if the SSD is not kept powered.

The technical term is bit rot.

This is a very valid concern that many overlook.

Duplicacy cannot distinguish file corruption from deliberate change. So it will see – oh, the file changed, let me back up this new version – and indeed back it up as a new version. Now, if you notice at some point in the future that this photo got corrupted, you should be able to restore the older, uncorrupted version from the backup. That’s where a long version history comes in handy, and why, for something to be called a “backup”, it has to have a long version history.

Keeping data for just 1800 days is not enough. It’s just a default; you can manually edit the command line (the -keep options) to specify any other suitable retention. A common approach is to:

  • keep all backups made in the last two weeks
  • after two weeks, keep a version every day
  • after 90 days, keep a version every week
  • after one year, keep a version every month

If data does not change, new versions don’t occupy any extra space. The -keep parameters for the prune command to implement this scenario may look like so: -keep 31:360 -keep 7:90 -keep 1:14 -all
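Put together, the full prune invocation for that retention plan would look roughly like this (run from the repository folder, or on a schedule):

    duplicacy prune -keep 31:360 -keep 7:90 -keep 1:14 -all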

By the way, the same concern is valid with respect to the target storage. If, for example, you are backing up to a local USB HDD, it too can rot and corrupt the chunk data that Duplicacy writes; as a result you would not be able to restore some data.

Duplicacy offers a feature called “erasure coding”, where you can configure it to write chunk data with redundancy, to be able to tolerate occasional bit flips, but it’s not a panacea. Ideally, you want to avoid backing up to isolated hard drives, as they cannot provide data consistency guarantees. Cloud storage providers usually do. Storj, for example, encrypts all data by default with user-generated keys, so if data were to be corrupted it would simply fail to decrypt; therefore, by design it can’t return you bad data.
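If I remember the syntax correctly, erasure coding is enabled when the storage is initialized; something along these lines, where 5:2 (5 data shards plus 2 parity shards) is just an illustrative ratio and snapshot_id / storage_url are placeholders:

    duplicacy init -e -erasure-coding 5:2 snapshot_id storage_url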

Someone suggested to me a radically different approach for archiving family photos and videos – a low-tech solution that involves Blu-ray M-DISC with a stated 1000-year retention. That’s something to consider.

When you init the storage you have an option to encrypt it (see the -e flag of the init command). Then, to do both backups and restores, you need to provide that encryption key. Without that key the data just looks like noise and is unreadable by anyone. (Duplicacy can use the keychain on macOS and Windows, so you don’t actually have to specify it every time, of course.)
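A minimal sketch of that flow, with snapshot_id and storage_url as placeholders – the passphrase is set at init time and then required (or pulled from the keychain) for every later operation on that storage:

    duplicacy init -e snapshot_id storage_url
    duplicacy backup
    duplicacy restore -r 1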

For advanced scenarios Duplicacy supports an interesting feature: RSA encryption. It allows you to generate a key pair, and Duplicacy will use the public key for backup. To restore, however, you will need to provide the private key. This is useful if you back up from multiple computers to the same storage, to take advantage of Duplicacy’s cross-machine deduplication (a major selling point), but don’t want users to be able to restore each other’s data.
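The key pair itself is a regular RSA pair, e.g. generated with openssl; from memory the flag is -key on init, but treat the exact flag name as an assumption and check the wiki:

    openssl genrsa -aes256 -out private.pem 2048
    openssl rsa -in private.pem -pubout -out public.pem
    duplicacy init -e -key public.pem snapshot_id storage_url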


I would say that it misses some important options even for someone who is starting; not being able to modify chunk size, for example, seems like an important parameter optimization-wise. Being free would be an advantage if I had not already paid for the GUI version, which it seems I’m not actually going to use :sweat_smile:.

As you say, I’ll probably use only Duplicacy for simplicity’s sake; learning one CLI-based program already seems complex enough, and as someone new in this area I have to find out what each parameter does before breaking something. However, and I know I already asked this before, but I want to be sure I really understood: versioning could be useful even for media, as a bit rot prevention policy.

Sunk cost fallacy once again :laughing:

I guess that I won’t benefit from geo-redundancy as a domestic user; however, if it provides the same services for a lower cost, it is indeed an easy choice.

You are completely right, I’ve decided to leave Office 365 and go for STORJ + Google suite, Apple suite, or LibreOffice instead (I’ll have to compare between them).

What is the purpose of global options then?

Seems like a kind of “brute force” approach, but easy to implement. The only way I could think of to actually have control over this would be having two separate backups and somehow comparing between them periodically. I guess that big companies surely have this sorted out already.

I use Bitwarden for all my passwords, so I assume that by keeping that encryption key there I could recover my data even from another computer. Does this password have to comply with any specific requirements?

I’ve started to tinker a bit with the CLI version to use it alongside STORJ, which so far is a slow and confusing process for someone who has never used cmd on Windows.
After reading some documentation, and with the help that has been provided here, I’ve come to the following conclusions.

  • I’ve renamed the exe to duplicacy.exe and placed it in C:/Program Files/Duplicacy, then added said exe to PATH using the Environment Variables editor.
  • I assume the next step would be to init the storage by going into the folder I want to back up and then: duplicacy init -e -c 32 storage_name storage_url
    Then I would have to enter the encryption pass. I used 32 MB following your previous recommendations, since I believe that I cannot modify the average chunk size later. The storage_name would be whatever I want to name it, STORJ or whatever; for the URL I used the “Create Keys for CLI” option on the STORJ web.

Edit: after figuring out that I had to do it in PowerShell instead of cmd, and a few attempts, I managed to init the storage; however, I forgot to move to the directory I wanted to back up and did it on C: instead. Is there a way to change this or do I have to start over? In case of having to start from scratch, what should I delete? I thought that set could maybe allow me to do it, but I don’t see any way to change the repository. I want to avoid C: being the repository assigned by default.

Then I would have to use backup -vss (which I think I have to give root permissions to somehow beforehand), but I still have to check how it works.
Am I missing something important? What worries me the most is that some options are non-modifiable; I’d rather go slowly than have to start all over again later.

Thanks again! Really appreciate such detailed answers.

Yes. Not rot prevention – because you can’t control that; it will happen sooner or later – but rot recovery. Once you discover that a specific file got corrupted, you can restore an uncorrupted version of it.

It is not about where you are, but where your data is. When you are using a conventional centralized provider, like B2 or Wasabi, you pick where (in which datacenter) your data will reside – usually based on geographic proximity to you. But that’s just one datacenter. Even though there should be good security, sprinklers, and whatnot – floods, fires, earthquakes, and maybe even sabotage are still a thing. You don’t want to lose your data in the unlikely event that a datacenter burns in flames or is swallowed by the San Andreas fault. So, many providers offer geo-redundancy, where they keep a copy of your data in more than one geographic location. And charge you more for that.

Storj, on the other hand, stores data distributed across the world by design. So you get this feature for free. If an entire continent sinks, your data will still be recoverable. They have a well-written whitepaper, have a look. It’s an interesting read.

Global options are these: Global Options · gilbertchen/duplicacy Wiki · GitHub. They control application behavior, as opposed to command options, which control the behavior of a specific command. Why there are separate edit boxes for them in the UI – I don’t know. I’d think you should be able to just put them all into one box and let the app figure out which is which. The CLI does it anyway, so why ask the user to separate whites and colors and then plop them into the same CLI? No idea. Maybe some historic reasons.

Yep, that’s perfect. Nope, no requirements. It’s not a “password” per se, as much as it is an “encryption phrase”. I personally prefer to use Diceware, so if I need to type it I don’t have to struggle looking for special symbols. 1Password can generate such passphrases; perhaps Bitwarden does too.

Two things here. Duplicacy will create a hidden folder .duplicacy to keep its own config in the folder you call init in. A common approach is to init in the C:\Users folder, or the C:\Users\you folder, and back up the whole content of Users or you.

Alternatively, you can specify where Duplicacy shall create the preferences directory with the --pref-dir parameter to the init call. This may be preferable, as you can then place it anywhere you want; for example, C:\Users\you\AppData\Duplicacy would be a good location if you plan to back up only your user folder, or C:\ProgramData\Duplicacy if it is system-wide.

You have two options here.

  1. Use the storj:// protocol,
  2. or use the s3:// protocol.

Since this is a home computer on a home network, I suggest using S3. You can create S3 credentials on the Storj web side; they will give you all the parameters you need to specify for the S3 connection, which you will configure following the format described in Storage Backends · gilbertchen/duplicacy Wiki · GitHub under the Amazon S3 section.
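Assuming the s3://region@host/bucket/path format from that wiki page, the init could look roughly like this (my-pc, my-bucket, and the eu1 region label are placeholders; Duplicacy should then ask for the access key and secret key from the S3 credentials):

    duplicacy init -e my-pc s3://eu1@gateway.storjshare.io/my-bucket/duplicacy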

You would need to create a bucket first on the Storj web site. There you have to choose an encryption key for your data. Since Duplicacy encrypts its data anyway it’s kind of redundant, but Storj is always end-to-end encrypted, so you have to pick an encryption passphrase. Create another Diceware passphrase. You will need to use it once to create the S3 credentials. When using the S3 gateway, the gateway has the encryption key, so it’s no longer end-to-end encrypted, but we don’t care – Duplicacy encrypts the data anyway.

Have a look at this – maybe some bits will be useful as a reference: How to use Duplicacy to backup securely to StorJ cloud storage | Curt Warfield

No need to start over. You can just move that hidden .duplicacy folder into the correct directory.

set saves variables like credentials and access keys to the .duplicacy/preferences file. Duplicacy should be able to save them to your Windows keychain, so you probably won’t need to do it manually.

When you create a backup task in your Task Scheduler you would tick an option called “Run as administrator” or something like that. Or, if you want to run it manually in the terminal, run the terminal (or PowerShell, or cmd) with elevated permissions.

Nope, looks like everything is covered.

To summarize:

On the Storj satellite:

  • create a bucket (save the passphrase)
  • create S3 credentials (provide the passphrase, choose which bucket to allow access to, and give full access to that bucket)
  • save the S3 credentials.

On your PC, in the elevated command prompt (to use vss):

  • init duplicacy under C:\Users\ or C:\Users\you, or wherever else,
  • provide the S3 connection parameters in the init string,
  • run duplicacy backup -vss

On your PC, in Task Scheduler,

  • follow the wizard to create a periodic backup task that would cd into the directory Duplicacy was initialized in and run duplicacy backup -vss periodically (see the sketch below).
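A minimal sketch of what that scheduled task could run, assuming duplicacy.exe is on PATH and the repository was initialized in C:\Users\you (the thread count is just an example):

    cd /d C:\Users\you
    duplicacy backup -vss -threads 4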

Then, once this is all working, you can configure another schedule to run duplicacy check – which ensures that all chunks it expects to be on the storage are indeed there – and prune – which will thin the backup history according to a plan like the one discussed in one of the previous messages, so you don’t end up with thousands of revisions after 5 years of hourly backups :).

Edit: If you are going to init Duplicacy under Users or Users\you, I would recommend configuring filters to exclude some files or folders from backup – stuff like Temp folders, browser caches, etc. There is no reason to waste time, bandwidth, and storage backing those up. Include Exclude Patterns · gilbertchen/duplicacy Wiki · GitHub
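As a very rough illustration only – the exact wildcard semantics are on that wiki page – the filters file lives at .duplicacy/filters, with one exclude pattern per line relative to the repository root, e.g. assuming the repository root is C:\Users\you:

    -AppData/Local/Temp/
    -AppData/Local/Microsoft/Windows/INetCache/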


Missed that. Arq has two offerings. You can buy the program and use it with whatever storage you want. You get to use it forever and get a year of updates. That’s what I use. I don’t pay for an update every year. In fact, I did not update for the past three years, and only updated a few months back – because they added a feature I found useful (backup of ephemeral, cloud-only files).
With the other option, you pay about the same amount of money annually, but get 5 licenses and 1TB of included storage. You can buy more storage in 1TB increments. It’s an interesting proposition if you have many computers with not much data each.

This did not work for me – I prefer one license and a cloud of my choice. I have one Mac that syncs data for all users from iCloud, runs Arq, and backs it up to Amazon Deep Archive. All other family members’ Macs, phones, iPads, and whatnot simply keep all data in iCloud.


I was thinking of it as a way to improve performance, for example in the case of a business with multiple subsidiaries on different continents. It is indeed something good to have in any case, and I’ll try to take a look at that paper; I’m curious about some aspects of it.

Bitwarden allows creating passwords without special symbols; I went for an alphanumeric password of around 20 characters, which may be a bit overkill.

About the rest I think I made a few mistakes, and after your explanation I would have to:

  • As I said, I initialized the wrong directory, so I should move the .duplicacy folder from C:/User/myuser to D:/backukfolder. I have 4 disks on my PC: one SSD for the OS and job-related programs, a second SSD as a gaming library, and lastly an HDD RAID0 (D:) which I use to store data. This last one is the one I want to back up. It also holds films and similar objects I don’t need backed up, hence the reason for backing up just one folder; in addition, it simplifies the process by avoiding the need for filters. By moving the .duplicacy folder, would the chunk size and passphrase stay the same? Would D:/backukfolder therefore be considered the default storage?
  • I created the bucket, but I copied the credentials created by using the “API Key” option in the STORJ access area. I’m guessing these would be the STORJ protocol credentials, so I should have used the “S3 Credentials” option instead. From what I know, Duplicacy creates a document in the storage destination, so maybe it would be advisable to delete the bucket and create a new one?

My first idea was to delete the bucket and the .duplicacy folders and just start over, but I wasn’t sure if it created config files elsewhere.

I think I overlooked that option during the setup process.

I find that to be the most appealing option too. What I don’t get is what the benefit of using it is; I mean, it is appealing for someone like me who is not well versed in the use of a CLI and wants something easy to use, which is surely not your case. What am I missing out on?

Yes. Duplicacy runs entirely on the client, and the configuration file is all it needs to be able to connect to the storage and know what to back up.

So I assume you connected Duplicacy to your bucket using the storj://… URL string.

No, you don’t have to delete the bucket. You just need to create new S3 credentials, then open the .duplicacy/preferences file in a text editor and replace the storj://… URL with the new s3://… URL.
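If it helps to orient: from memory, .duplicacy/preferences is a small JSON file and the URL lives in the “storage” field of the storage entry; roughly like this, with other fields omitted and all values being placeholders:

    [
        {
            "name": "default",
            "id": "my-pc",
            "storage": "s3://eu1@gateway.storjshare.io/my-bucket/duplicacy",
            "encrypted": true
        }
    ]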

As long as Duplicacy has access to the storage it will work as if nothing happened. You can move that storage to another provider, a local NAS, whatever, and change the URL in preferences, or simply add a new storage with the new location, and Duplicacy will recognize it and continue as if nothing happened.

No need to delete anything. Of course, you can delete and start over, but why, when all you need is to replace a URL :slight_smile:

To answer your question, yes, the .duplicacy folder is the only place it writes stuff into, for better or worse.

You don’t need this, because on Windows Duplicacy can use the system keychain and store credentials there. It’s more secure than having them in plaintext in the config file.

For me it was three features that Duplicacy is missing, which I hope it will get sooner rather than later. I don’t care much about the UI – but yes, Arq has a very good UI, so it’s a plus.

The features are as follows (most are specific to macOS):

  • Support for archival tier storage. I don’t want to pay for hot storage; I want to use Glacier Deep Archive for my backups. Duplicacy does not support this today.
  • Support for backing up user-mounted filesystems. For example, you may have multiple users logged in, where one user mounted a disk from the server with their credentials and another user did the same with their own credentials. This can be a network filesystem, or a virtual one, it does not matter. I want to be able to back those up. Arq allows that, using a helper process impersonating each user and pumping data to the main process. With Duplicacy this is impossible to achieve.
  • CPU throttling. I’m using a laptop, and often when it has 5% battery charge left I can still work for hours. But if Duplicacy kicks in, it runs at full throttle, instantly sending the laptop to sleep. Arq allows configuring the amount of CPU usage.

These were dealbreakers for me, especially the first one. Otherwise Duplicacy is perfect: it supports Time Machine exclusions, is very fast, and has a powerful, if somewhat steep-learning-curve, exclusion engine.

The last one you can work around with the cpulimit utility. But the first two need to be addressed in the app.

There is a fourth feature, which Arq implemented, that prompted me to pay for an update, and I find it extremely useful. I filed a feature request for Duplicacy here: Support for dataless files


I ended up starting from scratch; I could not find a way in preferences to input the Access Key and the Secret Key. In addition, I had to work a bit on the URL: for some reason it required me to include a region, even though STORJ is by definition decentralized. I ended up using EU1 and it seemed to work.

My next step is to run the backup, where I’ll use -vss and a number of -threads I have yet to decide.
Regarding this aspect, what I understand so far is that an increase in threads would cause a proportional increase in speed, up to the limit of my current bandwidth. Right now I’ve got a symmetric 1 Gbps connection, maybe less since I use a VPN. What is the common approach to determine the optimum number?

I assume that this would be cheaper but with some restriction to retrieve the data, probably not a problem when talking about backups.

In my case this would not be a problem since I’m using a full-tower PC; however, I imagine that the use of resources could slow the PC while the backup is taking place? Is there a way to prevent this? It must be annoying if it happens while I’m doing some heavy-load job or gaming.

To be honest, Arq 7 seems like a good option for someone like me. It would have been a lot less painful at least :rofl:
However, I’ve seen some negative comments in the forum, especially about the Windows version, so at this point I’ll probably just stick to Duplicacy.

Thanks again!

Oh yeah. I have seen it with other services too, like MinIO: you have to provide a region even if the server ignores it.

There are two questions to consider. Do you actually need the backup to go full speed? It should probably be a background process that does not interfere with anything and slowly does its job.

To determine the optimal number — keep increasing the thread count until performance no longer increases. These cloud storage services are designed to work multithreaded. For example, Backblaze provides about 10Mbps per connection. So if you want to upload faster — you have to use multiple threads.

You can use duplicacy benchmark command to test various thread parameters quickly.

I did not realize you have fiber. You can actually try playing with the native Storj integration. Depending on where in the world you are, it may actually provide much better throughput, and your fiber box is less likely to succumb to the many thousands of connections the Storj client will be making. Note, with S3, when you say 10 threads — Duplicacy will upload in 10 threads and there will be about 10 TCP connections to the S3 gateway.

With the Storj native integration each “connection” is parallelized internally by the Storj library (called uplink), so the transfers are inherently parallel. If you specify more threads on top — it effectively multiplies that. This is where consumer cable modems fail. But your fiber box might survive. Worth a shot, if you are after absolute performance. Here are some measurement results: https://www.storj.io/documents/storj-univ-of-edinburgh-report.pdf

I doubt that. You can always set one thread or schedule the backup at night. Windows Task Scheduler, I think, allows programming the machine to wake, run the task, and then go back to sleep.

You should try it. There is a one-month trial.

On one hand, there are negative comments about everything; people never agree on anything.

On the other hand, Arq was a Mac application that was then ported to Windows, so maybe there is some truth to it.

Also, Arq 5 was horrible garbage — perhaps those comments relate to that? Arq 7 is quite awesome.

But I agree, I think Duplicacy will cover your use case pretty well. It performs better, and it’s free for personal use. So as long as you don’t miss any features — it’s a clear winner.


Probably not, maybe just for the initial sync in order to avoid having the PC running for a few days.

Here in Spain it is really rare to find cable modems anymore, maybe in really small towns (<500 inhabitants); almost all installations less than 7-10 years old are fiber. I thought it was kind of standard elsewhere too. However, I’ve got a really basic ZTE router provided by the ISP, and I’m not sure whether it could be a limitation.

I’ll have to check that. I assume that for small changes one thread can be more than enough without impacting performance, therefore avoiding starting the PC at night just to run a backup (probably not healthy hardware-wise). I could manually launch a backup using more threads when a big batch of data is added.

As you suggested, I’ve been playing a bit with both protocols. One thing that confused me when setting up the STORJ protocol is that Duplicacy asked for a passphrase (which I thought was the encryption key) and afterwards for a password for the storage. What is each one for? I actually think the latter is the encryption key, but I did not find anything conclusive.

Probably my method is far from precise, but I’ll leave the results in case someone finds them useful:

From this I would say that for S3 the optimal would be to just use 5 threads; however, I did not test whether increasing the number of threads well above 5 would overcome the drop shown above it.
In the case of STORJ I can’t tell if the router got saturated after 20 or if that’s actually its ceiling. After 25 I tried with 18 and it gave worse results than 15, and the next attempt with 19 simply gave an error: Failed to upload the chunk: uplink: metaclient: rpc: tcp connector failed: rpc: context deadline exceeded.

My conclusion is that I could either go for S3 with 5 threads or go after full performance and use 20 threads with the STORJ protocol. Any advice in this regard?

Really appreciate that you are taking the time to answer everything.

I live in California, in the middle of Silicon Valley, no less. We have Google Fiber and AT&T Fiber in parts of the area, but, of course, not where I live specifically. I only get access to cable, and because it was designed for broadcast, the upstream is horrific by design – the hardware simply does not have enough transmit units. So I have 1000Mbps downstream (they offer up to 2000Mbps) and 30Mbps upstream. That’s the absolute max available. To their credit, they are upgrading the network now to support up to 200Mbps upstream, using a draft of a new standard, as far as I understand, that allows modems to have configurable transceivers that can be dynamically allocated between upstream and downstream, which some equipment vendors seem to already support, and it’s a huge deal. Can’t wait until they finish upgrading in my area. I use my own network equipment, Ubiquiti routers and access points, but the modem is ISP-issued. That beast consumes 20W all the time doing practically nothing, and is not very stable. But it’s more stable than the alternatives, so… I guess I don’t have much choice /rant.

The passphrase is likely to unlock the buckets. All buckets on Storj are always encrypted. When you generate the S3 credentials, the gateway holds the encryption key and gives you S3 credentials instead. But if you use the native Storj client – then you fully control the key.

This is very nice! (Btw, the Discourse forum supports pasting images right into the body of the message. I.e., you take a screenshot into the clipboard and just paste it into the message editor. The forum will upload the image and insert a markdown link.)

A few thoughts:

  • Did you measure this with the benchmark command?
  • Was it with the default chunk size? It would be interesting to see a comparison of default vs 32MB vs 64MB chunk size. I’m wondering, if you specify max chunk size == min chunk size == 64MB, will Duplicacy actually create 64MB files, or will it overshoot slightly? Storj’s segment size is 64MB, so the closer you get to that, the better the performance will be (and the storage slightly cheaper – the per-segment fee is very small but non-zero). You can provide chunk sizes to the benchmark command.
  • I don’t remember if the benchmark command allows specifying how much data to upload – because your connection is fast, and if the amount of data is quite small, the transfer time may have high variance.
  • It’s not entirely clear why there is such a sharp drop with S3 after 5 threads. It’s rather weird.

Yeah, that’s your network equipment having a stroke :). You could start a ping to google.com in a new terminal window, and that would start failing as well.

I suspect that a higher chunk size should provide much better performance with the native Storj integration even with a smaller thread count. It feels like the majority of the time is spent setting up connections, not actually transferring chunks.

Storj has been working on implementing TCP Fast Open to shorten the handshake, but I’m not sure what the state of affairs is on Windows. I think today it only works on Linux. (I’m using FreeBSD and it’s not implemented there either.)

It feels from the numbers that the S3 performance is hindered by something – 6Mbps at 5 threads is quite low. Perhaps it’s a combination of small chunk size exacerbated by high latency to the gateway. Can you ping the gateway to measure latency?
If the gateway is “far” – then it’s another reason to perhaps use the native integration, especially since you are already getting better throughput with it. Increasing the chunk size should let you achieve similar speed with far fewer than 20 threads. But if this was measured with an already large chunk size – then I’m not sure what’s going on.

I’ll try to measure at home again to compare.


I’m very surprised to hear that, to be honest. My parents live in a town of fewer than 3,000 inhabitants and they also have 1 Gbps, symmetric for upload and download. On the other hand, I live in Madrid, where I could get up to 10 Gbps if I wanted; I’ve got 1 Gbps for 30 bucks a month, which is more than I need, to be honest. I would expect things to be much more advanced in the States. I’m glad that at least they are upgrading the network.

And the STORJ passphrase is the same one you set up on the web when creating the bucket? The one where it asks whether you want it to be the same as the project’s or not.

I know! I just did it that way since I did not see any “spoiler” option or something like that. Next time I’ll probably just paste it.

Correct. This morning I did the S3 measurements, varying -upload-threads and -download-threads. In the afternoon I did the same for the STORJ protocol.

I did not apply further instructions regarding chunk size. The storage init was with an average chunk size of 32 as you recommended, nothing else.

How could I know if it is overshooting? From the documentation I see that these can be added to the init command:

   -chunk-size, -c <size>          the average size of chunks (default is 4M)
   -max-chunk-size, -max <size>    the maximum size of chunks (default is chunk-size*4)
   -min-chunk-size, -min <size>    the minimum size of chunks (default is chunk-size/4)

Maybe with -max 64 and -min 64 it can be forced to always use 64 MB chunks.
In the case of the benchmark I think it could be done:

-chunk-count <count>         the number of chunks to upload and download (default to 64)
-chunk-size <size>           the size of chunks to upload and download (in MB, default to 4)

It allows it, but I left it at the default, which is 256 MB.

I believe so too :rofl: After that I could barely browse on Google, so I turned off the router for about 10 seconds to restart it. How could I benchmark extensively without this problem? Do I have to restart it every time?

ping gateway.storjshare.io reports 116 ms on average. As said, I’m using a multi-hop VPN, which could be making it slower.

I could try making some measurements if you think it would be interesting. Just let me know!


That’s a direct consequence of the absolute vastness of the country and low population density – any infrastructure projects are ridiculously hard to finance, including transportation. It’s quite a self-inflicted issue; San Jose, for example, the largest metro area in the South Bay, is zoned for single-family housing exclusively. It makes no sense. The YouTube channel Not Just Bikes has a few videos on this topic.

I’m glad you asked!

What is encrypted in Storj buckets are the objects. Files. The impression that a project or bucket has an associated passphrase is just an illusion, for convenience when managing buckets on the web. Those passphrases don’t mean anything; they are just defaults.

They overhauled the UI recently, because before it was even more confusing – it gave a much stronger impression that it’s the bucket that is encrypted, and that to see the files you decrypt the bucket. This is not true.

You can, in fact, have multiple files encrypted with different passphrases in the same bucket. Then when you list the bucket you will only see the files that can be decrypted with the provided passphrase – or none, if the passphrase is wrong. This is quite a useful feature: for example, if you are developing an app that stores user data, you can use a single bucket and dump data from all users into it, encrypted with user-specific passphrases, and all the security separation is handled automagically.

So, answering your question – the passphrase can be anything. I usually just put random garbage in when I create a bucket – because I’m not going to use that passphrase. I then go generate credentials, and it asks for a passphrase again – that’s the one I’m going to create and save in my password manager, because that’s the encryption key that will be needed to access files uploaded using those credentials.

Does it make sense?

I don’t think it’s in the menu – but this works:

[details="caption"]
 .. content ..

[/details]
meow

No, now you know it’s the limit – so don’t approach it :slight_smile:

Yeah. That’s the reason. 116ms is HUGE. On every piece it spends 116 ms, multiple times!, to establish a TCP connection, then transfers data very quickly and starts over. It’s a huge overhead. I get 15ms to pretty much anywhere; this would explain the difference.

Storj also supports QUIC, but many ISPs apparently block it. The vast majority of traffic happens to be TCP.


I find it pretty interesting, I will check that video. Here in Spain we also have a lot of places that are almost abandoned; the population is quickly accumulating in the big cities.

So the web passphrases are just for accessing the web. The important one would be the one set in Duplicacy, which encrypts the objects uploaded to STORJ – in addition to the passphrase, also set in Duplicacy, for the storage itself, which encrypts everything but on the client side.

Is it cumulative? I mean, 20 threads would be the top for the STORJ protocol, but would doing multiple benchmarks with 19 or 20 threads reach the limit? What precautions should I take to not hit it in daily use?

I’ve tried changing the VPN to a simple one instead of multi-hop; that caused the ping to drop to 31 ms on average. That could be one of the reasons.

What would you do next? Try adjusting the chunk size somehow to use the STORJ protocol, or stick with S3 and do some more benchmarks to see if the reduced ping improves anything?
What intrigues me is why the STORJ protocol does not seem to be affected that much – is it because of the QUIC support you mentioned?

Right. You need to save both. Without either of them you can’t access data.

It’s a limit on the number of concurrent connections your router can handle. It does not matter whether it is one application doing it or another.

It could be multiple factors.

  • the gateway may be a bottleneck, and it’s just one location you send all traffic to.

  • the inherent Storj peer-to-peer parallel nature and the optimizations they made to avoid slow nodes affecting performance, i.e. one of its main selling features :slight_smile: Quoting from their article:

    When you upload or download from the network, file segments are sent one after another. However, the pieces that make the segments are sent in parallel. Much effort has gone into optimizing this process, for instance when you download a file we attempt to grab 39 pieces when only 29 are required eliminating slow nodes (Long Tail Elimination). This is our base parallelism and allows up to 10 nodes to respond slowly without affecting your download speeds.

    When uploading we start with sending 110 erasure-coded pieces per segment in parallel out to the world but stop at 80. This has the same effect as above in eliminating slow nodes (Long Tail Elimination).

    Have a look at this: Hotrodding Decentralized Storage - Product Discussions - Storj Community Forum (official)

On a separate topic, I’ve tried telling Duplicacy to create chunks of size 64MB and the resulting file size is indeed an honest 64MiB = 67108864 bytes. So Duplicacy does not overshoot, which is nice. Using a chunk size of 64MB should make the best use of the network.


I got curious, so I started tests for S3 and native with various thread counts and 64MiB chunks overnight. (It will take some time because of my measly upstream. Will report tomorrow.)

script
#!/usr/local/bin/zsh

# OPTIONS:
#    -file-size <size>            the size of the local file to write to and read from (in MB, default to 256)
#    -chunk-count <count>         the number of chunks to upload and download (default to 64)
#    -chunk-size <size>           the size of chunks to upload and download (in MB, default to 4)
#    -upload-threads <n>          the number of upload threads (default to 1)
#    -download-threads <n>        the number of download threads (default to 1)
#    -storage <storage name>      run the download/upload test against the specified storage


function bench()
{
	(
		echo "$(date): storage: $1, chunk-size=$2, threads=$3"

		./duplicacy_main benchmark \
			-storage $1 \
			-chunk-size $2 \
			-upload-threads=$3 -download-threads=$3 \
		| egrep "(Upload|Download)"
	) | tee -a benchmark.log
}


bench s3 64  1
bench s3 64  2
bench s3 64  4
bench s3 64  8
bench s3 64  16
bench s3 64  32

bench storj 64 1
bench storj 64 2
bench storj 64 4
bench storj 64 8
bench storj 64 16
bench storj 64 32

This is what I got.

Chunk size=64MiB, Connection: 800Mbps/20Mbps (100MBps/2.5MBps), host freebsd

Upload, MBps:

type\threads      1      2      4      8     16     32
s3             1.97   2.17   2.03   2.12   2.17   2.17
native        0.710  0.750  0.777  0.806      ?      ?

Download, MBps

type\threads      1      2      4      8     16     32
s3             11.0   22.2   45.9   56.4   49.0   50.0
native         11.0   16.8   31.4   57.6      ?      ?

It’s rather unexpected.

The question marks are failures because the modem did not survive.

I’m going to do two more experiments.

  • Run the same from a beefy Amazon EC2 instance. I expect to see the same or better results than using the gateway, because the gateway is just an instance in the cloud in the same way. You can, in fact, run your own gateway.
  • Test file transfers with rclone, in case there is something off with the version of uplink Duplicacy is built against.

Perfect, I’ll keep both secured.

So it won’t matter if I’ve been doing other tries before; as long as I don’t reach that number of threads, everything should be okay.

Way better than doing it manually as I was doing. :sweat_smile:

So you’re getting the opposite results from what I did, with STORJ performing worse than S3.
I’ll wait before doing anything else, then.

Thanks!

Yeah. Something fishy is going on. I strongly suspect my modem has something to do with it. I’ll confirm soon.

Another possibility is that a bunch of the optimizations Storj made in uplink came in versions after the one Duplicacy is built against. I’ll rebuild it with the updated version and retest.

So, the plan:

  1. duplicacy bench on amazon cloud
  2. rebuild duplicacy with newer uplink and try from home again
  3. measure performance with rclone simply downloading the same files.

I’ll update later tonight.


Amazon cloud:

Shape: m4.2xlarge
Connection: High, whatever that means,
OS: Amazon Linux 2
Chunk size: 64MiB

Upload, MBps:

type\threads      1      2      4      8     16     32
s3             13.1   18.8   24.1   30.0   31.5   28.2
native         16.8   29.8   29.1   31.0   31.0   31.0

Download, MBps

type\threads      1      2      4      8     16     32
s3             11.3   22.1   48.3   79.2   72.0   77.7
native         11.2   21.4   39.6   70.0   70.0   71.0

A few observations:

  1. The m4.2xlarge instance seems to be bandwidth-limited to about 30MBps upload and 80-ish download. I’m going to change the shape to an instance that explicitly provides 5 gigabit networking and retry. Maybe it was also resource-deprived; I’ll see if I can get an instance with guaranteed resource allocation.
  2. Even one thread gets dramatically better upstream and downstream when not running through a consumer-grade internet connection.

Shape: m4.10xlarge
Connection: 10 Gigabit/s
Zone: us-west-2 (Oregon)
Satellite: us1

Upload, MBps:

type\threads      1      2      4      8     16     32
s3             13.7   18.9   24.9   29.7   31.0   32.5
native         16.2   32.5   58.6  105.2  114.8  137.7

Download, MBps:

type\threads      1      2      4      8     16     32
s3             12.2   24.1   49.0   86.6   87.5   92.5
native         11.6   22.3   43.1   85.0  138.0  234.0

So I guess the takeaway is – S3 is good when resources are scarce (the gateway will be doing all the heavy lifting). Native is good when resources are not a concern and massive performance is desirable.
