Completely new: where to start

Yes. Not rot prevention – because you can’t control that; it will happen sooner or later – but rot recovery. Once you discover that a specific file got corrupted, you can restore an uncorrupted version of it.

It is not about where you are, but where your data is. When you use a conventional centralized provider, like B2 or Wasabi, you pick where (in which datacenter) your data will reside, usually based on geographic proximity to you. But that’s just one datacenter. Even though there should be good security, sprinklers, and whatnot – floods, fires, earthquakes, and maybe even sabotage are still a thing. You don’t want to lose your data in the unlikely event that a datacenter goes up in flames or is swallowed by the San Andreas fault. So, many providers offer geo-redundancy, where they keep a copy of your data in more than one geographic location. And charge you more for that.

Storj, on the other hand, stores data distributed across the world by design, so you get this feature for free. If an entire continent sinks – your data will still be recoverable. They have a well-written whitepaper; have a look. It’s an interesting read.

Global options are these: Global Options · gilbertchen/duplicacy Wiki · GitHub. They control application behavior, as opposed to command options, which control a specific command’s behavior. Why there are separate edit boxes for them in the UI – I don’t know. I’d think you should be able to just put them all into one box and let the app figure out which is which. The CLI does it anyway, so why ask the user to separate whites and colors and then plop them into the same CLI? No idea. Maybe some historic reasons.

Yep, that’s perfect. Nope, no requirements. It’s not a “password” per se, as much as it is an “encryption phrase”. I personally prefer to use Diceware, so if I need to type it – I don’t have to struggle looking for special symbols. 1Password allows you to generate such passphrases; perhaps Bitwarden does too.

Two things here. Duplicacy will create a hidden folder .duplicacy to keep its own config in the folder you call init in. A common approach is to init in the C:\Users folder, or the C:\Users\you folder, and back up the whole content of Users or you.

Alternatively, you can specify where duplicacy shall create the preferences directory with the -pref-dir parameter to the init call. This may be preferable, as you can then place it anywhere you want; for example, C:\Users\you\AppData\Duplicacy would be a good location if you plan to back up only your user folder, or C:\ProgramData\Duplicacy if it is system wide.
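For illustration only, a rough sketch of such an init call – the snapshot id “my-pc”, the path, and the <storage-url> placeholder are mine, not something from this thread; -e asks duplicacy to encrypt the storage:

    duplicacy init -e -pref-dir C:\ProgramData\Duplicacy my-pc <storage-url>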

You have two options here.

  1. Use the storj:// protocol,
  2. or use the s3:// protocol.

Since this is a home computer on a home network, I suggest using s3. You can create s3 credentials on the Storj web side; they will give you all the parameters you need to specify for the s3 connection, which you will configure following the format described in the Storage Backends · gilbertchen/duplicacy Wiki · GitHub under the Amazon S3 section.
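The general shape of that URL, per the wiki’s S3 section, is s3://region@endpoint/bucket/directory. For the Storj-hosted gateway it would look roughly like this (the bucket name and region here are made-up examples):

    s3://eu1@gateway.storjshare.io/my-duplicacy-bucket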

You would need to create a bucket first on the Storj web site. There you have to choose an encryption passphrase for your data. Since duplicacy encrypts its data anyway it’s kind of redundant, but Storj is always end-to-end encrypted, so you have to pick one. Create another Diceware passphrase. You will need to use it once, to create the S3 credentials. When using the S3 gateway, the gateway holds the encryption key, so it’s no longer end-to-end encrypted, but we don’t care – duplicacy encrypts the data anyway.

Have a look at this – maybe some bits will be useful as a reference: How to use Duplicacy to backup securely to StorJ cloud storage | Curt Warfield

No need to start over. You can just move that hidden .duplicacy folder into the correct directory.

set allows you to save variables like credentials and access keys to the .duplicacy/preferences file. Duplicacy should be able to save them to your Windows keychain, so you probably won’t need to do it manually.
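If you ever do want to set them by hand, a hedged sketch (the key names follow the convention from the Managing Passwords wiki page; the values are obviously placeholders):

    duplicacy set -key s3_id -value YOUR_ACCESS_KEY
    duplicacy set -key s3_secret -value YOUR_SECRET_KEY
    duplicacy set -key password -value "your storage encryption phrase"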

When you create a backup task in your Task Scheduler you would tick an option called “Run with highest privileges” or something like this. Or, if you want to run it manually in the terminal, run the terminal (PowerShell or cmd) with elevated permissions.

Nope, looks like everything is covered.

To summarize:

On the Storj satellite:

  • create a bucket (save the passphrase),
  • create s3 credentials (provide the passphrase, choose which bucket to allow access to, and give full access to that bucket),
  • save the s3 credentials.

On your PC, in an elevated command prompt (to use VSS):

  • init duplicacy under C:\Users\ or c:\Users\you, or wherever else,
  • Provide the s3 connection parameters to the init string,
  • run duplicacy backup -vss

On your PC, in Task Scheduler,

  • follow the wizard to create a periodic backup task that would CD into the directory duplicacy was initialized in and run duplicacy backup -vss periodically (see the sketch below).
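If you prefer to script it rather than click through the wizard, a hedged sketch with schtasks (the task name, time, and duplicacy path are placeholders):

    schtasks /Create /TN "Duplicacy Backup" /SC DAILY /ST 02:00 /RL HIGHEST ^
        /TR "cmd /c cd /d C:\Users && C:\Tools\duplicacy.exe backup -vss"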

Then, once this is all working, you can configure another schedule to run duplicacy check – which ensures that all chunks it expects to be on the storage are indeed there – and prune – which will thin the backup history according to some plan discussed in one of the previous messages, so you don’t end up with thousands of revisions after 5 years of hourly backups :).
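As a sketch only (the -keep schedule below is just an illustration, not the plan from the earlier messages: it keeps one revision per day after 7 days, per week after 30, per month after 180, and drops everything older than two years):

    duplicacy check
    duplicacy prune -keep 0:730 -keep 30:180 -keep 7:30 -keep 1:7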

Edit. If you are going to init duplicacy under Users or Users\you, I would recommend configuring filters to exclude some files or folders from backup – stuff like Temp folders, browser caches, etc. There is no reason to waste time, bandwidth, and storage backing those up. Include Exclude Patterns · gilbertchen/duplicacy Wiki · GitHub
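A hedged example of what .duplicacy/filters could contain if the repository root is C:\Users – the folder names below are typical Windows cache locations, adjust them to what you actually have; patterns ending in / are meant to match directories:

    # exclude per-user temp and browser cache directories
    -*/AppData/Local/Temp/
    -*/AppData/Local/Microsoft/Windows/INetCache/
    -*/AppData/Local/Google/Chrome/User Data/*/Cache/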


Missed that. Arq has two offerings. You can buy the program and use it with whatever storage you want. You get to use it forever and get a year of updates. That’s what I use. I don’t pay for updates every year. In fact, I did not update for the past three years, and only updated a few months back – because they added a feature I found useful (backup of ephemeral, cloud-only files).
The other option: you pay about the same amount of money annually, but get 5 licenses and 1TB of included storage. You can buy more storage in 1TB increments. It’s an interesting proposition if you have many computers, each with not much data.

This did not work for me – I prefer one license and a cloud of my choice. I have one Mac that syncs data for all users from iCloud, runs Arq, and backs it up to Amazon Deep Archive. All the other family members’ Macs, phones, iPads, and whatnot simply keep all their data in iCloud.


I was thinking of it as a way to improve performance, for example in the case of a business with multiple subsidiaries on different continents. It is indeed something good to have in any case, and I’ll try to take a look at that paper; I’m curious about some aspects of it.

Bitwarden allows you to create a passphrase without including special symbols; I went for an alphanumeric password of around 20 characters, which may be a bit overkill.

About the rest I think I made a few mistakes, and after your explanation I would have to:

  • As I said, I ran init in the wrong directory; I should move the .duplicacy folder from C:/User/myuser to D:/backukfolder. I have 4 disks on my PC: first one SSD for the OS and job-related programs, a second SSD as a gaming library, and lastly an HDD RAID0 (D:) which I use to store data. This last one is the one I want to back up. Here I also have films and similar objects I don’t need backed up, hence the reason for backing up just one folder; in addition, it simplifies the process by avoiding the need for filters. By moving the .duplicacy folder, would the chunk size and passphrase stay the same? Will D:/backukfolder therefore be considered the default storage?
  • I created the bucket, but I copied the credentials created using the “API Key” option in the STORJ access area. I’m guessing those would be the STORJ protocol credentials, so I should have used the “S3 Credentials” option instead. From what I know duplicacy creates a config file in the storage destination, so maybe it would be advisable to delete the bucket and create a new one?

My first idea was to delete the bucket and the .duplicacy folder and just start over, but I wasn’t sure if it created config files elsewhere.

I think I overlooked that option during the setup process.

I find that to be the most appealing option too. What I don’t get is what the benefit of using it is. I mean, it is appealing for someone like me who is not well versed in the CLI and wants something easy to use, which is surely not your case. What am I missing?

Yes. Duplicacy runs entirely on the client, and the configuration file is all it needs to be able to connect to the storage and know what to back up.

So I assume you connected duplicacy to your bucket using a storj://… url string.

No, you don’t have to delete the bucket. You just need to create new S3 credentials, then open the .duplicacy/preferences file in a text editor and replace the storj://… url with the new s3://… url.
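For reference, each storage entry in .duplicacy/preferences is a small JSON object; trimmed down, it looks roughly like this (the values are made up and a few other fields are omitted) – the “storage” field is the one to edit:

    [
        {
            "name": "default",
            "id": "my-pc",
            "storage": "s3://eu1@gateway.storjshare.io/my-bucket",
            "encrypted": true
        }
    ]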

As long as duplicacy has access to the storage it will work as if nothing happened. You can move that storage to another provider, a local NAS, whatever, and change the URL in preferences, or simply add a new storage with the new location, and duplicacy will recognize it and continue as if nothing happened.

No need to delete anything. Of course, you can delete and start over, but why, when all you need is to replace a URL :slight_smile:

To answer your question, yes, .duplicacy folder is the only thing it writes stuff into, for better or worse.

You don’t need this, because on Windows duplicacy can use the system keychain and store credentials there. It’s more secure than having them in plaintext in the config file.

For me it was three features that duplicacy is missing, but I hope will get sooner rather than later. I don’t care much about the UI – but yes, Arq has a very good UI, so it’s a plus.

The features are as follows (most are specific to macOS):

  • Support for Archival tier storage. I don’t want to pay for hot storage, I want to use Glacier Deep Archive for my backups. Duplicacy does not support this today.
  • Support for backing up user-mounted filesystems. For example, if you have multiple users logged in, and one of the users mounted a disk from the server with their credentials, and another user did the same with their own credentials. This can be a network filesystem, or a virtual one, it does not matter. I want to be able to back those up. Arq allows you to do that using a helper process impersonating each user and pumping data to the main process. With duplicacy this is impossible to achieve.
  • CPU throttling. I’m using a laptop, and often when it’s at 5% battery charge left, I can work for hours. But if duplicacy kicks in, it runs at full throttle, instantly sending the laptop to sleep. Arq allows you to configure the amount of CPU usage.

These were dealbreakers for me, especially the first one. Otherwise duplicacy is perfect: it supports Time Machine exclusions, is very fast, and has a powerful, if somewhat steep-learning-curve, exclusion engine.

The last one you can work around with the cpulimit utility. But the first two need to be addressed in the app.
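For completeness, a hedged sketch of that workaround (assuming the cpulimit utility is installed, e.g. via Homebrew; the limit is a percentage of one core):

    cpulimit --exe duplicacy --limit 50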

There is a fourth feature, that Arq implemented, that prompted me to pay for update, and I find it extremely useful. I filed a feature request for Duplicacy here: Support for dataless files


I ended up starting from scratch; I could not find in preferences a way to input the Access Key and the Secret Key. In addition, I had to work a bit around the URL: for some reason it required me to include a region, even though STORJ is by definition decentralized. I ended up using EU1 and it seemed to work.

My next step is to run the backup, where I’ll use -vss and a number of -threads I have yet to decide.
Regarding this aspect, what I understand so far is that an increase in threads would cause a proportional increase in speed, up to the limit of my current bandwidth. Right now I’ve got a symmetric 1 Gbps connection, maybe less since I use a VPN. What is the common approach to determine the optimum number?

I assume that this would be cheaper but with some restrictions on retrieving the data, probably not a problem when talking about backups.

In my case this would not be a problem since I’m using a full tower PC; however, I imagine that the use of resources could slow the PC while the backup is taking place? Is there a way to prevent this? It must be annoying if it happens while I’m doing some heavy-load job or gaming.

To be honest, Arq7 seems like a good option for someone like me. It would have been a lot less painful, at least :rofl:
However, I’ve seen some negative comments in the forum, especially about the Windows version, so at this point I’ll probably just stick with Duplicacy.

Thanks again!

Oh yea. I have seen it with other services too, like Minio: you have to provide a region even if the server ignores it.

There are two questions to consider. Do you actually need the backup to go full speed? It should probably be a background process that does not interfere with anything and slowly does its job.

To determine the optimal number — keep increasing the thread count until performance no longer increases. These cloud storage services are designed to work multithreaded. For example, Backblaze provides about 10Mbps per connection. So if you want to upload faster — you have to use multiple threads.

You can use the duplicacy benchmark command to test various thread parameters quickly.
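For example, to compare a couple of thread counts without running a real backup (these flags are the ones listed in the benchmark command’s options):

    duplicacy benchmark -upload-threads 5 -download-threads 5
    duplicacy benchmark -upload-threads 10 -download-threads 10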

I did not realize you have fiber. You can actually try playing with the native storj integration. Depending on where in the world you are, it may actually provide much better throughput, and your fiber box is less likely to succumb to the many thousands of connections the storj client will be making. Note, with s3, when you say 10 threads — duplicacy will upload in 10 threads and there will be about 10 TCP connections to the s3 gateway.

With the storj native integration each “connection” is parallelized internally by the storj library (called uplink), so the transfers are inherently parallel. If you specify more threads on top — it effectively multiplies that. This is where consumer cable modems fail. But your fiber box might survive. Worth a shot, if you are after absolute performance. Here are some measurement results: https://www.storj.io/documents/storj-univ-of-edinburgh-report.pdf

I doubt that. You can always set one thread or schedule the backup at night. Windows Task Scheduler, I think, allows you to program the machine to wake, run the task, and then go back to sleep.

You should try it. There is one month trial.

On one hand, there are negative comments about everything; people never agree on anything.

On the other hand, Arq was a Mac application and then ported to Windows, so maybe there is some truth to it.

Also Arq5 was horrible garbage — perhaps those comments relate to that? Arq7 is quite awesome.

But I agree, I think duplicacy will cover your use case pretty well. It performs better, and is free for personal use. So as long as you don’t miss any features — it’s a clear winner.


Probably not, maybe just for the initial sync in order to avoid having the PC running for a few days.

Here in Spain it is really rare to find cable modems anymore, maybe in really small towns (<500 inhabitants); almost all installations less than 7-10 years old are fiber. I thought it was kind of standard elsewhere too. However, I’ve got a really basic ZTE router provided by the ISP, and I’m not sure whether it could be a limitation.

I’ll have to check that. I assume that for small changes one thread can be more than enough without impacting performance, therefore avoiding starting the PC at night just to run a backup (probably not healthy hardware-wise). I could manually launch a backup using more threads when a big batch of data is added.

As you suggested, I’ve been playing a bit with both protocols. One thing that confused me when setting up the STORJ protocol is that duplicacy asked for a passphrase (which I thought was the encryption key) and afterwards for a password for the storage. What is each one for? I actually think the latter is the encryption key, but I did not find anything conclusive.

Probably my method is far from precise, but I’ll leave the results in case someone finds them useful:

From this I would say that for S3 the optimal would be to just use 5 threads; however, I did not test whether increasing the number of threads well above 5 would overcome the drop shown there.
In the case of STORJ I can’t tell if the router got saturated after 20 or if that’s actually its ceiling. After 25 I tried 18 and it gave worse results than 15; the next attempt with 19 simply gave an error: Failed to upload the chunk: uplink: metaclient: rpc: tcp connector failed: rpc: context deadline exceeded.

My conclusion is that I could either go for S3 with 5 threads or go after full performance and use 20 threads with the STORJ protocol. Any advice in this regard?

Really appreciate that you are taking the time to answer everything.

I live in California, in the middle of Silicon Valley, no less. We have Google Fiber, AT&T Fiber in parts of the area, but, of course, not where I live specifically. I only get access to cable, and because it was designed for broadcast, the upstream is horrific by design. The hardware simply does not have enough transmit units. So I have 1000Mbps downstream (they offer up to 2000Mbps) and 30Mbps upstream. It’s the absolute max available. To their credit, they are upgrading the network now to support up to 200Mbps upstream, using a draft of a new standard, as far as I understand, that allows modems to have configurable transceivers that can be dynamically allocated between upstream and downstream, which some equipment vendors seem to already support, and it’s a huge deal. Can’t wait until they finish upgrading in my area. I use my own network equipment, Ubiquiti routers and access points, but the modem is ISP-issued. That beast consumes 20W all the time doing practically nothing, and is not very stable. But it’s more stable than the alternatives, so… I guess I don’t have much choice /rant.

The passphrase is likely for unlocking the buckets. All buckets on storj are always encrypted. When you generate the S3 credentials, the gateway holds the encryption key and gives you S3 credentials instead. But if you use the native storj client – then you fully control the key.

This is very nice! (Btw, discourse forum supports pasting images right into the body of the message. I.e. you take a screenshot into a clipboard and just paste it into the message editor. The forum will upload the image and insert markdown link.)

Few thoughts:

  • Did you measure this with the benchmark command?
  • Was it with the default chunk size? It would be interesting to see a comparison of default vs 32MB vs 64MB chunk size. I’m wondering, if you specify max chunk size == min chunk size == 64MB, will duplicacy actually create 64MB files, or will it overshoot slightly? The Storj segment size is 64MB, so the closer you get to that the better the performance will be (and slightly cheaper storage – the per-segment fee is very small but non-zero). You can provide chunk sizes to the benchmark command.
  • I don’t remember if the benchmark command allows you to specify how much data to upload – because your connection is fast, and if the amount of data is quite small, the transfer time may have high variance.
  • It’s not entirely clear why there is such a sharp drop with s3 after 5 threads. It’s rather weird.

Yeah, that’s your network equipment having a stroke :). You could start a ping to google.com in a new terminal window, and it would start failing as well.

I suspect that a higher chunk size should provide much better performance with the native storj integration even with a smaller thread count. It feels like the majority of time is spent setting up connections, and not actually transferring chunks.

Storj has been working on implementing TCP FastOpen to shorten the handshake, but I’m not sure what the state of affairs is on Windows. I think today it only works on Linux. (I’m using FreeBSD and it’s not implemented there either.)

It feels from the numbers that S3 performance is hindered by something – 6Mbps at 5 threads is quite low. Perhaps it’s a combination of small chunk size exacerbated by high latency to the gateway. Can you ping the gateway to measure latency?
If the gateway is “far” – then that’s another reason to perhaps use the native integration, especially since you are already getting better throughput. Increasing the chunk size should let you achieve similar speed with far fewer than 20 threads. But if this was measured with an already large chunk size – then I’m not sure what’s going on.

I’ll try to measure at home again to compare.


I’m very surprised to hear that, to be honest. My parents live in a town of fewer than 3000 inhabitants and they also have 1 Gbps, symmetric for upload and download. On the other hand, I live in Madrid, where I could get up to 10 Gbps if I wanted; I’ve got 1 Gbps for 30 bucks a month, which is more than I need, to be honest. I would expect things to be much more advanced in the States. I’m glad that at least they are upgrading the network.

And the STORJ passphrase is the same one you set up on the web when creating the bucket? The one where it asks whether you want it to be the same as the project’s or not.

I know! I just did it that way since I did not see any “spoiler” option or something like that. Next time I’ll probably just paste it.

Correct, this morning I did the S3 measurements, varying the -upload-threads and -download-threads. In the afternoon I did the same for the STORJ protocol.

I did not apply any further settings regarding chunk size. The storage was initialized with an average chunk size of 32MB, as you recommended; nothing else.

How could I know if it is overshooting? From the documentation I see that the init command accepts:

   -chunk-size, -c <size>          the average size of chunks (default is 4M)
   -max-chunk-size, -max <size>    the maximum size of chunks (default is chunk-size*4)
   -min-chunk-size, -min <size>    the minimum size of chunks (default is chunk-size/4)

Maybe with -max 64 and -min 64 it can be forced to always use 64 MB chunks.
In the case of the benchmark I think it can be done with:

-chunk-count <count>         the number of chunks to upload and download (default to 64)
-chunk-size <size>           the size of chunks to upload and download (in MB, default to 4)

It allows it, but I left it at the default, which is 256 MB.

I believe so too :rofl: After that I could barely browse Google, so I turned off the router for about 10 seconds to restart it. How could I benchmark it extensively without this problem? Would I have to restart it every time?

ping gateway.storjshare.io reports 116 ms on average. As said, I’m using a multi-hop VPN, which could be making it slower.

I could try making some measures if you think it can be interesting. Just let me know!


That’s a direct consequence of the absolute vastness of the country and low population density – any infrastructure projects are ridiculously hard to finance, including transportation. It’s quite a self-inflicted issue; San Jose, for example, the largest metro area in the South Bay, is zoned for single-family housing exclusively. It makes no sense. The YouTube channel Not Just Bikes has a few videos on this topic.

I’m glad you asked!

What is encrypted in storj buckets are objects. Files. The notion that a project or a bucket has an associated passphrase is just an illusion, a convenience for managing buckets on the web. Those passphrases don’t mean anything; they are just defaults.

They overhauled the UI recently, because before it was even more confusing – it gave a much stronger impression that it’s the bucket that is encrypted, and that to see the files you decrypt the bucket. This is not true.

You can, in fact, have multiple files encrypted with different passphrases in the same bucket. Then when you list the bucket – you will only see the files that can be decrypted by the provided passphrase. Or none, if the passphrase is wrong. This is quite a useful feature: for example, if you are developing an app that stores user data – you can use a single bucket and dump data from all users into it, encrypted with user-specific passphrases, and all the security separation is handled automagically.

So, answering your question – the passphrase can be anything. I usually just put random garbage in when I create a bucket – because I’m not going to use that passphrase. I then go generate credentials, and it asks for a passphrase again – that’s the one I’m going to create and save in my password manager. Because that’s the encryption key that will be needed to access the files uploaded using those credentials.

Does it make sense?

I don’t think it’s in the menu – but this works:

[details="caption"]
 .. content ..

[/details]
meow

No, now you know it’s the limit – so don’t approach it :slight_smile:

Yeah. That’s the reason. 116ms is HUGE. On every piece it spends 116 ms, multiple times!, to establish a TCP connection, and then transfers data very quickly and starts over. It’s a huge overhead. I get 15ms to pretty much anywhere; this would explain the difference.

Storj also supports QUIC, but many ISPs apparently block it. The vast majority of traffic happens to be TCP.


I find it pretty interesting, I will check that video. Here in Spain we also have a lot of places that are almost abandoned; the population is quickly concentrating in the big cities.

Therefore the web passphrases are just for accessing the web. The important one would be the one set in duplicacy, which would encrypt the objects uploaded to STORJ. In addition to the passphrase also set in duplicacy for the storage itself, which would encrypt everything but on the client side.

Is it cumulative? I mean, 20 threads would be the top for the STORJ protocol, but would doing multiple benchmarks with 19 or 20 threads reach the limit? What precautions should I take to not hit it in daily use?

I’ve tried changing the VPN to a simple one instead of multi-hop; that caused the ping to drop to 31 ms on average. That could be one of the reasons.

What would you do next? Try adjusting the chunk size somehow to use the STORJ protocol, or stick to S3 and do some more benchmarks to see if the reduced ping improves something?
What intrigues me is why the STORJ protocol does not seem to be affected that much; is it because of the QUIC support you mentioned?

Right. You need to save both. Without either of them you can’t access data.

It’s a limit on the number of concurrent connections your router can handle. It does not matter whether it is one application doing it or another.

It could be multiple factors.

  • the gateway may be a bottleneck, and it’s just one location you send all traffic to.

  • the inherent storj peer-to-peer parallel nature and the optimizations they made to avoid slow nodes affecting performance, i.e. one of its main selling features :slight_smile: Quoting from their article:

    When you upload or download from the network, file segments are sent one after another. However, the pieces that make the segments are sent in parallel. Much effort has gone into optimizing this process, for instance when you download a file we attempt to grab 39 pieces when only 29 are required eliminating slow nodes (Long Tail Elimination). This is our base parallelism and allows up to 10 nodes to respond slowly without affecting your download speeds.

    When uploading we start with sending 110 erasure-coded pieces per segment in parallel out to the world but stop at 80. This has the same effect as above in eliminating slow nodes (Long Tail Elimination).

    Have a look at this: Hotrodding Decentralized Storage - Product Discussions - Storj Community Forum (official)

On a separate topic, I’ve tried telling duplicacy to create chunks of size 64 and the resulting file size is indeed an honest 64MiB = 67108864 bytes. So, duplicacy does not overshoot, which is nice. Using a chunk size of 64 shall make the best use of the network.
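Presumably that is done at init time with the flags quoted above, along these lines (snapshot id and URL are placeholders):

    duplicacy init -e -c 64M -min 64M -max 64M my-pc s3://eu1@gateway.storjshare.io/my-bucket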


I got curious, so I started tests for s3 and native with various thread counts and 64MiB chunks overnight. (It will take some time because of my measly upstream. Will report tomorrow.)

script
#!/usr/local/bin/zsh

# OPTIONS:
#    -file-size <size>            the size of the local file to write to and read from (in MB, default to 256)
#    -chunk-count <count>         the number of chunks to upload and download (default to 64)
#    -chunk-size <size>           the size of chunks to upload and download (in MB, default to 4)
#    -upload-threads <n>          the number of upload threads (default to 1)
#    -download-threads <n>        the number of download threads (default to 1)
#    -storage <storage name>      run the download/upload test agaist the specified storage


function bench()
{
	(
		echo "$(date): storage: $1, chunk-size=$2, threads=$3"

		./duplicacy_main benchmark \
			-storage $1 \
			-chunk-size $2 \
			-upload-threads=$3 -download-threads=$3 \
		| egrep "(Upload|Download)"
	) | tee -a benchmark.log
}


bench s3 64  1
bench s3 64  2
bench s3 64  4
bench s3 64  8
bench s3 64  16
bench s3 64  32

bench storj 64 1
bench storj 64 2
bench storj 64 4
bench storj 64 8
bench storj 64 16
bench storj 64 32

This is what I got.

Chunk size=64MiB, Connection: 800Mbps/20Mbps (100MBps/2.5MBps), host freebsd

Upload, MBps:

| type\threads | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| s3 | 1.97 | 2.17 | 2.03 | 2.12 | 2.17 | 2.17 |
| native | 0.710 | 0.750 | 0.777 | 0.806 | ? | ? |

Download, MBps

| type\threads | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| s3 | 11.0 | 22.2 | 45.9 | 56.4 | 49.0 | 50.0 |
| native | 11.0 | 16.8 | 31.4 | 57.6 | ? | ? |

It’s rather unexpected.

The question marks are failures because modem did not survive.

I’m going to do two more experiments.

  • Run the same from a beefy Amazon EC2 instance. I expect to see the same or better results than using the gateway, because the gateway is just an instance in the cloud in the same way. You can, in fact, run your own gateway.
  • And test file transfers with rclone, in case there is something off with the version of uplink that duplicacy is built against.

Perfect, I’ll keep both secured.

So it won’t matter that I’ve been doing other tries before; as long as I don’t reach that number of threads everything should be okay.

Way better than doing it manually as I was doing. :sweat_smile:

So you’re getting just the opposite result from what I did, with STORJ performing worse than S3.
I’ll wait before doing anything else then.

Thanks!

Yeah. Something fishy is going on. I strongly suspect my modem has something to do with it. I’ll confirm soon.

Another possibility is that a bunch of the optimizations storj made in uplink landed in versions after the one duplicacy is built against. I’ll rebuild it with the updated version and retest.

So, the plan:

  1. duplicacy bench on amazon cloud
  2. rebuild duplicacy with newer uplink and try from home again
  3. measure performance with rclone simply downloading the same files.

I’ll update later tonight.


Amazon cloud:

Shape: m4.2xlarge
Connection: High, whatever that means,
OS: Amazon Linux 2
Chunk size: 64MiB

Upload, MBps:

| type\threads | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| s3 | 13.1 | 18.8 | 24.1 | 30.0 | 31.5 | 28.2 |
| native | 16.8 | 29.8 | 29.1 | 31.0 | 31.0 | 31.0 |

Download, MBps

| type\threads | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| s3 | 11.3 | 22.1 | 48.3 | 79.2 | 72.0 | 77.7 |
| native | 11.2 | 21.4 | 39.6 | 70.0 | 70.0 | 71.0 |

Few observations I see:

  1. The m4.2xlarge instance seems to be bandwidth-limited to somewhere around 30MBps upload and 80-ish download. I’m going to change the shape to an instance that explicitly provides a 5 gigabit network and retry. Maybe it was also resource-deprived. I’ll see if I can get an instance with guaranteed resource allocation.
  2. Even one thread gets dramatically better upstream and downstream when not running through a consumer-grade internet connection.

Shape: m4.10xlarge
Connection: 10 Gigabit/s
Zone: us-west-2 (Oregon)
Satellite: us1

Upload, MBps:

| type\threads | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| s3 | 13.7 | 18.9 | 24.9 | 29.7 | 31.0 | 32.5 |
| native | 16.2 | 32.5 | 58.6 | 105.2 | 114.8 | 137.7 |

Download, MBps:

| type\threads | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| s3 | 12.2 | 24.1 | 49.0 | 86.6 | 87.5 | 92.5 |
| native | 11.6 | 22.3 | 43.1 | 85.0 | 138.0 | 234.0 |

So I guess the takeaway is – s3 is good when resources are scarce (the gateway will be doing all the heavy lifting). Native is good when resources are not a concern and massive performance is desirable.


If I have understood properly, these results were obtained using a professional-grade internet connection, where the STORJ protocol seems to scale much better than S3. However, they would not be representative of a domestic setup.

Then would you recommend using S3 for a consumer-grade installation? In my personal case, and noting that the benchmark I did was kind of limited, apparently I could use STORJ to outperform S3 when using more than 15 threads for download, and in almost every case for upload. However, maybe this could change if the chunk size can be strictly set to 64MB.
Anyway, I’m not sure what impact this would have on PC/router requirements and stress while running.

I’m not sure what professional grade is in this context, but it was a low-latency connection indeed. It seems the performance, regardless of the endpoint, is strongly affected by latency, as expected.

I would recommend native integration on fiber connection, and S3 otherwise.

But you have a kind of interesting situation: you have a fiber connection, but traffic goes through a high latency VPN.

I don’t know your network environment and the reasons for using a VPN, so I cannot suggest how to configure your networking to bypass the VPN for duplicacy traffic. Probably a better approach would be to only send traffic over the VPN that actually needs to go to the VPN network, as opposed to everything. That 100ms latency affects everything, not just backup.

The native backend will create an order of magnitude more connections, as described in that forum post on hotrodding storage. That should be fine for most decent routers, but some home routers may struggle.
