Completely new: where to start

edgardiego · 15 September 2023 19:14

Hello everyone,

I’m completely new to the world of backups, and I’ve started to become concerned about the security of my data. My use is purely domestic: I have approximately 500GB of data so far, primarily photos, videos, and some documents. Up until now, I’ve been using FreeFileSync to sync my HDD to an external SSD and to OneDrive via the folder created by the desktop app. However, I realized that this setup doesn’t protect me from data corruption and other potential failures. So, I decided to purchase a Duplicacy license, mainly because having a GUI would simplify things for me.

My plan is to continue doing what I was doing with FreeFileSync, but with the added benefits of encryption and version history that Duplicacy provides. However, after some initial tests, I’ve noticed that the backup and recovery speeds are quite slow. I suspect this might be due to a configuration mistake on my part. I’m completely lost when it comes to aspects like how many threads I should use and other configuration settings.

Where should I start in terms of optimizing Duplicacy for my needs? Besides the GUI user guide, are there other resources where I can learn more?

Any advice is more than welcome.
Thank you for your help!

saspus · 15 September 2023 20:24

Backup to and from OneDrive can be slow, and you can’t use more than a couple of threads to attempt to increase throughput through concurrency.

I would recommend backing up to a cloud storage provider — like Backblaze or Storj or Wasabi, where you only pay for what you use and performance is pretty much limited by your upstream bandwidth.

You may also want to increase average chunk size, especially if your data is photos or videos.

On the other hand, neither photos nor videos benefit from compression, deduplication, or versioning, so perhaps you can simply use rclone to upload those to a low-cost coldline or archival storage with object lock enabled, so that nobody can delete or modify them, and use duplicacy only for your ever-changing documents, which evolve and will benefit from versioning and deduplication. Because this will be a much smaller amount of data, the cost difference between various providers will be negligible.

In summary, with just 500GB of data, most of which is immutable media, I would:

copy media and other immutable data to a low cost Amazon bucket, including Glacier or even Glacier Deep Archive, using any tool of your choice, like rclone or Mountain Duck
back up the rest with duplicacy, either also to Amazon cold-line storage or to any other provider like Backblaze B2, Wasabi, or STORJ (the latter is currently my favourite, due to cost, built-in geo-redundancy by design, and inherent performance. For other storage providers performance drops the further you are from a datacenter. Storj is built distributed)

If on the other hand you don’t want to micromanage everything, you can back up everything with duplicacy to STORJ (or Backblaze, or Wasabi). Costs start at $0.004/GB/month (about $4/TB/month), so for your dataset it will be a couple of bucks a month.

edgardiego · 15 September 2023 22:32

Thank you for your quick response.

I’m uncertain about how to increase the chunk size in Duplicacy, so I’ll need to explore that further. Regarding the number of threads, I’m currently using the default setting of 4, and I’m unsure if increasing it would lead to noticeable speed improvements.

Given that static media (even the documents I’m keeping are almost never modified) may not benefit significantly from compression and versioning, is there any valid reason not to continue using FreeFileSync? (It checks for differences between directories and simply copies over what’s necessary.) I initially thought that versioning could be useful in case of data corruption, but as a newcomer to this, I’m not entirely sure if this is a valid concern or just a misconception on my part.

I’ll investigate Storj as well. For now, I’m considering uploading my data all at once until I gain a better understanding of the process.

Your assistance is greatly appreciated.

saspus · 16 September 2023 01:44

It’s a parameter to the init command, but there are downsides to increasing the size too: less efficient space utilization, depending on the nature of your data.

With some remotes, such as Storj, you want average chunk size to be close to 64MB. Default duplicacy chunk size is 4MB.

For my backups I set average chunk size to 32MB and this seems to work fairly well.
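To get a feel for what the average chunk size means in practice, here is a rough sketch that estimates how many chunk files a 500GB dataset turns into at 4MB, 32MB and 64MB. It assumes every chunk is exactly the average size and ignores deduplication and compression, so treat the numbers as ballpark only. The helper name chunk_count is made up for this illustration.

```python
GB = 1000**3   # decimal gigabyte, as storage providers usually bill
MIB = 1024**2  # binary megabyte, assumed here for chunk sizes

def chunk_count(total_bytes: int, avg_chunk_bytes: int) -> int:
    """Approximate number of chunks, rounding up (assumes uniform chunk size)."""
    return -(-total_bytes // avg_chunk_bytes)

if __name__ == "__main__":
    dataset = 500 * GB
    for size_mib in (4, 32, 64):
        n = chunk_count(dataset, size_mib * MIB)
        print(f"{size_mib:>2} MiB average chunk -> about {n:,} chunk files")
```

Fewer, larger chunks mean fewer files for the storage to list and fewer per-file overheads, at the price of somewhat less effective deduplication.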

With OneDrive? It will actually get slower, due to throttling and back-off timeouts. With other object storage providers — it will scale linearly. I.e. 10 threads would be 2x faster than 5. But I agree, with backup this may not be an issue — as long as all data generated daily manages to get uploaded within 24 hours — the speed is good enough.
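A back-of-the-envelope model of the scaling described above, assuming throughput grows linearly with thread count until it hits your upstream bandwidth. The per-thread and uplink figures are invented for illustration; measure your own connection before drawing conclusions.

```python
def effective_mbps(threads: int, per_thread_mbps: float, uplink_mbps: float) -> float:
    """Linear scaling with threads, capped by the upstream link (simplified model)."""
    return min(threads * per_thread_mbps, uplink_mbps)

def upload_hours(data_gb: float, mbps: float) -> float:
    """Hours needed to upload data_gb (decimal GB) at mbps megabits per second."""
    bits = data_gb * 8 * 1000**3
    return bits / (mbps * 1_000_000) / 3600

if __name__ == "__main__":
    # Hypothetical numbers: 2 Mbit/s per thread, 30 Mbit/s uplink.
    for threads in (2, 5, 10, 20):
        rate = effective_mbps(threads, per_thread_mbps=2, uplink_mbps=30)
        print(f"{threads:>2} threads: {rate:.0f} Mbit/s, "
              f"initial 500GB upload in about {upload_hours(500, rate):.0f} hours")
```

This model does not cover throttling services like OneDrive, where adding threads can make things slower rather than faster.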

I’m actually surprised OneDrive tolerates 4 threads from you: when I was testing it few years ago I could only use max 2 threads before throttling would kick in. Same was with Dropbox. Google drive on the other hand did not mind this kind of abuse at all.

Have a look at this recent thread: Onedrive as storage?

Yes, exactly the concern you quoted: if corruption occurs, it will be replicated to the good copies. That’s the nature and purpose of sync— diligent data replication.

Instead, you want slightly different behavior; first “sync” shall succeed, but subsequent sync of the changed file shall fail — so that if the file has changed (and the only reason for it to change is corruption) — it would be impossible to modify the original.

There are few approaches how to achieve that: for example, with some storage providers you can generate access keys that only allow upload but not modify, rename, or delete.
Then your sync tool will simply fail to sync corrupted version.

S3 object lock can accomplish the same: uploaded objects are locked from any modification by anything for a specified amount of time.
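The behavior described above (first upload succeeds, later attempts to change the same file are refused) can be sketched with a toy in-memory store. This is not any provider's API; WriteOnceStore and sync are hypothetical names that only model the idea of upload-only keys or object lock.

```python
class WriteOnceStore:
    """Toy stand-in for a bucket where existing objects cannot be modified."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        if key in self._objects:
            raise PermissionError(f"{key} already exists and cannot be modified")
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

def sync(local_files: dict, store: WriteOnceStore) -> list:
    """Upload new files; return names whose changed content was refused."""
    rejected = []
    for name, data in local_files.items():
        try:
            store.put(name, data)
        except PermissionError:
            # The good original stays untouched; we only report the mismatch.
            if store.get(name) != data:
                rejected.append(name)
    return rejected

if __name__ == "__main__":
    store = WriteOnceStore()
    print(sync({"photo.jpg": b"good bytes"}, store))   # first sync, nothing refused
    print(sync({"photo.jpg": b"g00d bytes"}, store))   # corrupted copy is refused
```

A refused upload is then a useful signal: for immutable media, a changed file is most likely a corrupted file.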

Some sync tools (including rclone) offer a basic “backup” feature — where they sync changed files into a subfolder next to the original. While you don’t really care about uploading corrupted new version, this is just a way to ensure it does not overwrite a good original.

You can access storj two ways: using native integration, (storj: url) or using s3 gateway.

For most home users the latter will provide better outcome:

native storj protocol works peer to peer, and creates a lot of connections to storage nodes. This is its strength and robustness but this also can knock out most residential internet equipment. I have an enterprise router at home, but ISP issued cable modem, and while router is fine, the modem goes offline if I use 20 threads with duplicacy when using native storj integration.
using storj hosted s3 gateway makes it behave just like any other s3 storage, and does not require massive number of tcp connections. The gateway handles talking to peer-to-peer storage nodes.
another benefit of the hosted gateway is upstream traffic savings: the native storj remote uploads about 2.7x more data (this is how the storj network works: data is split into segments of up to 64MB, each segment is erasure-encoded into pieces, and the pieces are uploaded to 80 storage nodes. Any 29 pieces are sufficient to recover the segment, and this provides the massive resilience they promise. This is where the 2.7x upload amplification comes from. When you use the gateway, that amplification happens at the gateway)
one downside of using a gateway: it “centralizes” the decentralized storage, so if you live far from the gateway, your latency and performance could be worse compared to when you upload to the decentralized swarm.
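The upload amplification mentioned in the list above is plain arithmetic: 80 pieces are uploaded, any 29 are enough to recover. A minimal sketch using only those two numbers from the post (the function names are made up):

```python
def expansion_factor(total_pieces: int = 80, required_pieces: int = 29) -> float:
    """Ratio of data uploaded to data stored for a k-of-n erasure code."""
    return total_pieces / required_pieces

def native_upload_gb(payload_gb: float) -> float:
    """Approximate upstream traffic when uploading via the native protocol."""
    return payload_gb * expansion_factor()

if __name__ == "__main__":
    print(f"expansion factor: {expansion_factor():.2f}x")
    print(f"500GB backup -> about {native_upload_gb(500):.0f}GB sent upstream natively")
```

With the hosted S3 gateway, your own connection only sends the payload once and the expansion happens on the gateway side.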

But this is a deep rabbit hole, and it is specific to storj; we can discuss it further if you want.

Awesome, I was going to suggest the same :).

The simpler the solution, the better. Start backing up the whole thing with duplicacy with default settings. See how it goes. 500GB is not that much data at all; optimizing it to squeeze out an extra $.20 per month would not be worth the extra complexity. Maybe even OneDrive will be enough, and you will save some money (assuming you already have a subscription for other purposes). Test restore too, to see whether performance is fine if you need to restore in a pinch.

The restore performance can decrease dramatically the more data you have backed up: duplicacy uploads a lot of small files into a few folders. This is not a problem for object storage services like S3, B2, Storj and Wasabi, but it’s an unusual use case for OneDrive, Dropbox and other “file sharing” services: enumerating those files may take an obscene amount of time for large backups. They are not optimized for these scenarios, as users normally don’t have millions of doc files in their folders. This also means they don’t have an incentive to fix any inefficiencies.

The problem is that you will only notice it once you have backed up enough data. And the sunk cost fallacy may kick in.

Last bit — if your OS is macOS or Windows, consider using -vss flag with backup. This will have duplicacy create a local filesystem snapshot and backup that snapshot. This ensures that all files in the backup are taken in the same instant of time, regardless of how long actual data transfer takes.

edgardiego · 16 September 2023 19:27

I’m trying to find that option in the GUI but I simply cannot find it.

Maybe I’m actually suffering throttling and I’m not even aware. How could I detect it? Anyway I could try with 2 and just see if it improves anything.

As for everything else you are proposing, most of what you have shared is way beyond my current knowledge. I’m trying to follow as much as I can, looking up everything I don’t understand (not being a native English speaker is also a barrier when dealing with technical terms).
What I’ve understood so far: data corruption is still a possibility even with static data, so using FreeFileSync is not an option since it does not provide any protection against it.
You recommend using Storj for performance reasons (which, from what I’ve seen, charges per TB uploaded and downloaded; this is pretty unclear to me since I’m more used to the way billing works on OD, Dropbox and similar services), but OD could be usable. So far I’m using OD only because it includes Office for a price similar to its competitors.
About the software you mentioned, rclone, which I believe is CLI based: I chose Duplicacy because I feel more comfortable having a GUI. Is Duplicacy not a good option then? In the last part you go back to Duplicacy, and here I assume you mean alongside OD. If I finally go with Storj, would I have to use the S3 gateway? I figured that it must be another backup software, but after googling it, it appears to be an Amazon cloud service. I’m pretty lost in this part.
The -vss part I assume would be in the Backup tab of Duplicacy, next to -threads, what is the difference between a Command and Global option?

To conclude, I want to thank you for answering in so much detail. I know that most of what I’m asking are probably ridiculous questions, and that it must be frustrating to lose time with someone who is so off track. I think that I’m getting into something well above what I’m capable of doing; maybe I should just stick to physical backups. This seems a bit overkill for such a mundane use.

Thanks again!
Edit: I’ve kept searching for information and found out Backblaze offers a cloud backup for domestic use which is 5.42/month when getting a 2 years plan. However, even though backblaze is widely recommended in forums I’ve never seen anyone mentioning it, is there a reason for that?

saspus · 16 September 2023 23:04

I don’t think there is an option in the GUI to specify this. Hmm… I don’t use GUI myself, so I’m not sure if it’s possible to specify at all, perhaps manually somewhere?

If you click on the completed or in-progress backup in the dashboard you shall see logs. The back-offs shall appear in the logs. You can increase verbosity of the logs if you click on the - in the list of backups and add -d global option.

Correct.

[Assuming OD is OneDrive].

With Storj and Backblaze you don’t pay for upload, only for storage and download. With Wasabi you only pay for storage; download is free, up to the amount stored. But with backup you rarely if ever download, so it does not matter.

Storj charges $0.004 per GB per month stored, plus a small per-segment fee. A segment is a file: they want you to use larger files, to improve performance and network efficiency, hence the advice to increase the duplicacy chunk size. They charge $0.007/GB for download.

Backblaze B2 charges $0.005 per GB per month to store, and $0.01/GB for download.

Not only is it slightly more expensive, it is also centralized, so unless you live close to their datacenter, performance may suffer.
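To make the pay-per-use numbers concrete, here is a small sketch that plugs the per-GB prices quoted above into a monthly estimate. It ignores Storj's per-segment fee, taxes, and any free tiers, and prices may have changed since this post, so it is only a rough guide.

```python
# Per-GB prices as quoted in this thread (September 2023); check current pricing.
PRICES = {
    "storj": {"storage": 0.004, "download": 0.007},
    "b2": {"storage": 0.005, "download": 0.01},
}

def monthly_cost(provider: str, stored_gb: float, downloaded_gb: float = 0.0) -> float:
    """Storage plus download cost in USD for one month (uploads are free on both)."""
    p = PRICES[provider]
    return stored_gb * p["storage"] + downloaded_gb * p["download"]

if __name__ == "__main__":
    for name in PRICES:
        idle = monthly_cost(name, stored_gb=500)
        restore = monthly_cost(name, stored_gb=500, downloaded_gb=500)
        print(f"{name}: ${idle:.2f}/month idle, ${restore:.2f} in a month with a full restore")
```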

This is much preferable to the cost structure of “*Drive” services, because you pay for exactly what you use as opposed to pre-paying for the next largest tier. With Drive services you are always overpaying, because you pay for the space even if you don’t use it. Does it make sense?

You can still use OneDrive and other Drive storage for duplicacy, but performance will be so-so. It makes sense to do if you already have, e.g., a OneDrive subscription for other reasons, and have leftover storage. Then it’s “free” storage for backup.

I’m very sorry for the confusion!!

S3 is a protocol that was developed by Amazon for their storage service, but pretty much every other storage provider adopted it too, including Storj.

You can access Storj using their native protocol, or via their S3 gateway. So my suggestion was to use Duplicacy, with STORJ, via their S3 gateway.

This will result in a cheap solution, with no extra load on your network.

Correct. Note, this will require duplicacy to run as Administrator, or root user, because creating snapshots requires elevated privileges.

I never understood the point of separating them either! I guess global options work for all commands, and the Command options only with the specific command, but this does not explain why they are separate in the UI. Makes no sense to me either.

Nonono, this perhaps sounds much more difficult than it needs to be :), it’s my fault for not explaining better.

Let’s make it super simple.

You already have your OneDrive subscription and extra space, right? Add the OneDrive storage to Storages, and configure the periodic backup to that storage, with 2 threads. This is all doable in the UI.
That’s all you need.

What OS are you using? Assuming Windows, when you install duplicacy web UI, run the installer “As administrator”. Then you will see an option to install it “for all users”. In this case the -vss flag will work.

Let it run for a few weeks, see how it works. Do a few test restores. See if performance is adequate. If it is, there is nothing else to do. If it isn’t, you can explore other storage providers, like Storj or B2.

Personally I use CLI, and I’m not well versed on the current capabilities of the UI version.

Their Personal and Business “Unlimited” backup? In my opinion, having tried it twice in the past, two years apart, it was horrible. It’s a resource hog; it clogs your upstream connection uploading trash temporary files, while your important documents can stay in the queue for up to 48 hours. Don’t use it. It’s surprisingly bad.

When people recommend Backblaze, they usually refer to the Backblaze B2 service, pure storage that you can use with other apps. Not their horrible attempt at a backup application.

If you want to try other backup applications one option to consider is Arq 7. It’s super polished and extremely easy to setup.

(I also have an anti-recommendation: don’t use Duplicati [sic]. It does not have a stable version and is not very robust or reliable, even though it appears high in the search results).

But if you are OK using a command line tool (which I totally recommend figuring out how to use; it’s easier than it sounds) then duplicacy CLI would be the clear winner. You can configure it to perfection and schedule it with Windows Task Scheduler instead of a UI…

Please feel free to continue asking for clarifications if something is not clear, that’s the whole purpose of the forum – for the users to help each other! I don’t consider time helping someone else “lost” in any way. On the contrary.

edgardiego · 17 September 2023 09:58

From what you are saying the UI is kind of limited, so I guess it may be worth giving the CLI version a shot, even though the learning curve may be a bit steep. I have rarely used a command line since college, and to be honest I’m kind of afraid of messing something up without a GUI; that was the reason for getting Duplicacy in the first place. However, I can’t find much documentation for the GUI version, so it would probably be easier to actually learn how to use the CLI version.

After reading the whole post, you recommend Duplicacy against other CLI based options such as rclone?
Not sure If they have a different set of features, so maybe I’m just comparing apple to pears.

So I would need to do a complete trial run to notice it; I assume there is no included benchmark option.

Wasabi, as far as I understand, works similarly to most Drive options in the billing aspect: $7/TB without a download fee. B2 and Storj work similarly to each other, with Storj being slightly cheaper and, depending on the location of the nearest datacenter, faster. Why is B2 so much more popular then?
I’ve created an Storj account already to take a look and, besides the crypto payment which I’m not familiar with, everything looks pretty straightforward. So far It is my main alternative to OD.

To be honest, I’m getting more inclined to directly abandoning OD, as you say I’m not using the whole TB I’m paying for. Getting my backup in Storj, rounding my needs up to 500gb would be roughly 2$/month against the 7$ I’m paying for OD. Even though my decision to use it was the inclusion of Office, which is a good addition even thought I really don’t use it that much, I could look up for some alternatives. Moreover, Storj seems to be way more flexible, not only for backups. Keeping both services would not make much sense in the end, so I assume I will have to choose between giving up Office, or settling for a way poorer backup solution.

I get it now, so I could use Storj with their protocol, which I guess would be directly through the web, or with a propietary app, or through Duplicacy and S3.

That’s correct, I’m using Windows 11. I guess this is the “install as a service” checkbox I simply ignored. After reinstalling Duplicacy with that option, I would have to use -vss so that it creates a snapshot and uploads all the documents as of the same point in time. I could do that by specifying it on each backup as a “command option” or just once by using a “global option”.

I’ll just discard the Backblaze backup service then, after checking Arq 7 it is 50$ per computer (and I’m assuming an unique payment but you have to pay for updates), and I believe I would have to pay for a storage provider anyway, so I don’t get which is the advantage, am I overlooking something?

Really appreciate your answers, I’ve actually got a much clearer idea of what we’ve discussed so far. I’ve got some other question relative to how actually backups work.

First of all, I get that in case of a disk failing, a ransomware attack, or any other circumstance that ends up with your data being lost, it is as simple as retrieving a copy from the cloud. What I actually don’t get is how it prevents data corruption for documents that are rarely visited. For example, in my case one of the things I want to protect is all my photos; I have some dating back 15 years that, even though I obviously don’t look at them frequently, I want to remain protected. I get that duplicacy would create versions of each photo and keep them for 1800 days, for example. (I think this was the default option.) However, if an image gets corrupted but I don’t open it for more than 1800 days, and therefore don’t notice that there is an error, at some point the backup will be overwritten by the faulty one. Does duplicacy have any system to check for data corruption?

Other aspect is encryption, I get that by dividing in an specific way all the data into chunks, they can be only retrieved by someone that know how they were split. But how could I recover the data from another PC, I assume that it must have some kind of pass.

Thanks again!

saspus · 18 September 2023 06:02

Yes, that’s about right. GUI gets the simple jobs done and is probably enough for most people, but if you want finer control – CLI is easier, because you don’t have to fight the middle men, and figure out how to tell GUI what you want to tell CLI. In addition, for personal use CLI is free, so there is that.

Yes, sort of. Rclone is a sync tool (no versioning, beyond a rudimentary “copy different file to another folder”); Duplicacy is a proper versioned, point-in-time backup, with configurable version history. You need backup, so use duplicacy. But since most of your files were media, I suggested rclone as an alternative, since you don’t need versioning for those files.

But if you want to keep it simple, use duplicacy for everything, including media files. It will work just fine.

See, this is one of the things that CLI supports but is not exposed in GUI. Duplicacy has a duplicacy benchmark command where it can benchmark storage.

Unfortunately, this may not tell you the full picture. If your storage is empty, the benchmark will upload a few test files and check performance, and it will be fast; but once your storage fills up, and the target contains hundreds of thousands of files, enumerating those files can be very slow on OneDrive, as it was designed for a different use case. Increasing the average chunk size can help here too, by reducing the number of chunks required, at the expense of slightly worse deduplication. But it will be a net benefit in the end.

I dislike wasabi for their predatory pricing. They claim it’s a “hot storage” at simple pricing but in reality there are gotchas:

You will pay for a minimum of 1TB regardless of how much you use.
If you delete a file before storing it for 3 months, you will pay a fee equal to the cost of storing it for 3 months.
And if you download more than what you store, they can ban your account. Not a nice prospect.

It benefits from having started earlier.

Don’t worry about crypto tokens. Storj runs a network of independent node operators located all over the world, who contribute underutilized resources on their servers and get compensated for doing so. Figuring out tax codes and international payments is an insurmountable task, so they’ve created a utility token on the Ethereum network to facilitate the payments: they calculate the payment in USD, convert it to tokens, and pay operators. Operators can convert tokens to their local currency and deal with local taxes themselves.

As a customer, you can pay for storage with a plain old credit card. But they also accept their own tokens, because it would have been silly if they did not.

Can’t agree more. I discovered storj a few years ago and am using it for everything now; it’s awesome. It’s way underpriced for what it is. Geo-redundant storage that behaves like a world-wide CDN at $4/TB!? Come on! If you want geographic redundancy with B2 or Amazon, you pretty much have to pay double.

Good point. I was a big Microsoft services aficionado in the past, but they keep going downhill since that office365 rebranding, so I’m no longer using nor recommending them. Google Docs/Sheets and Apple Pages/Numbers more than fulfill my needs (and I was quite a power user of Excel), so paying for an office subscription when you can have the same or better tools for free seems counterproductive.

Right. Their service talks their own protocol, but they also run separate servers that “bridge S3 to their own protocol” so customers that have software that already supports the (de facto standard) s3 protocol can easily switch to storj with their existing software.

The web UI is really useful just for small tasks, they are intended to be used via the storage API, by other apps, like Duplicacy, rclone, etc.

At each backup as a command option. If you put it to “global” options duplicacy won’t like it, if I remember it correctly.

Awesome!

Very good question.

Some filesystems have built-in checksumming support, and can verify whether a file got corrupted, and even restore it from redundancy information if that is available. Systems that run such filesystems often run a periodic task that reads every file, compares checksums and repairs any rot. This is usually a feature of NAS appliances. General-purpose home machines, including Windows, macOS, etc., don’t have that capability, and you are absolutely right: if you keep an old photo on an HDD for 7-10 years it will likely rot. On an SSD it can happen even faster, especially if the SSD is not kept powered.

The technical term is bit rot.
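If your filesystem does not do this for you, a crude version of the same check can be done by hand: record a checksum for every file, and later look for files whose content changed even though their modification time did not. A minimal sketch, assuming nothing about duplicacy itself; all function names here are made up, and an unchanged timestamp is only a hint, not proof, of rot.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in blocks so large videos don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

def build_manifest(root) -> dict:
    """Map relative path -> (sha256, mtime in ns) for every file under root."""
    root = Path(root)
    manifest = {}
    for p in sorted(root.rglob("*")):
        if p.is_file():
            manifest[p.relative_to(root).as_posix()] = (sha256_of(p), p.stat().st_mtime_ns)
    return manifest

def suspected_rot(old: dict, new: dict) -> list:
    """Files whose content changed while the modification time stayed the same."""
    return sorted(
        name
        for name, (digest, mtime) in new.items()
        if name in old and old[name][1] == mtime and old[name][0] != digest
    )
```

You would save the manifest somewhere safe (for example as JSON), rebuild it every few months, and restore any flagged file from an older backup version.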

This is a very valid concern that many overlook.

Duplicacy cannot distinguish file corruption from deliberate change. So it will see “oh, file changed, let me back up this new version” and indeed back it up as a new version. Now if you notice at some point in the future that this photo got corrupted, you should be able to restore the older, uncorrupted version from the backup. That’s where long version history comes in handy, and why for something to be called “backup” it has to have a long version history.

Keeping data for just 1800 days is not enough. It’s just a default, you can then manually edit the command line (the -keep options) to specify any other suitable retention. A common approach is to:

Keep all backups made in the last two weeks
After two weeks, keep a version every day
After 90 days, keep a version every week
After one year, keep a version every month

If data does not change – new versions don’t occupy any extra space. The keep parameters for the prune command to implement this scenario may look like so: -keep 31:360 -keep 7:90 -keep 1:14 -all
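To see what such a set of -keep n:m options does to a history of daily backups, here is a simplified simulation. It reads -keep n:m as “keep one snapshot every n days for snapshots older than m days” (n = 0 meaning delete), which is my understanding of the prune documentation; the real prune command may differ in edge cases such as rounding, so this is a model rather than a reimplementation.

```python
def thin(ages_days, policies):
    """Return the snapshot ages (in days) that survive pruning, oldest first.

    policies is a list of (n, m) pairs, as in "-keep n:m".
    Snapshots younger than every m are always kept.
    """
    policies = sorted(policies, key=lambda p: p[1], reverse=True)
    kept = []
    last_kept_age = None
    for age in sorted(ages_days, reverse=True):  # walk from oldest to newest
        interval = next((n for n, m in policies if age >= m), None)
        if interval is None:          # newer than all thresholds: keep everything
            kept.append(age)
            last_kept_age = age
        elif interval == 0:           # "-keep 0:m" deletes everything older than m
            continue
        elif last_kept_age is None or last_kept_age - age >= interval:
            kept.append(age)
            last_kept_age = age
    return kept

POLICY = [(31, 360), (7, 90), (1, 14)]  # -keep 31:360 -keep 7:90 -keep 1:14

if __name__ == "__main__":
    survivors = thin(range(400), POLICY)  # 400 days of one backup per day
    print(f"{len(survivors)} of 400 snapshots remain")
```

The point of the exercise: even after years, the number of retained versions stays small, while old versions remain available for recovering a file that silently went bad.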

By the way, the same concern is valid with respect to the target storage. If, for example, you are backing up to a local USB HDD, it too can rot and corrupt the chunk data that duplicacy writes; as a result you would not be able to restore some data.

Duplicacy offers a feature, called “erasure coding” where you can configure it to write chunk data with redundancy, to be able to tolerate occasional bit-flips, but it’s not a panacea. Ideally, you want to avoid backing up to the isolated hard drives, as they cannot provide data consistency guarantees. Cloud storage providers – usually do. Storj, for example, encrypts all data by default with the user-generated keys, so if data was to be corrupted it would just simply fail to decrypt; therefore by design it can’t return you bad data.

Someone suggested a radically different approach to me for archiving family photos and videos: a low-tech solution that involves Blu-ray M-DISC with a stated 1000-year retention. That’s something to consider.

When you init the storage you have an option to encrypt it (see the -e flag in the init command). Then, for both backups and restores, you would need to provide that encryption key. Without that key the data just looks like noise and is unreadable by anyone. (Duplicacy can use the keychain on macOS and Windows, so you don’t actually have to specify it every time, of course.)

For advanced scenarios duplicacy supports an interesting feature: RSA encryption. It allows you to generate a key pair, and duplicacy will use the public key for backup. To restore, however, you will need to provide the private key. This is useful if you back up from multiple computers to the same storage, to take advantage of duplicacy’s cross-machine deduplication (a major selling point), but don’t want users to be able to restore each other’s data.

edgardiego · 18 September 2023 16:24

I would say that it misses some important options even for someone who is starting out; not being able to modify the chunk size, for example, seems like an important parameter optimization-wise. Being free would be an advantage if I had not already paid for the GUI version, which it seems I’m not actually going to use.

As you say, I’ll probably use only duplicacy for simplicity’s sake; learning one CLI-based program already seems complex enough, and as someone new in this area I have to find out what each parameter does before breaking something. However, and I know I already asked this before, I want to be sure I really understood: versioning could be useful even for media as a bit rot prevention policy.

Sunk cost fallacy once again

I guess that I won’t benefit from geo-redundancy as a domestic user; however, if it provides the same services for a lower cost it is indeed an easy choice.

You are completely right, I’ve decided to leave office365 and go for STORJ + Google suite, Apple suite, or LibreOffice instead (I’ll have to compare them).

What is the purpose of global options then?

Seems like a kind of “brute force” approach, but easy to implement. The only way that I can think of to actually have control over this would be having two separate backups and somehow comparing them periodically. I guess that big companies surely have this sorted out already.

I use Bitwarden for all my passwords, so I assume that by keeping that encryption key there I could recover my data even from another computer. Does this password have to comply with any specific requirements?

I’ve started to tinker a bit with the CLI version to use it alongside STORJ, which so far has been a slow and confusing process for someone who has never used cmd on Windows.
After reading some documentation, and with the help that has been provided here, I’ve come to the following conclusions.

I’ve renamed the exe to duplicacy.exe and placed it in C:/Program Files/Duplicacy, then added said exe to PATH using the Environment Variables editor.
I assume the next step would be to init the storage by going into the folder I want to back up and then running: duplicacy init -e -c 32 storage_name storage_url
Then I would have to enter the encryption pass. I used 32 MB following your previous recommendations, since I believe that I cannot modify the average chunk size later. As for the storage_name, it would be whatever I want to name it, STORJ or whatever; as for the URL, I used the “Create Keys for CLI” option on the STORJ web.

Edit: after figuring out that I had to do it in PowerShell instead of cmd, and after a few attempts, I managed to init the storage. However, I forgot to move to the directory I wanted to back up and did it on C: instead. Is there a way to change this or do I have to start over? In case of having to start from scratch, what should I delete? I thought that set might allow me to do it, but I don’t see any way to change the repository. I want to avoid C: being the repository assigned by default.

Then I would have to use backup -vss (for which I think I have to grant root permissions somehow beforehand), but I still have to check how it works.
Am I missing something important? What worries me the most is that some options cannot be modified later; I’d rather go slowly than have to start all over again later.

Thanks again! Really appreciate such detailed answers.

saspus · 19 September 2023 04:40

Yes. Not rot prevention – because you can’t control that, it will happen sooner or later, but rot recovery. Once you discover that the specific file got corrupted – you can restore uncorrupted version of it.

It is not about where you are, but where your data is. When you are using a conventional centralized provider, like B2 or Wasabi, you pick where (in which datacenter) your data will reside, usually based on its geographic proximity to you. But that’s just one datacenter. Even though there should be good security, sprinklers, and whatnot, flood, fire, earthquakes, and maybe even sabotage are still a thing. You don’t want to lose your data in the unlikely event that a datacenter goes up in flames or is swallowed by the San Andreas fault. So, many providers offer geo-redundancy, where they keep a copy of your data in more than one geographic location. And charge you more for that.

Storj on the other hand, stores data distributed across the world by design. So you get this feature for free. If the entire continent sinks – your data will still be recoverable. They have a well written Whitepaper, have a look. It’s an interesting read.

Global options are these: Global Options · gilbertchen/duplicacy Wiki · GitHub. They control application behavior, as opposed to command options, which control a specific command’s behavior. Why there are separate edit boxes for them in the UI - I don’t know. I’d think you should be able to just put them all into one box and let the app figure out which one is which. The CLI does it anyway, so why ask the user to separate whites and colors and then plop them into the same CLI? No idea. Maybe some historic reasons.

Yep, that’s perfect. Nope, no requirements. It’s not a “password” per se, so much as it is an “encryption phrase”. I personally prefer to use Diceware so if I need to type it – I don’t have to struggle looking for special symbols. 1Password can generate such passphrases, perhaps Bitwarden does too.

Two things here. Duplicacy will create a hidden folder .duplicacy, to keep its own config, in the folder you call init in. A common approach is to init in the C:\Users folder, or the C:\Users\you folder, and back up the whole content of Users or you.

Alternatively, you can specify where duplicacy shall create preferences directory with --pref-dir parameter to init call. This may be preferable, as you can then place it anywhere you want, for example C:\Users\you\APPDATA\Duplicacy would be a good location if you plan to backup only your user folder, or c:\ProgramData\Duplicacy if it is system wide.

You have two options here.

Use the storj:// protocol,
or use the s3:// protocol.

Since this is a home computer on the home network I suggest using s3. You can create s3 credentials on the Storj website; they will give you all the parameters you need for the s3 connection, which you will configure following the format described in Storage Backends · gilbertchen/duplicacy Wiki · GitHub under the Amazon S3 section.

You would need to create a bucket first on the Storj website. There you have to choose an encryption key for your data. Since duplicacy encrypts its data anyway it’s kind of redundant, but storj is always end-to-end encrypted, so you have to pick an encryption passphrase. Create another diceware passphrase. You will need to use it once to create the S3 credentials. When using the S3 gateway, the gateway has the encryption key, so it’s no longer end-to-end encrypted, but we don’t care – duplicacy encrypts data anyway.

Have a look at this – maybe some bits will be useful as a reference: How to use Duplicacy to backup securely to StorJ cloud storage | Curt Warfield

No need to start over. You can just move that hidden .duplicacy folder into the correct directory.

set allows you to save variables like credentials and access keys to the file .duplicacy/preferences. Duplicacy should be able to save them to your Windows keychain, so you probably won’t need to do it manually.

When you create a backup task in your Task Scheduler you would tick an option called “Run as administrator” or something like this. Or if you want to run it manually in the terminal – run the terminal (or PowerShell, or cmd) with elevated permissions.

Nope, looks like everything is covered.

To summarize

on storj satellite:

create bucket (Save passphrase)
create s3 credentials (provide the passphrase, choose which bucket to allow access to, and give full access to that bucket)
Save s3 credentials.

On your PC, in the elevated command prompt (to use vss):

init duplicacy under C:\Users\ or c:\Users\you, or wherever else,
Provide the s3 connection parameters to the init string,
run duplicacy backup -vss

On your PC, in Task Scheduler,

follow the wizard to create a periodic backup task that would CD into the directory duplicacy was initialized in, and run duplicacy backup -vss periodically.

Then, once this is all working, you can configure another schedule to run duplicacy check – that ensures that all chunks it expects to be on the storage are indeed there, and prune – that will thin the backup history according to some plan discussed in one of the previous messages, so you don’t end up with thousands of revisions after 5 years of hourly backups :).
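A rough sketch of the three scheduled commands from the summary above, written as a small Python helper that only builds the command lines (it does not run them). The thread count and the -keep retention values are illustrative assumptions, not the plan discussed earlier in the thread; adjust them to your own schedule.

```python
import shlex

def duplicacy_commands(threads=4, keep_rules=((0, 360), (30, 180), (7, 30), (1, 7))):
    """Build argv lists for a backup, a check and a prune run.

    keep_rules holds (n, m) pairs meaning "keep one revision every n days
    for revisions older than m days"; the defaults are example values only.
    """
    backup = ["duplicacy", "backup", "-vss", "-threads", str(threads)]
    check = ["duplicacy", "check"]
    prune = ["duplicacy", "prune"]
    for every_n_days, older_than_days in keep_rules:
        prune += ["-keep", f"{every_n_days}:{older_than_days}"]
    return [backup, check, prune]

# Print the command lines; run them from the directory duplicacy was initialized in.
for argv in duplicacy_commands():
    print(shlex.join(argv))
```

Each of these would go into its own Task Scheduler entry, with the backup task running elevated so that -vss works.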

Edit. If you are going to init duplicacy under Users or Users\you, I would recommend configuring filters, to exclude some files or folders from backup. Stuff like Temp folders, Browser caches, etc. There is no reason to waste time and bandwidth and storage to backup those. Include Exclude Patterns · gilbertchen/duplicacy Wiki · GitHub

saspus · 19 September 2023 05:22

Missed that. Arq has two offerings. You can buy the program and use it with whatever storage you want. You get to use it forever and get a year of updates. That’s what I use. I don’t pay for updates every year. In fact, I did not update for the past three years, and only updated a few months back – because they added a feature I found useful (backup of ephemeral, cloud-only files).
Another option: you pay about the same amount of money annually, but get 5 licenses and 1TB of included storage. You can buy more storage in 1TB increments. It’s an interesting proposition if you have many computers with not much data each.

This did not work for me – I prefer one license and a cloud of my choice. I have one Mac that syncs data for all users from iCloud, runs Arq, and backs it up to Amazon Deep Archive. All other family members’ Macs, phones, iPads, and whatnot simply keep all data in iCloud.

edgardiego · 19 September 2023 08:11

I was thinking of it as a way to improve performance for example in the case of a business with multiple subsidiaries in different continents. It is indeed something good to have in any case and I’ll try to take a look at that paper, I’m curious about some aspects of it.

Bitwarden allows creating passphrases without special symbols; I went for an alphanumeric one of around 20 characters, which may be a bit overkill.

About the rest I think I made a few mistakes, and after your explanation I would have to:

As I said, I ran init in the wrong directory, so I should move the .duplicacy folder from C:/User/myuser to D:/backukfolder. I have 4 disks in my PC: one SSD for the OS and job-related programs, a second SSD as a gaming library, and lastly an HDD RAID0 (D:) which I use to store data. This last one is the one I want to back up. It also holds films and similar objects I don’t need backed up, hence the reason for backing up just one folder; in addition, it simplifies the process by avoiding the need for filters. After moving the .duplicacy folder, would the chunk size and passphrase stay the same? Would D:/backukfolder then be considered the default storage?
I created the bucket, but I copied the credentials created by using the “API Key” option in the STORJ access area. I’m guessing that these are the STORJ protocol credentials, so I should have used the “S3 Credentials” option instead. From what I know duplicacy creates a file in the storage destination, so maybe it would be advisable to delete the bucket and create a new one?

My first idea was to delete the bucket and the .duplicacy folders and just start over, but I wasn’t sure if it created config files elsewhere.

I think I overlooked that option during the setup process.

I find that to be the most appealing option too. What I don’t get is what the benefit of using it is. I mean, it is appealing for someone like me who is not well versed in the CLI and wants something easy to use, which is certainly not your case. What am I missing?

saspus · 19 September 2023 17:49

Yes. Duplicacy runs entirely on the client, and the configuration file is all it needs to be able to connect to the storage and know what to back up.

So I assume you connected duplicacy to your bucket using storj://… url string.

No, you don’t have to delete the bucket. You just need to create new S3 credentials, then open the .duplicacy/preferences file in a text editor and replace the storj://… url with the new s3://… url.
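For illustration only, the same URL swap can be sketched in Python. This assumes the preferences file is a JSON list of storage entries with "name" and "storage" fields; check your own file before relying on that layout. The storj://… and s3://… values below are placeholders, and the function is a hypothetical helper, not part of duplicacy.

```python
import json

def switch_storage_url(preferences_text, new_url, storage_name="default"):
    """Return the preferences JSON with the named storage pointed at new_url.

    Assumes a JSON list of objects with "name" and "storage" keys
    (an assumption about the file layout, verify against your own file).
    """
    entries = json.loads(preferences_text)
    for entry in entries:
        if entry.get("name") == storage_name:
            entry["storage"] = new_url
    return json.dumps(entries, indent=4)

# Placeholder content standing in for .duplicacy/preferences
sample = '[{"name": "default", "id": "my-pc", "storage": "storj://..."}]'
print(switch_storage_url(sample, "s3://..."))
```

Editing the file by hand in a text editor achieves exactly the same thing; keep a copy of the original file first.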

As long as duplicacy has access to the storage it will work as if nothing happened. You can move that storage to another provider, a local NAS, whatever, and change the URL in preferences, or simply add a new storage with the new location, and duplicacy will recognize it and continue as if nothing happened.

No need to delete anything. Of course, you can delete and start over, but why, when all you need is to replace a URL.

To answer your question, yes, .duplicacy folder is the only thing it writes stuff into, for better or worse.

You don’t need this, because on Windows duplicacy can use the system keychain and store credentials there. It’s more secure than having them in plaintext in the config file.

For me it was three features that duplicacy is missing, which I hope it will get sooner rather than later. I don’t care much about UI – but yes, Arq has a very good UI, so it’s a plus.

The features are as follows (most specific to macOS)

Support for Archival tier storage. I don’t want to pay for hot storage, I want to use Glacier Deep Archive for my backups. Duplicacy does not support this today.
Support for backing up user-mounted filesystems. For example, if you have multiple users logged in, and one of the users mounted a disk from the server with their credentials, and another user did the same with their own credentials. This can be a network filesystem, or a virtual one, does not matter. I want to be able to back up those. Arq does that using a helper process that impersonates each user and pumps data to the main process. With duplicacy this is impossible to achieve.
CPU throttling. I’m using a laptop, and often even with 5% battery charge left I can work for hours. But if duplicacy kicks in, it runs at full throttle, instantly sending the laptop to sleep. Arq lets you configure the amount of CPU usage.

These were dealbreakers for me, especially the first one. Otherwise duplicacy is perfect: it supports Time Machine exclusions, is very fast, and has a powerful exclusion engine, if with a bit of a steep learning curve.

The last one you can work around with the CPULimit utility. But the first two need to be addressed in the app.

There is a fourth feature, that Arq implemented, that prompted me to pay for update, and I find it extremely useful. I filed a feature request for Duplicacy here: Support for dataless files

edgardiego · 19 September 2023 22:54

I ended up starting from scratch; I could not find in preferences a way to input the Access Key and the Secret Key. In addition, I had to work a bit on the URL: for some reason it required me to include a region, even though STORJ is by definition decentralized. I ended up using EU1 and it seemed to work.

My next step is to run the backup, where I’ll use -vss and a number of -threads I have yet to decide.
Regarding this, what I understand so far is that increasing the number of threads would cause a proportional increase in speed, up to the limit of my current bandwidth. Right now I’ve got a symmetric 1 Gbps connection, maybe less since I use a VPN. What is the common approach to determine the optimum number?

I assume that this would be cheaper but with some restriction to retrieve the data, probably not a problem when talking about backups.

In my case this would not be a problem since I’m using a full tower PC, however I imagine that the use of resources could slow the PC while the backup is taking place? Is there a way to prevent this? It must be annoying if it happens while I’m doing some heavyload job or gaming.

To be honest, Arq7 seems like a good option for someone like me. It would have been a lot less painful at least.
However, I’ve seen some negative comments in the forum, especially about the Windows version, so at this point I’ll probably just stick to Duplicacy.

Thanks again!

saspus · 20 September 2023 03:01

Oh yea. I have seen it with other services too, like Minio, you have to provide region even if the server ignores it.

There are two questions to consider. Do you actually need the backup to go full speed? It should probably be a background process that does not interfere with anything and slowly does its job.

To determine the optimal number, keep increasing the thread count until performance no longer increases. These cloud storage services are designed to work multithreaded. For example, Backblaze provides about 10Mbps per connection. So if you want to upload faster, you have to use multiple threads.

You can use duplicacy benchmark command to test various thread parameters quickly.
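The "keep increasing until it stops helping" rule can be sketched as follows. The measurement numbers are made up for illustration, and the 5% minimum-gain threshold is an arbitrary assumption; in practice the figures would come from your own benchmark runs.

```python
import math

def threads_for_target(target_mbps, per_connection_mbps):
    """Connections needed if each one is capped at per_connection_mbps."""
    return math.ceil(target_mbps / per_connection_mbps)

def best_thread_count(measurements, min_gain=0.05):
    """Pick the last thread count that still improved throughput noticeably.

    measurements: list of (threads, throughput) sorted by thread count.
    Stops at the first step whose gain is below min_gain (5% by default).
    """
    best_threads, best_speed = measurements[0]
    for threads, speed in measurements[1:]:
        if speed < best_speed * (1 + min_gain):
            break
        best_threads, best_speed = threads, speed
    return best_threads

# Hypothetical benchmark results: (threads, MB/s)
runs = [(1, 10), (2, 19), (4, 35), (8, 36), (16, 30)]
print(best_thread_count(runs))
# With roughly 10 Mbps per connection, 100 Mbps needs about this many threads:
print(threads_for_target(100, 10))
```

Because single benchmark runs can be noisy, repeating each setting a few times and averaging before applying the rule would give a steadier answer.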

I did not realize you have fiber. You can actually try playing with the native storj integration. Depending on where in the world you are, it may actually provide you much better throughput, and your fiber box is less likely to succumb to the many thousands of connections the storj client will be making. Note, with s3 when you say 10 threads, duplicacy will upload in 10 threads and there will be about 10 TCP connections to the s3 gateway.

With storj native integration each “connection” is parallelized internally by the storj library (called uplink), so the transfers are inherently parallel. If you specify more threads on top, it effectively multiplies that. This is where consumer cable modems fail. But your fiber box might survive. Worth a shot, if you are after absolute performance. Here are some measurement results: https://www.storj.io/documents/storj-univ-of-edinburgh-report.pdf

I doubt that. You can always set one thread or schedule the backup at night. I think Windows Task Scheduler can wake the machine, run the task, and then let it go back to sleep.

You should try it. There is one month trial.

On the one hand, there are negative comments about everything; people never agree on anything.

On the other hand, Arq was a Mac application that was then ported to Windows, so maybe there is some truth to it.

Also arq5 was horrible garbage — perhaps those comments relate to that? Arq7 is quite awesome.

But I agree, I think duplicacy will cover your use case pretty well. It performs better, and is free for personal use. So as long as you don’t miss any features, it’s a clear winner.

edgardiego · 20 September 2023 17:21

Probably not, maybe just for the initial sync in order to avoid having the PC running for a few days.

Here in Spain it is really rare to find cable modems anymore, maybe only in really small towns (<500 inhabitants); almost all installations less than 7-10 years old are fiber. I thought it was kind of standard elsewhere too. However, I’ve got a really basic ZTE router provided by the ISP, which I’m not sure whether it can be a limitation.

I’ll have to check that, I assume that for small changes using one thread can be more than enough without impacting performance, therefore avoiding starting the PC at night just to run a backup (probably not healthy hardware wise). I could manually launch a backup using more threads when a big pack of data is included.

As you suggested, I’ve been playing a bit with both protocols. One thing that confused me when setting the STORJ protocol up is that duplicacy asked for a passphrase (which I thought was the encryption key) and afterwards for a password for the storage. What is each one for? I actually think that the latter is the encryption key, but did not find anything conclusive.

Probably my method is far from precise, but I’ll leave the results in case someone finds them useful:

From this I would say that for S3 the optimum would be to just use 5 threads; however, I did not test whether increasing the number of threads well above 5 would recover from the drop shown after it.
In the case of STORJ I can’t tell if the router got saturated after 20 or if that’s actually its ceiling. After 25 I tried 18 and it gave worse results than 15; the next attempt with 19 simply gave an error: Failed to upload the chunk: uplink: metaclient: rpc: tcp connector failed: rpc: context deadline exceeded.

My conclusion is that I could go either for S3 with 5 threads or go after full performance and use 20 threads with STORJ protocol. Any advice in this regard?

Really appreciate that you are taking the time to answer everything.

saspus · 20 September 2023 18:24

I live in California, in the middle of Silicon Valley, no less. We have Google Fiber and AT&T Fiber in parts of the area but, of course, not where I live specifically. I only get access to cable, and because it was designed for broadcast, the upstream is horrific by design. The hardware simply does not have enough transmit units. So I have 1000 Mbps downstream (they offer up to 2000 Mbps) and 30 Mbps upstream. It’s the absolute max available. To their credit, they are now upgrading the network to support up to 200 Mbps upstream, using a draft of a new standard, as far as I understand, that allows modems to have configurable transceivers that can be dynamically allocated between upstream and downstream, which some equipment vendors seem to already support, and it’s a huge deal. Can’t wait until they finish upgrading in my area. I use my own network equipment, Ubiquiti routers and access points, but the modem is ISP issued. That beast consumes 20W all the time doing practically nothing, and is not very stable. But it’s more stable than the alternatives, so… I guess I don’t have much choice /rant.

The passphrase is likely to unlock the buckets. All buckets on storj are always encrypted. When you generate the S3 credentials, the gateway holds the encryption key and gives you S3 credentials instead. But if you use the native storj client – then you fully control the key.

This is very nice! (Btw, discourse forum supports pasting images right into the body of the message. I.e. you take a screenshot into a clipboard and just paste it into the message editor. The forum will upload the image and insert markdown link.)

Few thoughts:

Did you measure this with benchmark command?
Was it with the default chunk size? It would be interesting to see a comparison of default vs 32 MB vs 64 MB chunk size. I’m wondering, if you specify max chunk size == min chunk size == 64 MB, will duplicacy actually create 64 MB files, or will it overshoot slightly? Storj segment size is 64 MB, so the closer you get to that the better the performance will be (and storage slightly cheaper – the per-segment fee is very small but non-zero). You can provide chunk sizes to the benchmark command.
I don’t remember if the benchmark command allows specifying how much data to upload – because your connection is fast, and if the amount of data is quite small, the transfer time may have high variance.
It’s not entirely clear why there is such a sharp drop with s3 after 5 threads. It’s rather weird.
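To put rough numbers on the chunk-size point, here is a back-of-the-envelope sketch. It assumes the roughly 500 GB of data mentioned at the start of the thread, treats every chunk as exactly the average size, and ignores deduplication and compression, so the results are only order-of-magnitude estimates.

```python
import math

SEGMENT_MB = 64  # Storj segment size as quoted in the thread

def estimate_objects(total_gb, avg_chunk_mb):
    """Return (chunk_count, segment_count) for a dataset of total_gb.

    Simplification: every chunk is exactly avg_chunk_mb, and a chunk
    larger than one segment is split into several segments.
    """
    chunks = math.ceil(total_gb * 1024 / avg_chunk_mb)
    segments_per_chunk = math.ceil(avg_chunk_mb / SEGMENT_MB)
    return chunks, chunks * segments_per_chunk

for size in (4, 32, 64):
    chunks, segments = estimate_objects(500, size)
    print(f"{size} MB chunks: ~{chunks} chunks, ~{segments} segments")
```

Fewer, larger chunks mean fewer uploads to set up and fewer segments to pay the per-segment fee on, which is the effect described above.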

Yeah, that’s your network equipment having a stroke :). You can start ping in the new terminal window to google.com, and that would start failing as well.

I suspect that a higher chunk size should provide much better performance with the native storage integration even with a smaller thread count. It feels like the majority of time is spent setting up the connection, and not actually transferring chunks.

Storj has been working on implementing TCP FastOpen to shorten the handshake, but I’m not sure what’s the state of affairs on windows. I think today it’s only working on linux. (I’m using FreeBSD and it’s not implemented there either).

It feels from the numbers that S3 performance is hindered by something – 6Mbps at 5 threads is quite low. Perhaps it’s a combination of small chunk size exacerbated by high latency to the gateway. Can you ping gateway to measure latency?
If the gateway is “far” – then it’s another reason to perhaps use the native integration. Especially since you are already getting better throughput. Increasing chunk size should let you achieve similar speed with far fewer than 20 threads. But if this was measured with an already large chunk size – then I’m not sure what’s going on.

I’ll try to measure at home again to compare.

edgardiego · 20 September 2023 19:35

I’m very surprised to hear that to be honest. My parents live in a town of fewer than 3000 inhabitants and they also have 1 Gbps, symmetric for upload and download. On the other hand, I live in Madrid, where I could get up to 10 Gbps if I wanted; I’ve got 1 Gbps for 30 bucks a month, which is more than I need to be honest. I would expect things to be much more advanced in the States. I’m glad that at least they are upgrading the network.

And STORJ passphrase is the same you set up on the web when creating the bucket? The one it asks if you want to be the same as the project or not.

I know! I just did it that way since I did not see any “spoiler” option or something like that. Next time I’ll probably just paste it.

Correct, this morning I did the S3 measurements, varying the -upload-threads and -download-threads. In the afternoon I did the same for the STORJ protocol.

I did not apply further instructions regarding chunk size. The storage init was with an average chunk size of 32 as you recommended, nothing else.

How could I know if it is overshooting? From the documentation I see that in the init command it can be added:

   -chunk-size, -c <size>          the average size of chunks (default is 4M)
   -max-chunk-size, -max <size>    the maximum size of chunks (default is chunk-size*4)
   -min-chunk-size, -min <size>    the minimum size of chunks (default is chunk-size/4)

Maybe with -max 64 and -min 64 it can be forced to always use 64 MB chunks.
In the case of the benchmark I think it could be done:

-chunk-count <count>         the number of chunks to upload and download (default to 64)
-chunk-size <size>           the size of chunks to upload and download (in MB, default to 4)

It allows it, but I left it at the default, which is 256 MB (64 chunks of 4 MB).
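The defaults quoted above can be worked through in a few lines. This is only arithmetic on the documented defaults (max is average times 4, min is average divided by 4, and the benchmark uploads chunk-count times chunk-size); the function names are made up for the example.

```python
def chunk_bounds(avg_mb, min_mb=None, max_mb=None):
    """Return (min, avg, max) chunk sizes in MB using the documented defaults."""
    low = min_mb if min_mb is not None else avg_mb / 4
    high = max_mb if max_mb is not None else avg_mb * 4
    return low, avg_mb, high

def benchmark_volume_mb(chunk_count=64, chunk_size_mb=4):
    """Total data moved by one benchmark pass, in MB."""
    return chunk_count * chunk_size_mb

print(chunk_bounds(32))          # storage initialized with -c 32 only
print(chunk_bounds(64, 64, 64))  # min == avg == max, as suggested above
print(benchmark_volume_mb())     # benchmark defaults
print(benchmark_volume_mb(chunk_size_mb=32))
```

Note that with only the average set to 32 MB, the default maximum works out to 128 MB, which is above the 64 MB Storj segment size mentioned earlier. Whether pinning min and max to 64 yields chunks of exactly that size is the open question from the previous post and is not answered by this arithmetic.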

I believe so too. After that I could barely browse on Google, so I turned off the router for about 10 seconds to restart it. How could I benchmark it extensively without this problem? Would I have to restart it every time?

ping gateway.storjshare.io reports 116 ms average. As I said, I’m using a multi-hop VPN, which could be making it slower.

I could try making some measures if you think it can be interesting. Just let me know!

saspus · 20 September 2023 21:15

That’s a direct consequence of the absolute vastness of the country and low population density – any infrastructure projects are ridiculously hard to finance. Including transportation. It’s quite a self-inflicted issue; San Jose, for example, the largest metro area in the South Bay, is zoned for single-family housing exclusively. It makes no sense. The YouTube channel Not Just Bikes has a few videos on this topic.

I’m glad you asked!

What is encrypted on storj buckets are objects. Files. The impression that a project or bucket has an associated passphrase is just an illusion, for convenience when managing buckets on the web. They don’t mean anything, they are just defaults.

They overhauled UI recently, because before it was even more confusing – it was making much stronger impression that it’s the bucket that is encrypted, and to see files you can decrypt the bucket. This is not true.

You can, in fact, have multiple files encrypted with different passphrases in the same bucket. Then when you list the bucket – you will only see files that can be decrypted by the provided passphrase. Or none, if the passphrase is wrong. This is quite a useful feature: for example, if you are developing an app that stores user data – you can use a single bucket and dump data from all users into it encrypted with user-specific passwords, and all the security separation is handled automagically.

So answering your question – the passphrase can be anything. I usually just put random garbage when I create a bucket – because I’m not going to use that passphrase. I then go generate credentials, and it asks for a passphrase again – that’s the one I’m going to create and save in my password manager. Because that’s the encryption key that will be needed to access files uploaded using those credentials.

Does it make sense?

I don’t think it’s in the menu – but this works:

[details="caption"]
 .. content ..

[/details]

meow

No, now you know it’s the limit – so don’t approach it

Yeah. That’s the reason. 116ms is HUGE. On every piece it spends 116 ms, multiple times, to establish a TCP connection, then transfers data very quickly and starts over. It’s a huge overhead. I get 15ms to pretty much anywhere, which would explain the difference.

Storj also supports QUIC, but many ISPs apparently block it. The vast majority of traffic happens to be TCP.
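A simplified model makes the latency effect concrete. It assumes each chunk pays a fixed number of round trips for connection setup before transferring at full link speed; the three round trips and the 1000 Mbps link are assumptions for illustration, and real TCP behaviour (slow start, window limits) would make high latency look even worse.

```python
def effective_mbps(chunk_mb, rtt_ms, link_mbps, round_trips=3):
    """Throughput of one connection when every chunk pays setup latency.

    round_trips is an assumed number of RTTs spent on handshakes per chunk.
    """
    transfer_s = chunk_mb * 8 / link_mbps
    setup_s = round_trips * rtt_ms / 1000
    return chunk_mb * 8 / (transfer_s + setup_s)

# Compare the two latencies from the thread, at two chunk sizes
for rtt in (116, 15):
    for chunk in (4, 32):
        speed = effective_mbps(chunk, rtt, link_mbps=1000)
        print(f"rtt={rtt} ms, chunk={chunk} MB: ~{speed:.0f} Mbps per connection")
```

In this model both a lower round-trip time and a larger chunk size raise per-connection throughput, which matches the two suggestions made in the thread.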

edgardiego · 20 September 2023 22:35

I find it pretty interesting, will check that video. Here in Spain we also have a lot of places that are almost abandoned, population is quickly accumulating in the big cities.

So the web passphrases are just for accessing the web. The important one would be the one set in duplicacy, which would encrypt the objects uploaded to STORJ. In addition there is the passphrase also set in duplicacy for the storage itself, which would encrypt everything, but on the client side.

Is it cumulative? I mean, 20 threads would be the top for the STORJ protocol, but would doing multiple benchmarks with 19 or 20 threads reach the limit? What precautions should I take to avoid hitting it in daily use?

I’ve tried changing the VPN to a simple one instead of multi-hop, which caused the ping to drop to 31 on average. It could be one of the reasons.

What would you do next? Try adjusting the chunk size somehow to use the STORJ protocol, or stick to S3 and do some more benchmarks to see if the reduced ping improves something?
What intrigues me is why the STORJ protocol does not seem to be affected that much. Is it because of the QUIC support you mentioned?