Duplicacy caused me $122 USD of cloud storage cost (file operations) in just 1 month. Arq 6 cost $22 USD for 6 months. How to fix that?

I have been testing Duplicacy for pretty much exactly 1 month now (had to buy a license yesterday). Using the Web UI.

I am using it to back up around 10 TB of data, from my HDDs, to Google Cloud Storage with the “Archive” class. I’d think that’s the ideal class for backup storage: very cheap for storing stuff, and only expensive when you actually want to retrieve anything, which ideally you never plan to do.

Google Cloud Storage Archive costs $1.20 USD per TB per month. So my 10 TB would be around $12 USD per month. Additionally, every “operation” on the cloud storage costs $0.50 USD per 10,000 operations (an operation is an action that makes changes to or retrieves information about buckets and objects in Cloud Storage).
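To make the pricing above concrete, here is a minimal sketch of the two cost components, using only the prices quoted in this post. The function names and the 1,000,000-operation example are illustrative assumptions, not figures from a bill.

```python
# Prices as quoted in the post (Google Cloud Storage, Archive class).
STORAGE_USD_PER_TB_MONTH = 1.2
OPERATION_USD_PER_10K = 0.50


def monthly_storage_cost(tb_stored: float) -> float:
    """Storage cost for one month, in USD."""
    return tb_stored * STORAGE_USD_PER_TB_MONTH


def operations_cost(operation_count: int) -> float:
    """Cost of the given number of billed operations, in USD."""
    return operation_count / 10_000 * OPERATION_USD_PER_10K


print(f"10 TB stored for a month: ${monthly_storage_cost(10):.2f}")
# Hypothetical example count, chosen only to show the scale:
print(f"1,000,000 operations:     ${operations_cost(1_000_000):.2f}")
```

The point of the sketch is that at these rates a million operations cost several times more than a month of storing 10 TB.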

But Duplicacy seems to cause far too many individual file operations, which are very expensive. Duplicacy also seems to cause a surprisingly large amount of download activity from the storage, which is very expensive as well ($160 USD per TB).

Before using Duplicacy, I was using Arq 6, with the exact same storage backend. Arq caused far fewer operations, and no “download” at all, so Arq was very cheap on the storage backend.

Here’s some data:

Duplicacy has only finished around 1/3 of the initial backup so far, so around 3.3 TB.

In just this month (December), during which I have only used Duplicacy, there have been 1.4 million “operations” on the cloud storage, costing $72.2 USD. Additionally, Duplicacy seems to have downloaded a total of 282 GB from the storage, which cost $50 USD. On top of that, around $5 USD for storage, which is fine, and cheap.

But the 1.4 million operations, combined with the download of 282 GB of files, cost a total of $122 USD! There is no way I can afford this going forward; this is just 1 month.

In comparison, here are the stats for when I used Arq 6 in the months before. Also consider that my Duplicacy backup has only uploaded around a third of what my Arq 6 backup uploaded, as the Arq 6 backup fully finished in that time!

In the months from May till November, Arq 6 caused 423,000 “operations”, costing a total of $22 USD. Additionally, Arq 6 used a total of 0.49 GB of “download”, costing $0.02 USD.

I did start a new backup with Arq 6 in that timeframe, same as I did with Duplicacy now. So it’s a very fair comparison between Duplicacy and Arq 6. I have never tried to restore any data, neither with Arq 6 nor with Duplicacy. Arq 6 has caused me a cost of $22 USD in file operations over the course of 6 months, while Duplicacy has caused me a cost of $122 USD within just 1 month.

Duplicacy is obviously completely unsustainable in this case. Is there anything that can be done to improve that?

I guess I could try to increase the chunk size? I am using the defaults currently. I do not care about deduplication at all; my data is all fairly unique anyway. I just want to get Duplicacy to cause fewer file operations, and also not download anything unless I actively restore data. Is that possible? What do I have to change?

Duplicacy was not designed to support cold storage, mostly because of the need to download metadata chunks even for non-restore operations. On top of that, for each chunk to be uploaded Duplicacy needs to check if the chunk already exists in the storage, which means 2 operations for each upload.

You can reduce the number of operations by using a larger chunk size, but if I were you I would stay away from the Archive storage class. If you ever need to restore all of your 10 TB of data, you’ll need to pay $1,600 for the download.
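As a rough illustration of why a larger chunk size helps, the sketch below estimates the upload operations for an initial backup, assuming 2 operations per chunk as described above and the $0.50 per 10,000 operations price quoted earlier in the thread. The average chunk sizes (4, 16, 64 MB) are hypothetical values picked for comparison. Real chunks vary in size, and metadata chunks and listing calls are ignored here.

```python
import math

OPERATION_USD_PER_10K = 0.50  # price quoted earlier in the thread
OPS_PER_CHUNK = 2             # existence check + upload, per the reply above


def estimate_upload_operations(data_tb: float, avg_chunk_mb: float) -> int:
    """Rough operation count for an initial upload (1 TB = 1,000,000 MB)."""
    chunks = math.ceil(data_tb * 1_000_000 / avg_chunk_mb)
    return chunks * OPS_PER_CHUNK


def operations_cost(operation_count: int) -> float:
    """Cost of the given number of billed operations, in USD."""
    return operation_count / 10_000 * OPERATION_USD_PER_10K


for chunk_mb in (4, 16, 64):  # hypothetical average chunk sizes
    ops = estimate_upload_operations(10, chunk_mb)
    print(f"{chunk_mb:>3} MB chunks: {ops:>9,} operations, ${operations_cost(ops):,.2f}")
```

Under these assumptions the operation count scales inversely with the average chunk size, so quadrupling the chunk size cuts the upload operations to roughly a quarter.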

Thanks for the reply.

Download is not only expensive with cold storage though. Yes, the $160 USD per TB of Archive storage is the most expensive it gets, but even the most “hot” storage that Google Cloud Storage offers still costs $110 USD per TB to download (egress): Cloud Storage pricing | Google Cloud

The additional $50 USD per TB for Archive storage on top is not the problem; the problem is that downloads are always expensive.

And this isn’t just expensive on Google Cloud Storage: with regular Amazon S3, downloads also cost $90 USD per TB. I think Backblaze B2 is the only storage backend that offers quite cheap downloads at only $10 USD per TB (or is there any other?). But it would be weird to pick a storage backend by which one offers the cheapest downloads, if the goal is to back up data for a long time without ever really downloading it again… The main consideration should be the cost of storing data, while using efficient software that minimizes operations and downloads.
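For reference, a quick sketch of what a one-time full restore of 10 TB would cost at the egress prices quoted in this thread. The prices are taken from the posts above as-is, not re-checked against current price lists, and the dictionary and function names are just for illustration.

```python
# Egress prices in USD per TB, as quoted in this thread (not verified here).
EGRESS_USD_PER_TB = {
    "GCS Archive": 160,
    "GCS Standard": 110,
    "Amazon S3": 90,
    "Backblaze B2": 10,
}


def restore_cost(tb_to_download: float, usd_per_tb: float) -> float:
    """Cost in USD of downloading the given amount of data once."""
    return tb_to_download * usd_per_tb


for provider, price in EGRESS_USD_PER_TB.items():
    print(f"{provider:13s} ${restore_cost(10, price):>8,.2f}")
```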

Is it not possible for Duplicacy to cache some more data locally, similar to how Arq 6 works (I assume)? If Arq 6 can back up without ever downloading anything, shouldn’t Duplicacy also be able to do so?

Would using a bigger chunk size with Duplicacy also reduce the amount of data it downloads, or would that only have an effect on the number of operations it’s causing?

Wasabi might be an option for you to consider: its pricing is based on the storage used and there are no egress or API fees (at the time of writing). There is some “small print” about how deleted data is handled, which does affect costs when using Duplicacy but at least in my case has not proved to be too significant.

Yes, Wasabi is an option, I have used it in the past.

I think the two best options are:

Wasabi:

  • storage: $5.99 / TB / month
  • egress: no fees
  • API requests: no fees

Backblaze B2:

  • storage: $5.00 / TB / month (~20% cheaper)
  • egress: $0.01 / GB
  • API requests: some, but the most common transactions for backup are not charged.

Note: on B2 there is the option to download free of charge via Cloudflare, with some limitations.

Today, considering that the monthly cost of storage is the main one, and hoping I’ll not need to download, I use B2, which has a lower monthly cost.
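A small sketch of that trade-off for a 10 TB backup, using only the prices listed above. API request fees and Wasabi’s deleted-data “small print” are left out, and the structure and names are illustrative assumptions.

```python
# Prices as listed in the post above; API fees and deletion policies ignored.
PLANS = {
    "Wasabi": {"storage_usd_per_tb_month": 5.99, "egress_usd_per_gb": 0.0},
    "Backblaze B2": {"storage_usd_per_tb_month": 5.00, "egress_usd_per_gb": 0.01},
}


def monthly_storage(plan: dict, tb_stored: float) -> float:
    """Monthly storage cost in USD."""
    return plan["storage_usd_per_tb_month"] * tb_stored


def full_restore(plan: dict, tb_stored: float) -> float:
    """One-time cost in USD of downloading everything (1 TB = 1,000 GB)."""
    return plan["egress_usd_per_gb"] * tb_stored * 1000


for name, plan in PLANS.items():
    print(f"{name:13s} storage ${monthly_storage(plan, 10):.2f}/month, "
          f"full restore ${full_restore(plan, 10):.2f}")
```

At these prices B2 saves about $10 a month on 10 TB, while a single full restore would cost about $100, so B2 comes out ahead as long as full restores are rare.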

Duplicacy does cache metadata chunks locally. If you only run backups and never clean the cache manually it should not download any chunks.

The main design goal of Duplicacy was to support cross-client deduplication, which Arq can’t do. It is possible to handle metadata chunks differently in order to support cold storage, but this was not considered in the early stages.

  1. That sounds interesting. I did delete my cache once manually, as that was suggested to fix the “missing chunks” issue. But the cache was definitely not 282 GB in size, and Duplicacy has caused 282 GB in downloads, so something doesn’t seem right there I think. Should Duplicacy really not cause any download except for when some metadata is missing from the local cache?

  2. If I’m understanding you correctly, you’re saying that if I never manually delete the cache, and if I use a chunk size that’s the same size as Arq 6 uses for its chunks, then the only difference in cost on the storage backend would be that Duplicacy needs 2 operations for each upload, where Arq 6 might only need 1? Correct?

  3. And that “check if the chunk already exists in the storage” cannot be done based on locally cached metadata to avoid the second operation? I understand that would be needed if I cared about cross-client deduplication, but I just have one PC, so I only care about backing up from this one PC. It sounds like it should be easy to disable the existence check against the storage backend when the user does not use any cross-client deduplication, since the locally cached metadata should then have all the info that’s needed?

You also run the check operation, which needs to look at (partial) metadata chunks of every revision.

That is correct, if you only run backups, not other operations like check, prune, and copy.

You can benefit from cross-client deduplication when you back up multiple directories on the same PC to the same storage location. Duplicacy is based on a database-less architecture, which I believe is far superior to the conventional approach adopted by Arq and others. It has not been optimized for cold storage yet, but it will be in the future.

Ok, interesting, thanks!

Then I guess I’ll be looking forward to the update that optimizes it more for cold storage.

Until then, I’ll switch from “Google Cloud Storage Archive” to “Google Workspace Enterprise Standard”, which offers unlimited cloud space for $20 a month. That seems to be the cheapest way to store 10 TB of data without having API operation costs and download costs. It seems like a better offer to me than Wasabi or Backblaze B2, as there I’d pay at least $50 or $60 a month rather than just $20 a month. @towerbr @IanW