Should I change the default minimum, average and maximum chunk size (includes existing chunk analysis / Storj)?

UserDup · 21 November 2022 11:47

I have carried out a quick analysis of the file sizes of my existing chunks going back years, and the spread can be found below (approx. 100,000 chunks). I am surprised by the number of small chunks.

As my possible future storage provider (I will probably start fresh, so I can change chunk sizes) puts the same “cost” on files up to 64MB, I am wondering if I should be increasing the three default chunk sizes, and what I should be changing the sizes to?

I believe this could especially be important as there is also an additional limit on the total number of smaller files that can be stored.

I would say I am reasonably typical: the backups include documents and various data, and everything goes into a strict, tidy directory structure.

There are a few 50-100MB files but most are just small documents/data that do not change but every so often a new document may be changed/added in a directory.

Thanks

Chunks
<1MB         19.22% - even though the minimum is 1MB (default) there are many chunks smaller than this
1MB-2MB      21.80%
2MB-3MB      15.90%
3MB-4MB      11.40%
4MB-5MB       8.06%
5MB-6MB       5.76%
6MB-7MB       4.89%
7MB-8MB       3.05%
8MB-9MB       2.24%
9MB-10MB      2.32%
10MB-11MB     1.40%
11MB-12MB     1.07%
12MB-13MB     0.68%
13MB-14MB     0.56%
14MB-15MB     0.49%
15MB-16MB     0.52%
>16MB         0.66% - even though the maximum is 16MB (default) there are chunks bigger than this
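
A minimal sketch of how a spread like the one above could be produced, assuming the chunk files sit in a directory on a local disk (for a cloud storage the sizes would have to come from a listing instead). The function name and the 1MB bucket width are illustrative choices, not part of Duplicacy.

```python
import os
from collections import Counter

MB = 1024 * 1024

def chunk_size_histogram(chunks_dir, max_buckets=16):
    """Return {bucket_label: percent} for every file found under chunks_dir."""
    counts = Counter()
    total = 0
    for root, _dirs, files in os.walk(chunks_dir):
        for name in files:
            size = os.path.getsize(os.path.join(root, name))
            # Files of max_buckets MB or more share a single overflow bucket.
            counts[min(size // MB, max_buckets)] += 1
            total += 1
    result = {}
    for index in sorted(counts):
        if index == max_buckets:
            label = f">={max_buckets}MB"
        else:
            label = f"{index}MB-{index + 1}MB"
        result[label] = 100.0 * counts[index] / total
    return result
```

Calling `chunk_size_histogram("/path/to/storage/chunks")` (a placeholder path) returns the percentage of files in each 1MB bucket, which can then be printed in the layout used above.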

gchen · 22 November 2022 04:16

Chunks are compressed so their sizes are reduced. If a file is larger than 16MB, it could be that the content isn’t compressible at all (video file?) and there is a header which adds dozens of bytes. In any case it should not be too much larger than 16MB.

UserDup · 22 November 2022 23:46

Thanks. If they are compressed, that would explain why the average file size of my chunks seems to be much less than the 4MB mark. A good portion of my backup must be compressible. There are a few zip files in there, so that would explain the few chunks slightly larger than 16MB.

Has anyone got any pointers on changing the chunk size?
As it appears my data can be compressed, I am thinking that doubling the chunk size from 4MB to 8MB should be reasonably risk free?
Not sure if this will help with increasing the overall file size of chunks though, especially if the data is compressed after the chunking?
Is there a command that can output the current chunk size settings?

Thanks

saspus · 23 November 2022 00:56

init command manual:

   -chunk-size, -c <size>          the average size of chunks (default is 4M)
   -max-chunk-size, -max <size>    the maximum size of chunks (default is chunk-size*4)
   -min-chunk-size, -min <size>    the minimum size of chunks (default is chunk-size/4)
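
The defaults quoted above can be worked through with a small sketch. It only applies the two rules stated in the manual excerpt (minimum is the average divided by 4, maximum is the average times 4); the function name is made up for illustration.

```python
MB = 1024 * 1024

def derived_chunk_bounds(average_bytes):
    """Apply the documented defaults: min = average / 4, max = average * 4."""
    return average_bytes // 4, average_bytes * 4

for average_mb in (4, 16, 64):
    minimum, maximum = derived_chunk_bounds(average_mb * MB)
    print(f"average {average_mb}M -> min {minimum // MB}M, max {maximum // MB}M")
```

With the default 4M average this gives the 1M minimum and 16M maximum mentioned earlier in the thread. Under the same rules a 64M average would imply a 256M maximum, so keeping every chunk at or below 64MB would presumably need the maximum to be set explicitly.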

saspus · 23 November 2022 01:06

This is highly unusual. What type of data do you have?

Generally, for most users, bulk of data is immutable and incompressible, and small amount of data is highly volatile and compressible (think family photos and videos vs excel spreadsheets). (Unless you are dealing with application-specific sparse files of course)

For the former, perhaps archiving the media would be a better approach: the only reason you want versioning for that is to be able to recover from bit rot and user error, but if you simply copy that data into e.g. a huge disk image and upload it as is, you eliminate that risk and avoid slicing the data into chunks altogether. For the rest, duplicacy with default chunking should do.

Another approach would be to crank up duplicacy chunking to fixed-size 64MB chunks (depending on the type of data you back up, this may be better from an overall performance perspective; see for example Vertical Backup)

Or maybe it’s not worth optimizing at all. It all depends on types and amounts of your data, backup frequency, turnover rate, etc

saspus · 23 November 2022 01:12

You mean the additional fixed “segment” fee. You still pay according to the actual amount of data stored: 32MB will cost half of what 64MB costs. However, storing each piece involves some amount of overhead regardless of the piece size, and the per-segment fee accounts for that. It’s negligible if you store large pieces, but if you store millions of files only a few bytes long, that overhead comprises the bulk of usage. In other words, there is a difference between storing a terabyte in 4K pieces vs 64M pieces: the former wastes a lot more resources.

Ultimately, for most users that cost is not significant enough to bother.

sevimo · 23 November 2022 01:53

You want versioning for media if you frequently reorganize your folder structure and/or modify tags. Both are not uncommon with large media collections.

saspus · 23 November 2022 02:28

I’d argue this is a job for media library software. Why would you modify or move the media files themselves? Once they are dumped to a folder, it’s done, they are indexed and stay there forever. The media library can present you different slices from there. Furthermore, tags should be stored in the library along with the rest of the metadata, not in the source media. (There is no reason to modify a 50GB video file to add a video caption: cramming partial functionality of the media library into the filesystem sounds like an uphill battle.)

Once we agree that media is immutable — versioning is not necessary.

sevimo · 23 November 2022 14:57

I think we all know that you believe that there is a single right way on how to do things, and nothing else should be supported. Many people, however, disagree with that approach. I’ll leave it at that.

UserDup · 23 November 2022 15:54

saspus:

init command manual:

   -chunk-size, -c <size>          the average size of chunks (default is 4M)
   -max-chunk-size, -max <size>    the maximum size of chunks (default is chunk-size*4)
   -min-chunk-size, -min <size>    the minimum size of chunks (default is chunk-size/4)

Thanks. I was looking more for pointers on whether I should be adjusting the chunk size, rather than how to set a different chunk size.

This backup is “important” stuff: documents, desktop, data dumps and a few really important apps/zips. Probably everything apart from the zips is quite compressible.

UserDup · 23 November 2022 16:37

I am getting concerned that Storj may not be the answer. I am probably really misunderstanding!

Looking at the rclone documentation for Storj, it enforces a minimum chunk size of 64MB (I do not know if this is before or after compression). There also seems to be a choice between an S3 and a Native API upload option, and the Native API (which I think Duplicacy uses) actually uploads 2.7GB for every 1GB of file size! My upload speed is not the fastest but it is acceptable, but uploads taking 2.7x longer? I have no idea whether this means I will also be using 2.7x the storage, or what is erasure-coded locally!

Additionally, if you have many files you have a limited segment allowance. Am I going to get to a stage a few months down the line where Storj will not allow any further uploads, even on a paid tier (which I probably will not need for many months)? The free tier mentions 10K segments/month, which doesn’t sound like much?

I would need 100,000 to support my current backup revisions, so I will have to start a new backup as it is. I was thinking that if I can change the chunk size to better suit Storj I should do so, but now I am thinking Storj is too good to be true.

Thanks

saspus · 23 November 2022 18:26

Which means there is a relatively small amount of this data, and hence the difference will probably be too small to be worth worrying about.

saspus · 23 November 2022 18:57

Maybe :). I don’t use storj for backup myself. In spite of rumors, I do think there can be more than one solution, and backing up to hot storage, while possible, is counterproductive, be that B2, AWS S3 Standard, or storj.

Storj does not enforce anything. If the file is larger than 64MB it is split into multiple segments. If smaller, it is stored as is. It’s an implementation detail.

Correct. Each piece gets erasure coded and uploaded to 80 separate nodes across the world. You do use more bandwidth but accomplish very high throughput. Remember, storj is hot storage and optimized for performance and resilience, not cold data storage.

You could use the S3 endpoint and avoid that egress amplification, but then the S3 gateway becomes a bottleneck. Again, that is probably OK for backup, but it has other drawbacks. Or you can use both: do the initial upload with S3 and then delete the credentials. You’ll need to ensure that you use the same passphrase.

If you upload 1GB it takes 2.7GB to store on the network and you pay for storing 1GB. It is no different from writing data to a RAID1 array: to store each gigabyte you need 2GB worth of disk space.

Without specific numbers it’s impossible to decide: how much does the data change daily, vs what is your upstream bandwidth? Maybe all the differences are uploaded in 5 seconds, so making it even one minute won’t matter.

Again, let’s talk specific numbers. Here is how segments work: Usage Limit Increases - Storj DCS Docs

The limit is there so that you do not accidentally end up with millions of segments and a large fee. You can ask to remove the limit. But for most users this won’t be a problem.

Also that page explicitly says this:

For most users and most usage patterns, we do not expect a Per Segment Fee to be charged. Only when large numbers of segments are stored relative to a disproportionately small amount of data do we expect there to be a Per Segment Fee. Only use cases with large numbers of very small files or large numbers of very small Multipart Upload Parts are expected to be subject to the Per Segment Fee.

(Maybe storj forums will be able to answer these questions better?)

Set the chunk size to 64MB if you want to further optimize.

How did you arrive at this number?

Let’s consider a 32MB average chunk size. 100k segments translates to about 3TB of data. It costs $12 to store 3TB, plus a $0.88 segment fee (edit: which it looks like won’t be charged anyway). If you pack into 64MB blocks, the segment fee is $0.44. Under intended use the segment fee is insignificant.
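
The arithmetic in the previous paragraph can be reproduced with a short sketch. The two rates are only those implied by the figures in the post ($12 for about 3TB, $0.88 for 100k segments), not values taken from a price list, and the code assumes one segment per chunk, which holds for chunks of 64MB or less.

```python
# Rates implied by the figures in the post, not taken from a price list.
STORAGE_USD_PER_TB_MONTH = 12 / 3        # $12 for about 3TB
SEGMENT_USD_PER_MONTH = 0.88 / 100_000   # $0.88 for 100k segments

def monthly_cost(total_gb, chunk_mb):
    """Return (storage_usd, segment_fee_usd), assuming one segment per chunk."""
    segments = -(-total_gb * 1000 // chunk_mb)  # ceiling division, decimal units
    storage = total_gb / 1000 * STORAGE_USD_PER_TB_MONTH
    return storage, segments * SEGMENT_USD_PER_MONTH

for chunk_mb in (32, 64):
    storage, fee = monthly_cost(3200, chunk_mb)
    print(f"{chunk_mb}MB chunks: storage ${storage:.2f}, segment fee ${fee:.2f}")
```

Using 3200GB (100k chunks of 32MB, which the post rounds to 3TB), the segment fee comes out at roughly $0.88 for 32MB chunks and $0.44 for 64MB chunks, a small fraction of the storage cost in both cases.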

To reiterate, each solution has its pros and cons. For storj it’s decentralization and performance. I’d argue that for backup you want the former but don’t care about the latter. I’m not trying to convince you to use storj, because I believe hot storage for backup is a bad idea regardless of the provider; I think Amazon instant archival tiers are much better suited. Unfortunately duplicacy only supports hot storage and instant retrieval tiers of archival storage.

UserDup · 24 November 2022 22:35

This was in reference to rclone having storj support. In rclone documentation it says it enforces 64MB chunks for storj. This seems to indicate there is a reason why I may need to do the same in Duplicacy?

Getting a better average chunk size of 1-2MB which is what I am probably averaging at the moment is something I need to look at?

To clarify, as I am still unsure: Duplicacy runs a backup and creates a series of chunks, asks storj where to store them, storj replies with a list of IP addresses, and then Duplicacy uploads to those servers from my PC?

I just thought it would get uploaded once to storj and then storj would handle the distribution to other “nodes” in their network.

I have 100,000 chunks currently averaging 1-2MB; assuming the best case of 2MB, this means the maximum amount of data for my set of files would be 200GB? I think I need to do something with the chunk sizes. Looking at the chunks, I have hundreds that are <5KB!
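
A rough sketch of the estimate above, plus what a larger average chunk size might do to the chunk count. The second function assumes the chunk count scales inversely with the configured average size and that the existing storage uses the default 4MB average; both are simplifying assumptions for illustration, and real results depend on the data.

```python
def estimated_stored_gb(chunk_count, avg_stored_mb):
    """Upper-bound estimate of stored data from chunk count and average size."""
    return chunk_count * avg_stored_mb / 1000

def rechunked_count(chunk_count, old_avg_mb, new_avg_mb):
    """Estimate the chunk count after changing the configured average size.

    Assumes the count scales inversely with the average chunk size.
    """
    return round(chunk_count * old_avg_mb / new_avg_mb)

print(estimated_stored_gb(100_000, 2))   # 200.0
print(rechunked_count(100_000, 4, 64))   # 6250
```

Under these assumptions the same data would need on the order of 6,000 chunks at a 64MB average rather than 100,000.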

I am not sure increasing the chunk size will help though: if Duplicacy compresses the chunks, there is a reasonable chance I am still going to have really small chunks being uploaded to the storage?

Is there any documentation on how Duplicacy dedupes and chunks up data? I have no idea whether I should just be setting a higher chunk size and letting the /4 and *4 apply for the minimum and maximum, or what? Am I going to completely break deduplication, meaning I use significantly more cloud storage?

Have you experience of any other storage providers that may be worth looking at for Duplicacy backups? I am still unsure but storj is still high on my list. Price seems good and decentralization is great.

Thanks

saspus · 24 November 2022 23:39

I don’t see it. Can you share the link? Rclone can’t enforce anything, it’s a sync tool. It just copies files.

I would say duplicacy creates a bunch of files and tells the remote to store them. In the case of storj, duplicacy passes the file to the storj library, which takes the file and uploads it. How it does that is beyond duplicacy’s knowledge. Ultimately, yes, your computer makes multiple concurrent connections to the storj nodes and uploads erasure-encoded pieces. This decentralization is what provides ridiculous performance and resilience.

This would be a bottleneck, and this is precisely what the Storj-maintained S3 gateway does. It creates a single point through which your traffic goes, and that limits throughput. Also, your client will end up uploading to the fastest nodes, while a centralized solution would not.

Anyway, it’s an implementation detail, and does not matter from duplicacy standpoint.

That size is quite small. You should be able to tweak it to at least an order of magnitude larger.

The limit is there as a failsafe. You can ask them to remove it for your account. But ultimately, the problem is that the files are too small. What if you configure duplicacy to use 64MB fixed-size chunks?

This is a good question: is the specified chunk size before or after compression? @gchen? I think it would not make sense if it was before, but let’s confirm.

Search the forum for “IEEE paper”. You’ll find the link to pdf.

Obviously, there is some optimum value for your specific dataset somewhere between 1byte and 2TB chunk size. Whether it makes sense to waste time finding it – is another issue. Set fixed chunk size 64MB, or variable chunk size, e.g. from 32MB to 64MB, and see how it works for your dataset, and more importantly, whether it is good enough.

Why would it affect deduplication in any significant way?

Yes, I have used a few:

Google Workspace storage: slow, high latency, but unlimited storage at fixed prices, and therefore essentially free.
OneDrive: slow, only tolerates 2 threads, limited storage, expensive.
Dropbox – similar to the two above.
B2: Hot storage, and a relatively cheap one at that, but hot storage is overkill. Known to occasionally have weird bugs, like returning bad data via API, which is not tolerable at this stage of service maturity. They have two datacenters, provide 10Mbps per stream, and operate somewhat similarly to what we discussed above, where the client uploads data directly to endpoints provided by the load balancer; they also provide an S3 endpoint for compatibility, which does everything at the expense of performance.
AWS S3 – this is what I use. Not the Standard tier, which is hot and expensive, but archival ones. Duplicacy only supports tiers with instant retrieval, so for now I’m using a different solution that supports AWS Glacier Deep Archive. When duplicacy supports it too, I will maybe switch back. Today I’m paying about $3/month in AWS fees backing up about 2.5TB of data from my Mac. A full restore will cost me way more than that, but it’s an insurance policy. Small restores, a few files here and there (up to 100GB/month), are free.
Google Cloud Storage – more expensive than AWS, and with some unusual cost structure. I did not have enough patience to try to figure it out.

So, back to my point – using hot storage for backup is wasteful. If cost is a concern, hot storage should not even be on the list of contenders.

If you find hot storage with a good enough price that you don’t mind paying, sure, use hot storage. For example, if I used B2 my backup would cost me $12/month or so. $12 is 400% of $3, but both numbers are too low to make a big deal over. If you had 100TB to store, it would be $500 vs $100, and in that case saving $400/month might be worth considering a switch.

In summary, with a small amount of data (e.g. under 10TB) I don’t think price should be deciding factor, because 10x of a very small number is still a very small number. Instead, pick solution by any number of other reasons:

performance
architecture and ideas behind it
ease of use
quality of support

This was driving my choice of AWS Glacier Deep Archive for backup and Storj to store day to day data. Both also happen to be the cheapest in their respective classes, but that wasn’t the goal, it just turned out so.

UserDup · 26 November 2022 11:11

Amazon S3 in the Backend Quirks section

saspus:

I would say duplicacy creates a bunch of files and tells the remote to store them. In the case of storj, duplicacy passes the file to the storj library, which takes the file and uploads it. How it does that is beyond duplicacy’s knowledge. Ultimately, yes, your computer makes multiple concurrent connections to the storj nodes and uploads erasure-encoded pieces. This decentralization is what provides ridiculous performance and resilience.
This would be a bottleneck, and this is precisely what the Storj-maintained S3 gateway does. It creates a single point through which your traffic goes, and that limits throughput. Also, your client will end up uploading to the fastest nodes, while a centralized solution would not.
Anyway, it’s an implementation detail, and does not matter from duplicacy standpoint.

Out of interest do you happen to know why it is “only” 2.7x and not the logical 80x (as 80 pieces)?

gchen has already posted (quoted below), so it seems to be the chunk size before compression; otherwise something is not working, going off my final chunk sizes.
I agree: I wrongly assumed it was after compression, but by the looks of things it isn’t, which is why I am partially confused.
I guess having to compress everything multiple times to try to hit a target size is “impossible”, as you do not know how much it will compress?
Perhaps the help could be updated to say more explicitly that the chunk size targets the uncompressed chunk size?

I am going to run a few tests. They will be specific to my data, but if they make interesting reading I will post my results.

Chunk size details gives the impression increasing chunk size would negatively impact deduplication (quite severely depending on how you read the details)?

saspus:

Yes, I have used a few:

Google Workspace storage - OneDrive - Dropbox - B2 - AWS S3- Google Cloud Storage
So, back to my point – using hot storage for backup is wasteful. If cost is a concern, hot storage should not even be on the list of contenders.
If you find hot storage with a good enough price that you don’t mind paying, sure, use hot storage. For example, if I used B2 my backup would cost me $12/month or so. $12 is 400% of $3, but both numbers are too low to make a big deal over. If you had 100TB to store, it would be $500 vs $100, and in that case saving $400/month might be worth considering a switch.
In summary, with a small amount of data (e.g. under 10TB) I don’t think price should be deciding factor, because 10x of a very small number is still a very small number. Instead, pick solution by any number of other reasons:

performance - architecture and ideas behind it - ease of use - quality of support
This was driving my choice of AWS Glacier Deep Archive for backup and Storj to store day to day data. Both also happen to be the cheapest in their respective classes, but that wasn’t the goal, it just turned out so.

Great summary.

So if I can get my head around the chunk size issues, the 2.7x time cost for uploading (backup takes longer uploading), everything is double encrypted (backup takes longer due to CPU usage) and assuming only hot storage solutions Storj still seems to be a “best” choice (150GB free and lower cost than most others for additional storage and you get great decentralization and data protection)?

Answering my own question, but in case anyone is searching: the following will display the three chunk size settings

duplicacy -debug list

@saspus thanks for your continued patience and help.

saspus · 26 November 2022 20:04

That has nothing to do with how the data is stored; it only concerns uploading and downloading large files to/from S3 to improve performance. There is no reason to do this with storj.

In fact, rclone supports Storj directly, so that S3 page is irrelevant.

Read about Reed-Solomon erasure coding. The idea is to slice data into multiple pieces in such a way that you can lose a bunch of those pieces and still be able to reconstruct the original data. Storj figured out what the parameters of this process should be to provide the desired overall reliability, and this happened to be 80/29 or something along those lines: they create 80 pieces, any 29 of which are enough to restore the data. Hence the overhead.
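
As a quick check of the figures above, here is a minimal sketch of the expansion factor for an erasure code that creates 80 pieces of which any 29 suffice. The 80/29 parameters are the approximate ones quoted in this post; the function names are illustrative.

```python
def expansion_factor(total_pieces, required_pieces):
    """Bytes uploaded per byte of original data for an n-of-k erasure code."""
    return total_pieces / required_pieces

def piece_size_mb(segment_mb, required_pieces):
    """Each piece holds 1/k of the segment, not a full copy of it."""
    return segment_mb / required_pieces

print(f"{expansion_factor(80, 29):.2f}x")          # 2.76x
print(f"{piece_size_mb(64, 29):.2f}MB per piece")  # 2.21MB per piece
```

This is why the overhead is about 2.7x rather than 80x: each of the 80 pieces is only about 1/29 of the segment, so the total uploaded is 80/29 of the original size.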

Duplicacy supports something similar to protect against rotting media because some people store backups on a single hdd that develops bad sectors and rots.

It’s relative. Deduplication ratio of 1.001 is 100x worse than 1.1 and both are still negligible.

Or it may improve overall storage efficiency by reducing amount of metadata.

This strongly depends on type and amounts of your data.

Set the chunk size to 60MB (to ensure you are under the 64MB threshold) and see how it goes. It may be good enough. Your time fine-tuning this is money, and you have already spent way more than you could save by tweaking this further.

This is irrelevant as long as any new data you created that day manages to upload that day. There is no hurry. In fact, I always CPU-throttle duplicacy: I don’t want it to run at full speed, I want it to slowly do its job in the background. Slowly. If you are talking about the initial upload, on one hand it’s irrelevant because you do it only once, and on the other hand you can use S3 to make the initial upload.

This is a ridiculously small overhead. Essentially, it’s free on modern processors that support encryption in hardware. Even if it wasn’t, you would not notice a difference: other things, like compression and filesystem latency, are much larger.

Based on the justification, you meant to say “cheapest”. If you are looking for “cheapest” then any hot storage cannot be best for backup.

Decentralization and data protection: you can get these with other vendors too. For example, Amazon offers multi-region storage tiers. Or you can, say, back up to a datacenter in Ireland while living in the US, a totally different continent. This backup scenario does not really benefit from storj’s decentralization because you are reading and writing from one place. And in terms of durability, there are various tiers from other providers too.

My point is: do you have enough data to justify this penny pinching? Do you have 100TB of stuff to back up? If not, and you are like the majority of users, time spent deliberating “the very best” approach brings diminishing (and fast!) returns.

And if we go the route of optimizing the cost, then the cost of storage must be optimized, not the cost of restore or performance. And this brings us back to archival tiers.

UserDup · 26 November 2022 20:34

@saspus thanks again for your help.

Droolio · 27 November 2022 15:54

Chunk size needs to be a power of 2, so the next lowest would be 32MB.
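
Taking the power-of-2 statement above at face value, the largest allowed size at or below a target can be found with a small sketch like this (the function name is made up for illustration):

```python
def largest_power_of_two_at_most(value):
    """Return the largest power of two that does not exceed value."""
    if value < 1:
        raise ValueError("value must be at least 1")
    return 1 << (value.bit_length() - 1)

print(largest_power_of_two_at_most(60))  # 32
print(largest_power_of_two_at_most(64))  # 64
```

So a 60MB target would fall back to 32MB, while 64MB is itself a power of 2.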

However, @UserDup may as well just target 64MB - in my experience, most variable chunks compress lower than the max-size default, and with fixed size chunking - while you will get a number of chunks above the set chunk size - again, most will be under.

Halving the size to 32MB only guarantees it uses twice as many Storj segments (as small a cost as that may be). If per segment fee is a fixed cost regardless of size, and the goal is to use the least amount of segments, using 64MB chunks will get you the closest IMO.