I don’t see it. Can you share the link? Rclone can’t enforce anything; it’s a sync tool that just copies files.
I would say duplicacy creates a bunch of files and tells the remote to store them. In the case of storj, duplicacy passes the file to the storj library, which takes the file and uploads it. How it does that is beyond duplicacy’s knowledge. Ultimately, yes, your computer makes multiple concurrent connections to the storj nodes and uploads erasure-coded pieces. This decentralization is what provides the ridiculous performance and resilience.
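For intuition, here is a rough conceptual sketch of that pattern in Python. This is not the storj uplink library or its actual API; the piece counts, node names, and the erasure_encode() stand-in are placeholders. It only shows the shape of “erasure-code a segment and push the pieces to many nodes in parallel”:

```python
# Conceptual sketch only -- not Storj's real client or parameters.
# A real client Reed-Solomon-encodes each segment into n pieces, any k of
# which can reconstruct it, uploads the pieces to different nodes in
# parallel, and drops the slowest uploads once enough pieces have landed.
from concurrent.futures import ThreadPoolExecutor


def erasure_encode(segment: bytes, k: int, n: int) -> list[bytes]:
    """Stand-in for real erasure coding: stripe into k data pieces and pad
    with n - k fake "parity" pieces so the upload loop has n things to send."""
    size = -(-len(segment) // k)  # ceiling division
    data = [segment[i * size:(i + 1) * size] for i in range(k)]
    parity = [bytes(size) for _ in range(n - k)]
    return data + parity


def upload_piece(node: str, piece: bytes) -> int:
    """Placeholder for a direct upload of one piece to one storage node."""
    return len(piece)


def upload_segment(segment: bytes, nodes: list[str], k: int = 29, n: int = 80) -> int:
    # k and n are illustrative numbers, not Storj's exact defaults.
    pieces = erasure_encode(segment, k, n)
    # Many small independent uploads in parallel: no single gateway in the
    # path, so throughput scales with the number of nodes you talk to.
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(upload_piece, nodes[:n], pieces))
    return sum(results)


uploaded = upload_segment(b"x" * (64 << 20), [f"node{i}.example:7777" for i in range(80)])
```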
That would be a bottleneck, and it is precisely what the Storj-maintained S3 gateway does: it creates a single point through which all your traffic goes, which limits throughput. Also, when uploading natively, your client ends up uploading to the fastest nodes, while a centralized solution would not.
Anyway, it’s an implementation detail and does not matter from duplicacy’s standpoint.
That size is quite small. You should be able to increase it by at least an order of magnitude.
The limit is there as a failsafe; you can ask them to remove it for your account. But ultimately, the problem is that the file size is too small. What if you configure duplicacy to use 64MB fixed-size chunks?
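For example, something like this (a sketch with placeholder snapshot id and storage URL; if I remember the flags right, setting -min and -max equal to -c is what switches duplicacy to fixed-size chunking, and you can pass the byte value 67108864 if your version doesn’t accept the M suffix):

```
duplicacy init -c 64M -min 64M -max 64M <snapshot-id> <storage-url>
```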
This is a good question: is the specified chunk size measured before or after compression? @gchen? I think it would not make sense if it were before, but let’s confirm.
Search the forum for “IEEE paper”; you’ll find the link to the PDF.
Obviously, there is some optimum chunk size for your specific dataset somewhere between 1 byte and 2TB. Whether it makes sense to spend time finding it is another issue. Set a fixed 64MB chunk size, or a variable chunk size, e.g. from 32MB to 64MB, and see how it works for your dataset and, more importantly, whether it is good enough.
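The variable-size variant would look something along these lines (same placeholder caveats as the earlier example; -min and -max bound the chunker, and if I recall correctly the average -c has to be a power of two between them):

```
duplicacy init -c 32M -min 32M -max 64M <snapshot-id> <storage-url>
```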
Why would it affect deduplication in any significant way?
Yes, I have used a few:
- Google Workspace storage: slow, high latency, but unlimited storage at a fixed price, so the marginal cost of extra data is essentially zero.
- OneDrive: slow, only tolerates 2 threads, limited storage, expensive.
- Dropbox: similar to the two above.
- B2: hot storage, and a relatively cheap one at that, but hot storage is overkill for backup. Known to occasionally have weird bugs, like returning bad data via the API, which is not tolerable at this stage of service maturity. They have two datacenters, provide about 10Mbps per stream, and operate somewhat similarly to what we discussed above: the client uploads data directly to endpoints handed out by a load balancer. They also provide an S3 endpoint for compatibility, which does everything for you at the expense of performance.
- AWS S3: this is what I use. Not the Standard tier, which is hot and expensive, but the archival ones. Duplicacy only supports tiers with instant retrieval, so for now I’m using a different solution that supports AWS Glacier Deep Archive; when duplicacy supports it too, I may switch back. Today I’m paying about $3/month in AWS fees to back up about 2.5TB of data from my Mac. A full restore would cost way more than that, but it’s an insurance policy. Small restores, a few files here and there (up to 100GB/month), are free.
- Google Cloud Storage: more expensive than AWS, and with a somewhat unusual cost structure. I did not have enough patience to figure it out.
So, back to my point – using hot storage for backup is wasteful. If cost is a concern, hot storage should not even be on the list of contenders.
If you find hot storage at a good enough price that you don’t mind paying, sure, use hot storage. For example, if I used B2 my backup would cost me $12/month or so. $12 is 400% of $3, but both numbers are too low to make a big deal over. If you had 100TB to store, it would be $500 vs $100, and in that case saving $400/month might make switching worth considering.
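The back-of-the-envelope math behind those numbers, using approximate per-GB-month list prices from around the time I checked (storage only, ignoring requests, retrieval, egress, and minimum-retention charges):

```python
# Rough storage-only cost comparison; prices are approximate list prices
# per GB-month and ignore request/retrieval/egress/minimum-retention fees.
PRICE_PER_GB_MONTH = {
    "B2 (hot)": 0.005,
    "Glacier Deep Archive": 0.00099,
}

def monthly_cost(tb: float) -> dict[str, float]:
    gb = tb * 1000  # decimal terabytes, the way providers bill
    return {name: round(gb * price, 2) for name, price in PRICE_PER_GB_MONTH.items()}

print(monthly_cost(2.5))  # ~ {'B2 (hot)': 12.5, 'Glacier Deep Archive': 2.47}
print(monthly_cost(100))  # ~ {'B2 (hot)': 500.0, 'Glacier Deep Archive': 99.0}
```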
In summary, with a small amount of data (e.g. under 10TB) I don’t think price should be the deciding factor, because 10x a very small number is still a very small number. Instead, pick a solution based on any number of other criteria:
- performance
- architecture and ideas behind it
- ease of use
- quality of support
This is what drove my choice of AWS Glacier Deep Archive for backup and Storj for day-to-day data. Both also happen to be the cheapest in their respective classes, but that wasn’t the goal; it just turned out that way.