Is duplicacy suitable for backing up 30TB of PDFs and JPEGs to Backblaze, with up to 300,000 new files every day?

Hello,

Is duplicacy suitable for backing up 30TB of PDFs and JPEGs,
with up to 200,000 new files every day
and 22 million files in total,
to Backblaze, with encryption?

How long should a daily backup take?

thx

A lot of this depends less on duplicacy and more on the rest of your setup. If you expect ~1% daily turnover, that means up to 300G of new data, and JPEGs and PDFs won't compress much. How quickly can you push 300G to Backblaze? How quickly can your local filesystem scan 22 million files?
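
If you want rough numbers before committing, duplicacy's built-in benchmark command plus a plain directory walk will give you a feel for both (the path and thread counts below are just placeholders; check the benchmark command's help for the exact options in your version):

```
# Measure raw upload/download speed to the configured storage
# (run from an initialized repository)
duplicacy benchmark -upload-threads 4 -download-threads 4

# Rough idea of how fast the filesystem can be traversed
time find /path/to/data -type f | wc -l
```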

Also, with 30T of data you’re already at the higher end of common storage setups. More importantly, how many revisions are you planning to keep? With 300G of new data per day, your storage will double in about 3 months if you keep all revisions. At those sizes, your check and prune will be pretty slow with the default chunk size, and your memory consumption will be high. Which hardware are you running it on? In any case, you may want to take a look at my PR that significantly improves performance on such large datasets.
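
For what it’s worth, both encryption and chunk size are fixed when the storage is initialized, so they’re worth deciding up front. A minimal sketch (the snapshot ID, bucket name and the 16M chunk size are just examples; the default average chunk size is 4M):

```
# Initialize a B2 storage with encryption (-e) and a larger average
# chunk size (-c); the chunk size cannot be changed later.
duplicacy init -e -c 16M mydocs b2://my-backup-bucket
```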


Hi Sevimo, thx for your answer.

PDFs and JPEGs are only added to the server and almost never deleted.
I think the volume is less than 300GB/day, but we have more and more clients.
Push to Backblaze: I'd say 2-5 hours for 300GB.
Scanning all the files is the main question. Does duplicacy have to rescan everything each time?

We rent 3 servers on OVH, with ZFS RAID 5 across 6 disks + GlusterFS on top.
CPU: AMD Ryzen 5 Pro 3600 - 6c/12t - 3.6 GHz/4.2 GHz
Storage: SATA HDD
RAM: 64GB

Your upload to Backblaze is likely to be your bottleneck. Duplicacy needs to scan the local filesystem on every backup (otherwise it won’t know which files are new/deleted/changed), but I can scan 1 million+ files on an HDD RAID in less than a minute, so that’s not really a bottleneck. Duplicacy won’t rehash files that are unchanged. Overall, backup in duplicacy is fast and efficient; you’re most likely going to be constrained by your upload.
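
The daily run itself is then a single backup command per repository; more upload threads help if a single connection to B2 doesn’t saturate your uplink (the thread count here is just an example):

```
# Incremental backup: unchanged files are skipped; only new/changed
# files are hashed, packed into chunks and uploaded.
duplicacy backup -stats -threads 8
```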

Check and prune are another story though; they need to operate on the whole storage, and that will likely be slow. You can check some benchmarks in my thread on large datasets.
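
Since your files are almost never deleted, a thinned-out retention policy is mostly about keeping the number of revisions (and thus check/prune times) under control. A sketch with purely illustrative retention numbers:

```
# -keep n:m = keep one revision every n days for revisions older than
# m days; -a applies it to all snapshot IDs.
duplicacy prune -a -keep 30:365 -keep 7:30 -keep 1:7

# Verify that every chunk referenced by the remaining revisions exists
duplicacy check -a
```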