Possibility of a hybrid chunker

Hi all,

I was modifying my filters for a backup and, by mistake, ended up excluding a large folder (~200 GB) of pictures. When I fixed the filter, the next backup produced a lot of new chunks (~40 GB).

I believe I understand why this happens and what the benefits of a pack-then-chunk approach are, especially for smaller files, compared to a fixed chunker that always restarts at file boundaries (or a hypothetical variable chunker that also restarts at file boundaries).

Now, my question is: has a hybrid approach ever been attempted, and if so, what were the real-world results?

Very hand-wavy, but this hybrid approach would still pack-then-chunk any file smaller than the minimum chunk size.

For files within the allowed chunk size range, it would either not chunk at all or restart the chunker at the file boundary.

For larger files, either pack-then-chunk or a chunker restart at the file boundary could be used.
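
To make that a bit more concrete, here is a rough Go-style sketch of the decision logic I have in mind. The names and thresholds (`minChunkSize`, `maxChunkSize`, the sizes themselves) are purely illustrative assumptions on my part, not how any existing chunker is actually implemented:

```go
package main

import "fmt"

// Hypothetical thresholds, purely for illustration.
const (
	minChunkSize = 1 << 20  // 1 MiB
	maxChunkSize = 16 << 20 // 16 MiB
)

// chunkPlan describes how a single file would be handled
// under the hybrid scheme sketched above.
func chunkPlan(fileSize int64) string {
	switch {
	case fileSize < minChunkSize:
		// Small files: keep pack-then-chunk so the repo isn't
		// bloated with one tiny object per file.
		return "append to current pack, chunk the pack"
	case fileSize <= maxChunkSize:
		// Medium files: store as a single chunk (or restart the
		// chunker at the file boundary), so chunk boundaries no
		// longer depend on whatever was packed before the file.
		return "store as one chunk / restart chunker at file boundary"
	default:
		// Large files: chunk the file on its own, again restarting
		// at the file boundary.
		return "content-defined chunking, restarted at file boundary"
	}
}

func main() {
	for _, size := range []int64{100 << 10, 4 << 20, 200 << 20} {
		fmt.Printf("%9d bytes -> %s\n", size, chunkPlan(size))
	}
}
```

The point is just that, for anything at or above the minimum chunk size, the chunk boundaries would no longer depend on what happened to be packed before the file.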

If I'm understanding it correctly, this would have allowed my large folder of pictures (each >1 MB but <16 MB) to deduplicate more efficiently, still without requiring an index or bloating the repo with a huge number of files.

Thank you for your time!

-Alex

EDIT:
Just wanted to add that the reason the large folder of pictures deduplicated so poorly is that it's program-managed storage (Apple Photos), where files are injected at random places in the folder structure over time. A manually managed folder structure should not have this issue.

See this PR:

(it references some topics here on the forum where this was discussed)


Thank you, towerbr! That does indeed sound exactly like what I was looking for.