Possibility of a hybrid chunker

Hi all,

I was modifying my filters for a backup and, by mistake, ended up excluding a large folder (~200 GB) of pictures. When I fixed the filter, the next backup produced a lot of new chunks (~40 GB).

I believe I understand why this happens and what the benefits of a pack-then-chunk approach are, especially for smaller files, compared to a fixed chunker that always restarts at file boundaries (or a hypothetical variable chunker that also restarts at file boundaries).

Now, my question is: has a hybrid approach ever been attempted, and if so, what were the real-world results?

Very hand-wavy, but this hybrid approach would still pack-then-chunk any file smaller than the minimum chunk size.

For files within the allowed chunk size range, it would either not chunk at all or restart the chunker at the file boundary.

For larger files, either pack-then-chunk or a chunker restart at the file boundary could be used.
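
To make that a bit more concrete, here is a rough Go-style sketch of the decision logic I have in mind. The names and thresholds (`minChunkSize`, `maxChunkSize`, the sizes themselves) are purely illustrative assumptions on my part, not how any existing chunker is actually implemented:

```go
package main

import "fmt"

// Hypothetical thresholds, purely for illustration.
const (
	minChunkSize = 1 << 20  // 1 MiB
	maxChunkSize = 16 << 20 // 16 MiB
)

// chunkPlan describes how a single file would be handled
// under the hybrid scheme sketched above.
func chunkPlan(fileSize int64) string {
	switch {
	case fileSize < minChunkSize:
		// Small files: keep pack-then-chunk so the repo isn't
		// bloated with one tiny object per file.
		return "append to current pack, chunk the pack"
	case fileSize <= maxChunkSize:
		// Medium files: store as a single chunk (or restart the
		// chunker at the file boundary), so chunk boundaries no
		// longer depend on whatever was packed before the file.
		return "store as one chunk / restart chunker at file boundary"
	default:
		// Large files: chunk the file on its own, again restarting
		// at the file boundary.
		return "content-defined chunking, restarted at file boundary"
	}
}

func main() {
	for _, size := range []int64{100 << 10, 4 << 20, 200 << 20} {
		fmt.Printf("%9d bytes -> %s\n", size, chunkPlan(size))
	}
}
```

The point is just that, for anything at or above the minimum chunk size, the chunk boundaries would no longer depend on what happened to be packed before the file.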

If I'm understanding it correctly, this would have allowed my large folder of pictures (each >1 MB but <16 MB) to deduplicate more efficiently, still without requiring an index or bloating the repo with a huge number of files.

Thank you for your time!

-Alex

EDIT:
Just wanted to add that the reason the large folder of pictures deduplicated so poorly is that it's program-managed storage (Apple Photos), where files are injected at random places in the folder structure over time. A manually managed folder structure should not have this issue.

See this PR:

(it references some topics here on the forum where this was discussed)


Thank you, towerbr! That does indeed sound exactly like what I was looking for.