Backblaze B2 Duplicate Chunks

dullage · 4 March 2021 10:15

I have a single Duplicacy repo performing a daily backup to a B2 bucket. I’ve just had a look at the data in that bucket and can see over 14,000 chunks with more than one version e.g.

As there’s only the one repo backing up to this storage, I’m not sure how I’ve ended up with this.

I’ve just run duplicacy prune -exclusive -exhaustive and although it did clear some unreferenced chunks I’m still left with these hidden versions. Is there any way I can clear these up?

gchen · 4 March 2021 19:46

Can you check the prune logs to see if there are any references to that chunk? For the CLI you can find the prune logs under .duplicacy/logs but for the web GUI they are located under ~/.duplicacy-web/repositories/localhost/all/.duplicacy.

dullage · 8 March 2021 11:14

Thanks @gchen. I’ve searched all the logs we have and cannot see a reference to the example chunk above. Unfortunately, we don’t have all the logs, as some old ones have been deleted, so it may be referenced in those.

I imagine I could just update the B2 Lifecycle settings to clear these out. My (possibly overly cautious) concern is whether or not the latest version of the chunk that would be kept is OK.

I can imagine a check -chunks could verify this but that would mean downloading around 3TB of data. Given the duplicates only account for ~50GB (estimate) it’s probably best that I just leave them there and wait for my prune policy to eventually delete them as they become unrequired.

arno · 8 March 2021 17:16

You could try the B2 command line tool to download both copies of one of those duplicates and compare them.
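As a rough sketch of the comparison step: once both versions of a chunk have been downloaded to local files, a byte-for-byte check is enough to tell whether they match. The function name same_chunk and the file names are hypothetical. The download itself depends on the B2 CLI release (some releases list file IDs with b2 ls --long --versions and fetch a specific version with b2 download-file-by-id), so treat those commands as an assumption and check b2 --help for your version.

```shell
#!/bin/sh
# Hypothetical helper: compare two locally downloaded versions of one chunk.
# $1 and $2 are paths to the two downloaded files (names are illustrative).
same_chunk() {
    # cmp -s is silent and exits 0 only when the files are byte-identical
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

Usage, assuming the two versions were saved as chunk.v1 and chunk.v2: same_chunk chunk.v1 chunk.v2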

I see a few “duplicates” in my B2 buckets, but as far as I’ve found so far, the more recent copy is 0 bytes and hidden. If I remember correctly, this is how Duplicacy stages B2 chunks to be pruned.

Out of curiosity, how did you identify these 14,000 duplicate chunks and then tally that the remaining ones accounted for ~50GB? (multiplying the number by an average chunk size?)

dullage · 8 March 2021 17:39

@arno - That’s not a bad shout, I might download some and have a look.

To get the 50GB estimate I used the B2 command line tool like this: b2 ls --versions --recursive MyBucketName | uniq -cd. This produces a list of all objects and if there are multiple versions they are listed multiple times. The pipe to the Linux tool uniq counts duplicate lines (-c) and lists any that are duplicates (-d) along with a count. I simply took a rough chunk size of 4MB and multiplied it by the number of chunks listed in the output.
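The same estimate can be sketched as a small shell function. This is only an illustration: the name estimate_duplicates is made up, the 4MB average chunk size is the rough figure from the post, and each duplicated name is counted once, so it assumes one extra version per chunk.

```shell
#!/bin/sh
# Hypothetical helper: reads one object name per line on stdin (for example
# the output of `b2 ls --versions --recursive MyBucketName`) and prints how
# many names occur more than once, plus a rough size estimate in MB.
# ASSUMPTION: every chunk is about 4 MB unless another size is passed as $1.
estimate_duplicates() {
    avg_mb="${1:-4}"
    # sort first so identical names are adjacent, then keep one line per
    # duplicated name (-d) and count those lines
    count=$(sort | uniq -d | wc -l | tr -d ' ')
    echo "duplicated_names=$count approx_mb=$((count * avg_mb))"
}
```

Usage: b2 ls --versions --recursive MyBucketName | estimate_duplicates 4. For 14,000 duplicated names at 4MB each this works out to 56,000MB, which is in the same region as the ~50GB estimate.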