How to get duplicacy to respect chunk sizes?

saspus · 16 April 2024 23:02

It seems chunk sizes are ignored:

% duplicacy init --max-chunk-size=64M --chunk-size=32M test /tmp/target

% grep 'chunk-size' config
    "average-chunk-size": 33554432,
    "max-chunk-size": 67108864,
    "min-chunk-size": 8388608,

% du -hd 0 chunks
 85G	chunks

% histogram
  1k:      1
  4k:      1
 16k:      2
 32k:     18
 64k:    100
128k:    161
256k:    164
512k:    127
  1M:    251
  2M:    475
  4M:   1306
  8M:   1606
 16M:   1746
 32M:    393
 64M:     27

In this example I’m backing up /Applications/Xcode.app. Duplicacy version: 3.2.0 (981EFC)

gchen · 17 April 2024 01:54

Those file sizes are the chunk sizes after compression. Some chunks evidently compress very well.

To view the uncompressed chunk sizes, run duplicacy cat -r 1 and check the lengths array in the output.

saspus · 17 April 2024 22:21

Indeed, that worked.

  8M:   1856
 16M:   2346
 32M:   2248
 64M:   1309

For those who want to repeat the check:

duplicacy cat -r 1 | jq -r '.lengths[]' | awk '{ n=int(log($1)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTPEZY",a[2]+1,1),$2) }'
  8M:   1856
 16M:   2346
 32M:   2248
 64M:   1309

The histogram code is from the Super User question "Generate distribution of file sizes from the command prompt".
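
For anyone who would rather not chain jq and awk, here is a rough Python sketch of the same bucketing. It assumes only what the pipeline above already relies on: that `duplicacy cat -r 1` prints JSON containing a `lengths` array of uncompressed chunk sizes in bytes. The function names and the script name `chunk_hist.py` are made up for illustration. Each bucket is labelled by its power-of-two lower bound, so "8M" means 8M up to just under 16M, and anything below 1k is counted in the 1k bucket, as in the awk version.

```python
import json
import sys
from collections import Counter

# Binary unit suffixes, smallest first.
UNITS = ["", "k", "M", "G", "T", "P", "E"]


def bucket(length: int) -> int:
    """Return the power-of-two lower bound for a size, with a 1 KiB floor."""
    if length < 1024:
        return 1024
    return 1 << (length.bit_length() - 1)


def human(size: int) -> str:
    """Format a power-of-two byte count as e.g. '8M'."""
    idx = 0
    while size >= 1024 and idx < len(UNITS) - 1:
        size //= 1024
        idx += 1
    return f"{size}{UNITS[idx]}"


def histogram(lengths):
    """Count sizes per bucket; returns (bucket, count) pairs sorted by bucket."""
    counts = Counter(bucket(n) for n in lengths)
    return sorted(counts.items())


def main():
    # Assumption: stdin carries the JSON printed by `duplicacy cat -r 1`.
    snapshot = json.load(sys.stdin)
    for size, count in histogram(snapshot["lengths"]):
        print(f"{human(size):>4}: {count:6d}")


if __name__ == "__main__":
    main()
```

Run it as `duplicacy cat -r 1 | python3 chunk_hist.py`. Unlike the awk version, the bucketing uses integer bit lengths rather than floating-point logarithms, so sizes that are exact powers of two cannot land in the wrong bucket through rounding.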

ADMECA · 18 April 2024 04:54

Nice

It’s rare these days to find people who still know how to use “awk”

saspus · 18 April 2024 05:08

I stole the histogram computation code from Stack Overflow :).

But yesterday I did try to use awk, and evidently the one on macOS does not support capture groups in match patterns, while gawk requires you to write explicit match function calls there, which defeats the purpose. So I gave up, and ultimately wrote a Python script to do what I wanted 🤦

So yeah. awk is really nice as a small Turing-complete language. It gives you just enough extra control beyond sed when you want to do something nontrivial across lines. If it had proper capture group support in match patterns, it would have been perfect. I still keep trying to use it here and there, and most of the time it works perfectly. My .zshrc is full of wrappers around various awk monstrosities.
