Two problems with duplicacy: backups of read-only data and chunks too large

Hi, there are two significant problems I have with duplicacy which I think should be corrected in the upcoming overhaul.

  1. Duplicacy cannot back up a read-only directory. It’s a bit crazy that this is even an issue. Anyway, there is a workaround, which is needlessly complicated. Assume that we want to back up /ro/, which is a read-only directory. This is what we have to do:
  • Create a temporary directory tmp.
  • cd into tmp. This will be the location we back up.
  • Inside tmp, create a symlink to each and every item in the directory you actually want to back up.
  • If the source directory contains symlinks at the top level, you instead need to create symlinks that point to the same targets (duplicacy follows symlinks at the top level).
  • Run the backup.
  • Delete the temporary directory, cd .. && rm -r tmp.

Note that the -pref-dir option doesn’t help. It still tries to write to the current directory, and you’re only allowed to back up the current directory (for some reason?).

In bash, to transparently accomplish this goal of backing up a read-only directory and cleaning things up properly, we need to do something like:

  • TMP_DIR="$(mktemp -d)"
  • cd "$TMP_DIR"
  • < complicated loop to go through each item in /ro/*, and either make a symlink to it or, if the item is itself a symlink, make a symlink pointing to the same target; a sketch of this loop follows below >
  • duplicacy init ...
  • duplicacy backup ...
  • cd - 1> /dev/null
  • rm -r "$TMP_DIR"
  • unset TMP_DIR
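
A minimal sketch of that loop, assuming /ro/ is the source and we are already inside the temporary directory (illustrative only, untested against edge cases):

for item in /ro/* /ro/.[!.]* /ro/..?*; do
  # Skip glob patterns that didn't match anything
  [ -e "$item" ] || [ -L "$item" ] || continue
  if [ -L "$item" ]; then
    # Top-level symlink: recreate a symlink with the same target
    # (relative targets may need resolving, e.g. with readlink -f)
    ln -s "$(readlink "$item")" "$(basename "$item")"
  else
    # Regular file or directory: symlink to it
    ln -s "$item" "$(basename "$item")"
  fi
done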

The CLI interface should explicitly support read-only directories (why would we assume that the directory we want to back up is writeable…? Makes no sense to me), and an option to select the location of the directory we are backing up. I.e. it should be as simple as one command:

  • duplicacy backup --source /ro/ --storage /wherever.

I would suggest following the design of duplicity for the CLI.

  2. The second problem is the chunk size limitation. The software presents an illusion of control over chunk size. I tried to set the minimum chunk size to 100 bytes, as many of my files are only about this size. Alas, duplicacy does not report any kind of error, but silently replaces my choice with 1 MB. This is totally unworkable when backing up a large number of small files. There is also an opaqueness about how the chunk sizes are determined. What is the “target” or “average” chunk size? If I set a minimum chunk size of 0, for instance, then I expect duplicacy to make no chunk larger than a file, i.e. a chunk boundary would always exist between files: there’s no minimum chunk size, so chunks can be as small as needed for optimum storage efficiency. A minimum chunk size is specified for performance reasons, right?
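
For reference, the kind of invocation I mean looks roughly like this, using the -c/-min/-max chunk-size flags of the init command (values purely illustrative):

duplicacy init -c 4194304 -min 100 -max 16777216 sid /path/to/storage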

There is a comparison of some different chunk sizes between software here: Tweaking the chunker block size targets · Issue #1071 · restic/restic · GitHub

My use case is this:

I want to back up a read-only directory which contains a lot of files, with sizes that vary dramatically between 100B and about 25MiB. These files are named with random hashes, and modifications to these files are effectively random. The way that duplicacy performs packing and splitting (according to a forum post whose link I lost) is to order the files lexicographically by name or path, then sequentially create chunks. Because the files in my project are randomly modified and named, a small change to the data of the project corresponds to changes in basically random locations within this long list of filenames. So, roughly speaking, every modified file results in a modified chunk, and each file change can result in almost 4 MiB of additional chunks being created. Even when those files are minuscule, like 100B.

Example from the -tabular output of duplicacy, where small changes to the project (and hence to the repository size) result in a massively inflated backup size. (I initialised this repository with the default chunk size settings.)

   snap | rev |                          | files |     bytes | chunks |    bytes | uniq |    bytes | new |    bytes |
...
project | 363 | @ 2021-12-28 23:45 -hash |  1027 |  986,584K |    219 | 881,910K |    0 |        0 |   0 |        0 |
project | 364 | @ 2021-12-29 00:32 -hash |  1041 |  987,007K |    219 | 882,062K |    8 |  25,987K |  19 |  80,601K |
...
project | 400 | @ 2021-12-31 19:00 -hash |  1158 |  989,490K |    232 | 884,217K |    0 |        0 |   0 |        0 |
project | 401 | @ 2021-12-31 20:00 -hash |  1217 | 1000,750K |    231 | 894,744K |   37 | 204,203K |  59 | 300,956K |
project | 402 | @ 2021-12-31 21:00 -hash |  1319 | 1030,982K |    242 | 920,021K |    0 |        0 |  96 | 449,702K |
...
project | 423 | @ 2022-01-01 18:00 -hash |  1319 | 1030,982K |    242 | 920,021K |    0 |        0 |   0 |        0 |
project | 424 | @ 2022-01-01 19:00 -hash |  1327 | 1034,893K |    241 | 923,528K |    0 |        0 |  16 |  78,335K |
project | 425 | @ 2022-01-01 20:00 -hash |  1327 | 1034,893K |    241 | 923,528K |    0 |        0 |   0 |        0 |
project | 426 | @ 2022-01-01 21:00 -hash |  1367 | 1040,162K |    236 | 928,003K |    0 |        0 |  43 | 228,948K |

I would also note at this point that the column headers of -tabular are not descriptive enough and I’m left guessing about what they mean. I think(?) that the third bytes column indicates the size of unique chunks added in the given revision, and adding this column up gives the total backup size including all revisions.

Note how from revision 363 --> 364 there was an addition of 14 files, adding up to about 500KB. Yet, the additional chunks took up 26MB. Then from 400 --> 401 there was an addition of files adding up to about 10MB, but this took up 200MB. There is also some file renaming, moving and editing going on too, but mostly it is additions. When I look through the list of files in the project, it is clear that disparate files are being modified each time. This is visible from the last-modified dates in the output of ls -lahF (this particular directory was created on Dec 27):

-rw-r--r-- 1 root root  974K Dec 27 18:59 ac2caef427ebcd02079eb67f0c57b0888a18f3be3536d6992c8c9d5320f9a50e
-rw-r--r-- 1 root root  1.8M Dec 27 18:58 ac40a0e0cf75a7ff62834305373d36537cb1f5e409f78f778e3e518e23baa016
-rw-r--r-- 1 root root  1.1M Dec 27 18:59 ac473e8f176cf0ba245463d6c0fb6a02e08c974cb844031406754157283e7a9b
-rw-r--r-- 1 root root  1.8K Dec 27 18:59 ac6e2e5bbb943713f459b5a728462af7fdac6e9d5f5d48a2603903002e6bd243
-rw-r--r-- 1 root root  2.4K Dec 27 19:00 ac77a7fffd477f20c6fc20f2e93cb7d6f931a988db301e5d09c90a810019033f
-rw-r--r-- 1 root root  3.1M Dec 27 18:59 acce9b93f87ca6376ec1918f0e06cbe3539d0d717993264de39a992a7601edf2
-rw-r--r-- 1 root root  350K Dec 27 18:59 ad0df003c8cfd0edf85b0dcbfcd66635f006e75c78c087169ad435ffb0ed4a5d
-rw-r--r-- 1 root root   357 Dec 27 19:01 ad6e4aa7a0d45b974c8305d8f591ee537280ef86c0c64dc18f5ef453139e53a3
-rw-r--r-- 1 root root  4.0K Dec 27 19:01 ad95cb52efc18e9f7ae824b6aa3a2eed5b43d891dcd8301eb0c6838b6bf90409
-rw-r--r-- 1 root root  2.3K Dec 31 20:40 adc13edc09055c4d45475ef364bbf9289621aefb868cb3441a4efe9d4e6db024
-rw-r--r-- 1 root root  1.6M Dec 27 18:59 aded6ff3d9ce8ed6ed82fc409ba5351dcefa3ceabbab13a0a5d051cd63f3dbd9
-rw-r--r-- 1 root root   14K Dec 31 20:51 ae03b81f8389373dc23d599ccb700c7b189832861885ea28dcab31c1ede1b2df
-rw-r--r-- 1 root root   475 Dec 29 01:11 ae23817e0c2394b08a93383c9eced9bd280230365513be00f446bb152268579d
-rw-r--r-- 1 root root  1.6M Dec 27 19:00 ae4eb5cbf12f4a0ea16112a5e30e8ce4b85bde7921ff30dcdfecd5fef93cbbf5
-rw-r--r-- 1 root root  1.6M Dec 27 18:57 ae9166e3d5e66aefd21ccf18caa5de191aa05535181d5bd092a73d7265805f68
-rw-r--r-- 1 root root   361 Dec 27 18:57 ae9d02fbeb0df1a68c9c963cd79ba81170ac0d52fab4593c22e1d2b519111795
-rw-r--r-- 1 root root  1.6M Dec 27 18:59 aeb0370a0c641b90d693e2ebd45287bd992271bdc94208595c925de1864f41be
-rw-r--r-- 1 root root  4.9K Dec 27 19:00 aeb580669314297fd346b8542c108dadf460e059765e1097bd957114f8722648
-rw-r--r-- 1 root root   589 Dec 29 00:36 af8dbcbd3001600faab2b07cf7c21e3d557e379cce900ee12cf32cd5a414453f
-rw-r--r-- 1 root root   672 Dec 31 20:40 af8f2b245074311b42737a6600b72d9e5205a6f902dc87265e3b9f2ec75af23e
-rw-r--r-- 1 root root   13K Dec 29 01:18 afadf326fbeb3b7f3c84e243513aeafcd1e8dd02c8694d9a9e22402eeffeb8e1
-rw-r--r-- 1 root root  1.6M Dec 27 18:58 afec7fbc0175216d06ea75e8955acaa1b22d5f47bde119744587a1a0cba7fa22
-rw-r--r-- 1 root root  387K Dec 27 19:00 b0677d76ee7277ccd4cfbcc36e2e8a9452d423c37fc8c75350f60e8fa5661215
-rw-r--r-- 1 root root  1.5M Dec 27 18:59 b0dd4acb30e75b8622a1acc5b08f1ccd05a2c1acf62d44000675e9d85a9a9e08
-rw-r--r-- 1 root root  797K Jan  1 18:23 b13e0250436fc9212e1de885ca0859782e644d664ce3e0a4bb2232e171a0f760
-rw-r--r-- 1 root root  844K Dec 27 19:00 b145132d21af3d5d8d0df5d6591c09d93c54cec29d4556db26c7d655c00fab99
-rw-r--r-- 1 root root  794K Dec 27 18:57 b188266f05cd3a4497970530cd22957e3565ad15888b8480398e2da31a71119b
-rw-r--r-- 1 root root  2.4K Dec 31 20:32 b188cb066357ba476b4e9aa6e2e3e138626b82dbefeba5edf0ae2e934cc6896f
-rw-r--r-- 1 root root   358 Dec 27 19:01 b1e617f96fb8c581b4f4837de4f5a306a02f946f808ca3b28dd7cd46ec1aa622
-rw-r--r-- 1 root root  150K Dec 27 18:57 b26f5f3d8522186283193143c933a918748ae9b921ff101e8122d498070f1210
-rw-r--r-- 1 root root  1.8M Dec 27 18:58 b283f5bfb82a484cefce1923e5df5cb976c2bd7cbed9237db17a404e5e98de47
-rw-r--r-- 1 root root  1.6K Dec 27 19:00 b2c167c3548929ac53b4444c3fcf55fcfe21fef9a70e2cd4c42fc64805509458
-rw-r--r-- 1 root root  2.7M Dec 27 18:59 b2c26bf899feefaa4683f1be8da448a5950980549a79f6557a2b98e26560bcc6
-rw-r--r-- 1 root root   826 Dec 29 00:35 b2db27a98da1cfd118633995e0477ad96980ea14e0ee7e3019fdfe2f588062bd

Whenever I make just a small edit to the project, the new chunks are massive, which results in an inflated backup size:

   snap | rev |                          | files |     bytes | chunks |    bytes | uniq |    bytes | new |    bytes |
...
project | 426 | @ 2022-01-01 21:00 -hash |  1367 | 1040,162K |    236 | 928,003K |    0 |        0 |  43 | 228,948K |
project | all |                          |       |           |   1024 |   4,341M | 1024 |   4,341M |     |          |

At a mathematical level, this problem would not exist if duplicacy didn’t rely on the assumption that this map is continuous: f : <semantic changes to files> --> <mask of altered bits, ordered lexicographically by filename>. I.e., small changes to a project do not imply that a close cluster of bits change. Small changes can result in sparse changes to the data, and that is the case for me. The chunking method usually seems to handle this okay, but it assumes a certain ‘lower bound’ on continuity -- i.e. the smallest cluster of changed bits is at least the average chunk size. In my case, the lowest allowed chunk size of 1M is about ten thousand times larger than the more reasonable choice of 100B.

Maybe there would be untenable performance implications if chunks were so small. That’s fine, but then why not allow chunks to be as small as they need to be so that there’s at most one complete file in a chunk? That’s what I assumed the minimum chunk size setting was for. But it doesn’t work (it silently replaces my choice with the 1 MB minimum). Most of the repository data would be contained in large chunks, but small chunks might outnumber the big ones.

While the current documentation doesn’t explicitly say it, the -pref-dir option should be considered deprecated - use -repository during init instead.

This ensures the .duplicacy directory is created away from the repository root, and therefore shouldn’t require write access.
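
In practice that looks something like the following, run from a separate writable directory (paths illustrative):

mkdir prefs && cd prefs
duplicacy init -repository /ro sid /path/to/storage
duplicacy backup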

When initialising the storage with init, Duplicacy should tell you:

Invalid average chunk size: 100 is not a power of 2.

The value should be in bytes, e.g.: duplicacy init -chunk-size 524288 test F:\storage

Huh? If a modified file is 100 bytes, where’s the extra 3.9999 MB coming from? Unmodified files aren’t packed on incremental backups…

IMO, 100 bytes is probably way too small for your use case, and you should choose a decent average based on your overall data rather than the smallest files you have. Obviously, some testing may be needed if you’re that fussed about overhead, but it’s a balancing act - API calls to the storage (which may be more of a concern if it’s cloud storage versus local) compared to different levels of deduplication.

You could also choose fixed-size chunks by setting -min-chunk-size and -max-chunk-size to the same value. Doing so will also enforce chunks to be split at file boundaries.
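
For example, something along these lines (values purely illustrative):

duplicacy init -c 1048576 -min 1048576 -max 1048576 sid /path/to/storage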


Alright, the -repository flag seems to solve the read-only issue.

But the large chunk size is still a problem.

Run this script in an empty directory.

#!/usr/bin/env bash

# Make random read-only data in ./ro/
mkdir ro
for i in {1..10000}; do
  echo "$RANDOM" > "ro/$RANDOM.txt"
done
chmod -R ugo-w ro

# Make directories
mkdir repo
mkdir storage

# Initialise backup
cd repo
duplicacy init -repository "$(readlink -f ../ro)" sid "$(readlink -f ../storage)"
file_arr=($(ls -1 ../ro))
num_files=${#file_arr[@]}  # array length (number of files), not the string length of the first element
for i in {1..10}; do
  duplicacy backup -hash
  file_index=$((RANDOM%num_files))
  file_path="../ro/${file_arr[$file_index]}"
  chmod u+w "$file_path"
  echo "$RANDOM" > "$file_path"
  chmod u-w "$file_path"
done
duplicacy check -id "sid" -tabular
cd ..

# Cleanup
chmod -R ugo+w ro
rm -r ro
rm -r repo
rm -r storage

One tiny file is modified between each backup. Tabular output of stats:

  snap | rev |                          | files | bytes | chunks |  bytes | uniq |  bytes | new | bytes |
   sid |   1 | @ 2022-01-10 11:10 -hash |  8598 |   47K |      4 |   673K |    3 |   673K |   4 |  673K |
   sid |   2 | @ 2022-01-10 11:10 -hash |  8598 |   47K |      4 |   673K |    3 |   673K |   4 |  673K |
   sid |   3 | @ 2022-01-10 11:10 -hash |  8598 |   47K |      4 |   673K |    3 |   673K |   3 |  673K |
   sid |   4 | @ 2022-01-10 11:10 -hash |  8598 |   47K |      4 |   673K |    3 |   673K |   3 |  673K |
   sid |   5 | @ 2022-01-10 11:10 -hash |  8598 |   47K |      4 |   673K |    3 |   673K |   3 |  673K |
   sid |   6 | @ 2022-01-10 11:10 -hash |  8598 |   47K |      4 |   673K |    3 |   673K |   3 |  673K |
   sid |   7 | @ 2022-01-10 11:10 -hash |  8598 |   47K |      4 |   673K |    3 |   673K |   3 |  673K |
   sid |   8 | @ 2022-01-10 11:10 -hash |  8598 |   47K |      4 |   673K |    3 |   673K |   4 |  673K |
   sid |   9 | @ 2022-01-10 11:10 -hash |  8598 |   47K |      4 |   673K |    4 |   673K |   4 |  673K |
   sid |  10 | @ 2022-01-10 11:10 -hash |  8598 |   47K |      4 |   673K |    3 |   673K |   3 |  673K |
   sid | all |                          |       |       |     34 | 6,731K |   34 | 6,731K |     |       |

The default chunk size is 4MB, so the deduplication overhead should be at the MB level. The total size of the files in your test is only 6.7MB, and Duplicacy wasn’t really optimized for usage like this.

Nonstandard use is why the chunk size can be chosen via parameters, yes? So can you remove the hard limits on minimum chunk size and allow us to specify it all the way down to 1 byte if we want?

It doesn’t matter if the metadata of chunks is significantly larger than the minimum chunk size, because it’s still a minimum chunk size which can be used for small files. Statistically, I think most of the data in my repository is contained in large files.

Also, the total size of the repository is not the issue. The issue is that the smallest change duplicacy can track is 1MB (the minimum chunk size); usually it will be 4MB. So any small change to any file anywhere, if the changes are sparse over the lexicographically ordered list of files, will cost about 4MB per change. Even if only a single byte changes!

Change 100 bytes in random places across that great list of files, and you’ll get 400MB worth of new chunks to track, because the chunk size is too large. I need a bigger repository to demonstrate this.

Another test with 100 times more files (had to double the filename length to keep them unique):

  snap | rev |                          |  files |  bytes | chunks |    bytes | uniq |    bytes | new |   bytes |
   sid |   1 | @ 2022-01-10 12:47 -hash | 999347 | 5,523K |     39 |  84,668K |    3 |   3,573K |  39 | 84,668K |
   sid |   2 | @ 2022-01-10 12:48 -hash | 999347 | 5,523K |     39 |  84,668K |    3 |   3,573K |   3 |  3,573K |
   sid |   3 | @ 2022-01-10 12:49 -hash | 999347 | 5,523K |     38 |  84,669K |   19 |  39,520K |  19 | 39,520K |
   sid |   4 | @ 2022-01-10 12:50 -hash | 999347 | 5,523K |     40 |  84,670K |   21 |  39,521K |  21 | 39,521K |
   sid |   5 | @ 2022-01-10 12:50 -hash | 999347 | 5,523K |     37 |  84,667K |   18 |  39,518K |  18 | 39,518K |
   sid |   6 | @ 2022-01-10 12:51 -hash | 999347 | 5,523K |     37 |  84,666K |    3 |   6,209K |  18 | 39,517K |
   sid |   7 | @ 2022-01-10 12:51 -hash | 999347 | 5,523K |     37 |  84,667K |    3 |   6,210K |   3 |  6,210K |
   sid |   8 | @ 2022-01-10 12:52 -hash | 999347 | 5,523K |     37 |  84,667K |    3 |   6,210K |   3 |  6,210K |
   sid |   9 | @ 2022-01-10 12:52 -hash | 999347 | 5,523K |     37 |  84,666K |    3 |   6,209K |   3 |  6,209K |
   sid |  10 | @ 2022-01-10 12:53 -hash | 999347 | 5,523K |     37 |  84,667K |    3 |   6,210K |   3 |  6,210K |
   sid | all |                          |        |        |    130 | 271,162K |  130 | 271,162K |     |         |

Here’s a script which replicates the approximate situation that I find myself in with my repository. The repository size is over 100MB. A sequence of 10 incremental backups is made, with just 100 bytes of difference between them. The total backup size is over 1GB.

Run this in an empty directory.

#!/usr/bin/env bash


CONFIG_NUM_DATAFILE=1000
CONFIG_NUM_BACKUPS=10
CONFIG_NUM_CHANGES_PER_BACKUP=100

BC_POWFUN="
define pow(a, b) {
    if (scale(b) == 0) {
        return a ^ b;
    };
    return e(b*l(a));
};
"

myrootdir="$(pwd)"

cleanup() {
  echo 1>&2
  echo "Aborting; cleaning up..." 1>&2
  if cd "$myrootdir"; then
    chmod -R ugo+w ro 2>/dev/null
    rm -r ro          2>/dev/null
    rm -r repo        2>/dev/null
    rm -r storage     2>/dev/null
  fi
  echo "Aborted. Bye." 1>&2
  exit 2
}

trap cleanup SIGINT

# Make random read-only data in ./ro/
mkdir ro
printf "Generating initial data... %7s of %7s" 0 $CONFIG_NUM_DATAFILE
for ((i = 0; i < CONFIG_NUM_DATAFILE; i++)); do
  if [[ "$((RANDOM%10))" -lt "9" ]]; then
    # 9/10 chance - small file - 1K max
    file_size=$(bc <(echo $BC_POWFUN) -l <<< "pow(10,($((RANDOM%300))/100))")
    # Round to integer number of bytes
    file_size=$(echo $file_size | python -c "print(int(float(input()) + 0.5))")
    dd if=/dev/urandom of="ro/$RANDOM$RANDOM.txt" bs=1 count=$file_size &> /dev/null
  else
    # 1/10 chance - large file - 10M max
    file_size=$(bc <(echo $BC_POWFUN) -l <<< "pow(10,($((RANDOM%400))/100))" )
    # Round to integer number of KB
    file_size=$(echo $file_size | python -c "print(int(float(input()) + 0.5))")
    dd if=/dev/urandom of="ro/$RANDOM$RANDOM.txt" bs=1K count=$file_size &> /dev/null
  fi
  # Round to integer number of bytes
  printf "\rGenerating initial data... %7s of %7s" $((i+1)) $CONFIG_NUM_DATAFILE
done
echo
echo

# Make data read-only
chmod -R ugo-w ro

# Make directories
mkdir repo
mkdir storage

# cd
cd repo

# Initialise backup
echo "Initialising duplicacy repository"
duplicacy init -repository "$(readlink -f ../ro)" sid "$(readlink -f ../storage)"

# Function to change N files at random
make_random_changes() {
  local num_changes=$1
  local file_index
  local file_path
  local i
  for ((i = 0; i < num_changes; i++)); do
    file_index=$((RANDOM%num_files))
    file_path="../ro/${file_arr[$file_index]}"
    chmod u+w "$file_path"
    echo -ne \\xef | dd conv=notrunc bs=1 count=1 of="$file_path" &> /dev/null
    #echo "$RANDOM" > "$file_path"
    chmod u-w "$file_path"
  done
}
file_arr=($(ls -1 ../ro))
num_files=${#file_arr[@]}

echo
echo "Making single-byte changes $CONFIG_NUM_CHANGES_PER_BACKUP files at random, then backing up. Repeating $CONFIG_NUM_BACKUPS times."
for ((i = 0; i < CONFIG_NUM_BACKUPS; i++)); do
  make_random_changes $CONFIG_NUM_CHANGES_PER_BACKUP
  duplicacy backup -hash -threads 4 > /dev/null
done
echo
echo "Running 'duplicacy check ...'"
echo
duplicacy check -id "sid" -tabular

# Cleanup
cleanup

End of output (numbers subject to pseudorandomness):

...
  snap | rev |                          | files |   bytes | chunks |    bytes | uniq |    bytes | new |   bytes |
   sid |   1 | @ 2022-01-10 14:09 -hash |  1000 | 76,338K |     22 |  76,729K |   16 |  59,041K |  22 | 76,729K |
   sid |   2 | @ 2022-01-10 14:09 -hash |  1000 | 76,338K |     21 |  76,729K |   14 |  54,498K |  15 | 59,040K |
   sid |   3 | @ 2022-01-10 14:09 -hash |  1000 | 76,338K |     22 |  76,729K |   16 |  58,627K |  16 | 58,627K |
   sid |   4 | @ 2022-01-10 14:09 -hash |  1000 | 76,338K |     21 |  76,728K |   17 |  65,275K |  17 | 65,275K |
   sid |   5 | @ 2022-01-10 14:09 -hash |  1000 | 76,338K |     21 |  76,728K |   16 |  63,735K |  16 | 63,735K |
   sid |   6 | @ 2022-01-10 14:09 -hash |  1000 | 76,338K |     18 |  76,728K |   14 |  68,278K |  14 | 68,278K |
   sid |   7 | @ 2022-01-10 14:09 -hash |  1000 | 76,338K |     19 |  76,727K |   14 |  61,858K |  15 | 68,278K |
   sid |   8 | @ 2022-01-10 14:09 -hash |  1000 | 76,338K |     19 |  76,727K |   13 |  57,315K |  14 | 61,858K |
   sid |   9 | @ 2022-01-10 14:09 -hash |  1000 | 76,338K |     19 |  76,727K |   14 |  63,734K |  14 | 63,734K |
   sid |  10 | @ 2022-01-10 14:09 -hash |  1000 | 76,338K |     19 |  76,727K |   14 |  63,734K |  14 | 63,734K |
   sid | all |                          |       |         |    157 | 649,293K |  157 | 649,293K |     |         |

First of all, why are you using -hash? It’s unnecessary to do this every backup. This alone makes a significant difference in your test.

Secondly, why didn’t you try -chunk-size like I suggested? 4096 seems to work well. Hell, even -min-chunk-size 1 works, although it takes a fair while and doesn’t seem to improve storage efficiency anyway.

I’m using -hash because I want to / need to for my application. It’s not even relevant to this thread.

I did use -chunk-size already, did you read my original post?

Please elaborate on the reasons.

There is nothing about it in your original post; I only see the suggestion to try fixed chunk size (twice!) and no response from you.

Please keep discussion civil.

Please elaborate on the reasons.

The reasons are not relevant, but briefly: according to gchen, the -hash option is necessary to ensure that all files are backed up. It affects CPU performance, and CPU performance is not the issue in this thread. -hash can be left out when it is known that timestamps will change with any file modification. In the above script, I guess that timestamps of files are not always modified because the script runs so quickly. It is a red herring. I need and/or want it for my application, because I cannot guarantee that file changes happen in a way that modifies the timestamp.

Please keep discussion civil.

Please don’t assume that I am being sarcastic when I ask if someone read my original post. I explained my reasoning behind why this was an issue, relating to continuity and lexicographic ordering. Nobody commented on this. It’s reasonable to think that nobody read it.

If I have made a mistake in my reasoning somewhere, then please pinpoint it.

There is nothing about it in your original post

Original post:

I tried to set the minimum chunk size to 100 bytes, as many of my files are only about this size. Alas, duplicacy does not indicate any kind of error, but silently replaces my choice with 1 MB.

I have been experimenting with different parameters and I cannot replicate the hard 1 MB minimum chunk size limit that I observed initially. Perhaps I made a mistake there. If so, it would be productive to point that out rather than suggesting I try a thing I already tried.


To reiterate, my issue is the deduplication inefficiency. Dropping the -hash option is not acceptable; the test script does not always update timestamps. It’s a red herring. Nonetheless, the issue is visible even with the -hash option omitted, just less severe.

Here’s an updated script:

#!/usr/bin/env bash


CONFIG_NUM_DATAFILE=1000
CONFIG_NUM_BACKUPS=10
CONFIG_NUM_CHANGES_PER_BACKUP=100

BC_POWFUN="
define pow(a, b) {
    if (scale(b) == 0) {
        return a ^ b;
    };
    return e(b*l(a));
};
"

myrootdir="$(pwd)"

cleanup() {
  echo 1>&2
  echo "Aborting; cleaning up..." 1>&2
  if cd "$myrootdir"; then
    chmod -R ugo+w ro 2>/dev/null
    rm -r ro          2>/dev/null
    rm -r repo        2>/dev/null
    rm -r storage     2>/dev/null
  fi
  echo "Aborted. Bye." 1>&2
  exit 2
}

trap cleanup SIGINT

# Make random read-only data in ./ro/
mkdir ro
printf "Generating initial data... %7s of %7s" 0 $CONFIG_NUM_DATAFILE
for ((i = 0; i < CONFIG_NUM_DATAFILE; i++)); do
  if [[ "$((RANDOM%10))" -lt "9" ]]; then
    # 9/10 chance - small file - 1K max
    file_size=$(bc <(echo $BC_POWFUN) -l <<< "pow(10,($((RANDOM%300))/100))")
    # Round to integer number of bytes
    file_size=$(echo $file_size | python -c "print(int(float(input()) + 0.5))")
    dd if=/dev/urandom of="ro/$RANDOM$RANDOM.txt" bs=1 count=$file_size &> /dev/null
  else
    # 1/10 chance - large file - 10M max
    file_size=$(bc <(echo $BC_POWFUN) -l <<< "pow(10,($((RANDOM%400))/100))" )
    # Round to integer number of KB
    file_size=$(echo $file_size | python -c "print(int(float(input()) + 0.5))")
    dd if=/dev/urandom of="ro/$RANDOM$RANDOM.txt" bs=1K count=$file_size &> /dev/null
  fi
  # Round to integer number of bytes
  printf "\rGenerating initial data... %7s of %7s" $((i+1)) $CONFIG_NUM_DATAFILE
done
echo
echo

# Make data read-only
chmod -R ugo-w ro

# Make directories
mkdir repo
mkdir storage

# cd
cd repo

# Initialise backup
echo "Initialising duplicacy repository"
#duplicacy init -c 512 -min 1 -max 10M -repository "$(readlink -f ../ro)" sid "$(readlink -f ../storage)"
duplicacy init -repository "$(readlink -f ../ro)" sid "$(readlink -f ../storage)"
duplicacy info -repository "$(readlink -f ../ro)" "$(readlink -f ../storage)"

# Function to change N files at random
make_random_changes() {
  trap cleanup SIGINT
  local num_changes=$1
  local file_index
  local file_path
  local i
  local RAND
  for ((i = 0; i < num_changes; i++)); do
    RAND=$RANDOM$RANDOM$RANDOM # Big in case we have a lot of files...
    file_index=$((RAND%num_files))
    file_path="../ro/${file_arr[$file_index]}"
    chmod u+w "$file_path"
    echo -ne \\xef | dd conv=notrunc bs=1 count=1 of="$file_path" &> /dev/null
    #echo "$RANDOM" > "$file_path"
    chmod u-w "$file_path"
  done
}
file_arr=($(ls -1 ../ro))
num_files=${#file_arr[@]}

echo
echo "Making single-byte changes $CONFIG_NUM_CHANGES_PER_BACKUP files at random, then backing up. Repeating $CONFIG_NUM_BACKUPS times."
for ((i = 0; i < CONFIG_NUM_BACKUPS; i++)); do
  make_random_changes $CONFIG_NUM_CHANGES_PER_BACKUP
  #duplicacy backup -threads 4 > /dev/null
  duplicacy backup -hash -threads 4 > /dev/null
done
echo
echo "Running 'duplicacy check ...'"
echo
duplicacy check -id "sid" -tabular | grep -E "^...... [|] "

# Cleanup
cleanup

Here are tests with the duplication inefficiency shown as a ratio. It is the average number of new bytes stored per byte changed: the total backup size, including all chunks, minus the ‘real’ size of the latest snapshot, divided by 900 (9 incremental rounds of 100 single-byte changes each). This adds up quickly over time. I am only running 10 backups here, but the real repo takes hourly backups.
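
As a rough worked example, using the figures from the first test below and treating the K suffix in the tables as thousands of bytes (my reading of the output):

# total chunk bytes across all revisions, minus the 'real' size of the latest snapshot,
# divided by the ~900 bytes actually changed over the 9 incremental backups
echo $(( (991561 - 132264) * 1000 / 900 ))   # prints 954774, i.e. the ~954,000 ratio quoted for the first test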

Remember this is a demonstration. Use your imagination for how this might scale for larger repositories, with more file changes, etc.

Here’s a test with -hash and duplicacy init -c 512 -min 1 -max 10M ... (takes maybe 15+ minutes). Duplication inefficiency: ~954,000 (yes, that’s right: storing nearly a million times more than strictly necessary, per incremental change):

Generating initial data...    1000 of    1000

Initialising duplicacy repository
/home/user/tmp/duptest/ro will be backed up to /home/user/tmp/duptest/storage with id sid
Compression level: 100
Average chunk size: 4194304
Maximum chunk size: 8388608
Minimum chunk size: 1
Chunk seed: 6475706c6963616379

Making single-byte changes 100 files at random, then backing up. Repeating 10 times.

Running 'duplicacy check ...'

  snap | rev |                          | files |    bytes | chunks |    bytes | uniq |    bytes | new |    bytes |
   sid |   1 | @ 2022-01-11 12:56 -hash |  1000 | 132,264K |     20 | 132,875K |   15 |  99,979K |  20 | 132,875K |
   sid |   2 | @ 2022-01-11 12:57 -hash |  1000 | 132,264K |     20 | 132,875K |   13 |  83,530K |  15 |  99,979K |
   sid |   3 | @ 2022-01-11 12:58 -hash |  1000 | 132,264K |     20 | 132,875K |   14 |  91,754K |  14 |  91,754K |
   sid |   4 | @ 2022-01-11 12:58 -hash |  1000 | 132,264K |     20 | 132,875K |   13 |  83,530K |  16 | 108,202K |
   sid |   5 | @ 2022-01-11 12:59 -hash |  1000 | 132,264K |     20 | 132,875K |   13 |  83,530K |  14 |  91,754K |
   sid |   6 | @ 2022-01-11 13:00 -hash |  1000 | 132,264K |     20 | 132,875K |   12 |  75,306K |  15 |  99,978K |
   sid |   7 | @ 2022-01-11 13:01 -hash |  1000 | 132,264K |     20 | 132,874K |   13 |  83,530K |  13 |  83,530K |
   sid |   8 | @ 2022-01-11 13:02 -hash |  1000 | 132,264K |     20 | 132,874K |   14 |  91,753K |  15 |  99,978K |
   sid |   9 | @ 2022-01-11 13:03 -hash |  1000 | 132,264K |     20 | 132,874K |   13 |  83,529K |  14 |  91,753K |
   sid |  10 | @ 2022-01-11 13:04 -hash |  1000 | 132,264K |     20 | 132,874K |   14 |  91,753K |  14 |  91,753K |
   sid | all |                          |       |          |    150 | 991,561K |  150 | 991,561K |     |          |

Aborting; cleaning up...
Aborted. Bye.

Here’s a test with duplicacy init -c 512 -min 1 -max 10M ... only. Duplication inefficiency: ~132,000:

Generating initial data...    1000 of    1000

Initialising duplicacy repository
/home/user/tmp/duptest/ro will be backed up to /home/user/tmp/duptest/storage with id sid
Compression level: 100
Average chunk size: 4194304
Maximum chunk size: 8388608
Minimum chunk size: 1
Chunk seed: 6475706c6963616379

Making single-byte changes 100 files at random, then backing up. Repeating 10 times.

Running 'duplicacy check ...'

  snap | rev |                          | files |    bytes | chunks |    bytes | uniq |    bytes | new |    bytes |
   sid |   1 | @ 2022-01-11 13:17 -hash |  1000 | 106,778K |     17 | 107,289K |    3 |      91K |  17 | 107,289K |
   sid |   2 | @ 2022-01-11 13:18       |  1000 | 106,778K |     19 | 119,290K |    3 |      92K |   5 |  12,092K |
   sid |   3 | @ 2022-01-11 13:18       |  1000 | 106,778K |     21 | 130,382K |    3 |      92K |   5 |  11,184K |
   sid |   4 | @ 2022-01-11 13:18       |  1000 | 106,778K |     22 | 132,307K |    3 |      92K |   4 |   2,016K |
   sid |   5 | @ 2022-01-11 13:18       |  1000 | 106,778K |     24 | 145,847K |    3 |      92K |   5 |  13,632K |
   sid |   6 | @ 2022-01-11 13:18       |  1000 | 106,778K |     26 | 154,486K |    3 |      92K |   5 |   8,731K |
   sid |   7 | @ 2022-01-11 13:18       |  1000 | 106,778K |     30 | 187,151K |    3 |      93K |   7 |  32,757K |
   sid |   8 | @ 2022-01-11 13:19       |  1000 | 106,778K |     32 | 202,419K |    3 |      93K |   5 |  15,360K |
   sid |   9 | @ 2022-01-11 13:19       |  1000 | 106,778K |     34 | 211,316K |    3 |      93K |   5 |   8,989K |
   sid |  10 | @ 2022-01-11 13:19       |  1000 | 106,778K |     36 | 224,416K |    5 |  13,192K |   5 |  13,192K |
   sid | all |                          |       |          |     63 | 225,249K |   63 | 225,249K |     |          |

Aborting; cleaning up...
Aborted. Bye.

Here’s a test with -hash only. Duplication inefficiency: ~924,000:

Generating initial data...    1000 of    1000

Initialising duplicacy repository
/home/user/tmp/duptest/ro will be backed up to /home/user/tmp/duptest/storage with id sid
Compression level: 100
Average chunk size: 4194304
Maximum chunk size: 16777216
Minimum chunk size: 1048576
Chunk seed: 6475706c6963616379

Making single-byte changes 100 files at random, then backing up. Repeating 10 times.

Running 'duplicacy check ...'

  snap | rev |                          | files |    bytes | chunks |    bytes | uniq |    bytes | new |    bytes |
   sid |   1 | @ 2022-01-11 13:10 -hash |  1000 | 111,714K |     29 | 112,245K |   21 |  94,423K |  29 | 112,245K |
   sid |   2 | @ 2022-01-11 13:10 -hash |  1000 | 111,714K |     29 | 112,245K |   18 |  81,702K |  24 | 100,794K |
   sid |   3 | @ 2022-01-11 13:10 -hash |  1000 | 111,714K |     29 | 112,245K |   19 |  80,812K |  21 |  88,220K |
   sid |   4 | @ 2022-01-11 13:10 -hash |  1000 | 111,714K |     28 | 112,245K |   18 |  83,722K |  22 |  96,064K |
   sid |   5 | @ 2022-01-11 13:10 -hash |  1000 | 111,714K |     24 | 112,244K |   13 |  75,151K |  15 |  87,549K |
   sid |   6 | @ 2022-01-11 13:10 -hash |  1000 | 111,714K |     26 | 112,244K |   17 |  85,089K |  19 |  88,670K |
   sid |   7 | @ 2022-01-11 13:10 -hash |  1000 | 111,714K |     24 | 112,244K |   18 |  91,254K |  18 |  91,254K |
   sid |   8 | @ 2022-01-11 13:10 -hash |  1000 | 111,714K |     24 | 112,244K |   15 |  85,303K |  18 |  91,052K |
   sid |   9 | @ 2022-01-11 13:10 -hash |  1000 | 111,714K |     26 | 112,244K |   16 |  83,454K |  17 |  85,859K |
   sid |  10 | @ 2022-01-11 13:10 -hash |  1000 | 111,714K |     25 | 112,244K |   20 | 101,802K |  20 | 101,802K |
   sid | all |                          |       |          |    203 | 943,513K |  203 | 943,513K |     |          |

Aborting; cleaning up...
Aborted. Bye.

Here’s a test with neither. Duplication inefficiency: ~38,000:

Generating initial data...    1000 of    1000

Initialising duplicacy repository
/home/user/tmp/duptest/ro will be backed up to /home/user/tmp/duptest/storage with id sid
Compression level: 100
Average chunk size: 4194304
Maximum chunk size: 16777216
Minimum chunk size: 1048576
Chunk seed: 6475706c6963616379

Making single-byte changes 100 files at random, then backing up. Repeating 10 times.

Running 'duplicacy check ...'

  snap | rev |                          | files |   bytes | chunks |    bytes | uniq |    bytes | new |   bytes |
   sid |   1 | @ 2022-01-11 13:16 -hash |  1000 | 68,192K |     16 |  68,552K |    3 |      91K |  16 | 68,552K |
   sid |   2 | @ 2022-01-11 13:16       |  1000 | 68,192K |     17 |  72,902K |    3 |      92K |   4 |  4,442K |
   sid |   3 | @ 2022-01-11 13:16       |  1000 | 68,192K |     18 |  74,409K |    3 |      92K |   4 |  1,599K |
   sid |   4 | @ 2022-01-11 13:16       |  1000 | 68,192K |     22 |  87,857K |    3 |      92K |   7 | 13,540K |
   sid |   5 | @ 2022-01-11 13:16       |  1000 | 68,192K |     23 |  91,392K |    3 |      92K |   4 |  3,627K |
   sid |   6 | @ 2022-01-11 13:16       |  1000 | 68,192K |     24 |  95,829K |    3 |      92K |   4 |  4,530K |
   sid |   7 | @ 2022-01-11 13:16       |  1000 | 68,192K |     25 |  97,001K |    3 |      92K |   4 |  1,264K |
   sid |   8 | @ 2022-01-11 13:16       |  1000 | 68,192K |     26 |  98,455K |    3 |      92K |   4 |  1,546K |
   sid |   9 | @ 2022-01-11 13:16       |  1000 | 68,192K |     29 | 101,314K |    3 |      92K |   6 |  2,951K |
   sid |  10 | @ 2022-01-11 13:16       |  1000 | 68,192K |     30 | 101,886K |    4 |     665K |   4 |    665K |
   sid | all |                          |       |         |     57 | 102,718K |   57 | 102,718K |     |         |

Aborting; cleaning up...
Aborted. Bye.

First, you only needed -hash in your small test because the script ran so fast. If the backup takes less than a second and then you modify files immediately after, the next backup won’t be able to detect changes because the timestamps appear to be the same. This issue never occurs in the real world if you give Duplicacy a reasonable workload.

Second, your definition of duplication inefficiency is kind of unusual. If you understand Duplicacy’s pack and split approach, you’ll know that all your 1000 files are packed and split into around 20 chunks. And now you’re changing 100 files. Most of these chunks will be affected.

I’m sure the numbers will look much better if you choose a much smaller chunk size and at the same time get rid of the -hash option. For example, duplicacy init -c 1k. No need to specify -min and -max. I also suspect that you didn’t start with a new storage every time – the chunk size parameters can’t be changed after the initialization and if you run init multiple times these parameters will be ignored.

It’s very relevant. Your test is basically constructed to generate a worst case scenario (in the sense that Duplicacy is expected to find changes at the 1 byte level) and compounding it with -hash - which makes the outcome even worse.

Totally understood you need it for your application, but in this test scenario you need to make it abundantly clear that that’s what you wanna do, since de-duplication efficiency is clearly impacted by -hash - especially when used with extreme chunk size variations - and de-duplication efficiency is the topic at hand!

Did you read my suggestion to try a more reasonable average like 4096? IMO, you simply won’t get what you’re looking for by specifying such extreme min and max chunk sizes.

Furthermore, some of your tests don’t appear to match the parameters you’ve set:

Why does the second test show the same parameters as the first?

Here’s a test run using -hash -c 1k

Generating initial data...    1000 of    1000

Initialising duplicacy repository
/home/darren/Desktop/test/ro will be backed up to /home/darren/Desktop/test/storage with id sid
Compression level: 100
Average chunk size: 1024
Maximum chunk size: 4096
Minimum chunk size: 256
Chunk seed: 6475706c6963616379

Making single-byte changes 100 files at random, then backing up. Repeating 10 times.

Running 'duplicacy check ...'

  snap | rev |                          | files |    bytes | chunks |    bytes |   uniq |    bytes |    new |    bytes |
   sid |   1 | @ 2022-01-11 06:19 -hash |  1000 | 130,191K | 112029 | 139,156K |    287 |     351K | 112029 | 139,156K |
   sid |   2 | @ 2022-01-11 06:21 -hash |  1000 | 130,191K | 112026 | 139,156K |    207 |     233K |    295 |     359K |
   sid |   3 | @ 2022-01-11 06:21 -hash |  1000 | 130,191K | 112023 | 139,156K |    206 |     250K |    291 |     361K |
   sid |   4 | @ 2022-01-11 06:21 -hash |  1000 | 130,191K | 112023 | 139,156K |    170 |     206K |    274 |     350K |
   sid |   5 | @ 2022-01-11 06:21 -hash |  1000 | 130,191K | 112047 | 139,158K |    192 |     187K |    273 |     317K |
   sid |   6 | @ 2022-01-11 06:21 -hash |  1000 | 130,191K | 112041 | 139,158K |    195 |     197K |    258 |     291K |
   sid |   7 | @ 2022-01-11 06:21 -hash |  1000 | 130,191K | 112038 | 139,157K |    182 |     178K |    276 |     319K |
   sid |   8 | @ 2022-01-11 06:21 -hash |  1000 | 130,191K | 112052 | 139,159K |    188 |     175K |    272 |     294K |
   sid |   9 | @ 2022-01-11 06:21 -hash |  1000 | 130,191K | 112055 | 139,159K |    163 |     150K |    256 |     267K |
   sid |  10 | @ 2022-01-11 06:21 -hash |  1000 | 130,191K | 112057 | 139,159K |    219 |     252K |    219 |     252K |
   sid | all |                          |       |          | 114443 | 141,970K | 114443 | 141,970K |        |          |

Aborting; cleaning up...
Aborted. Bye.

Aside from the fact each test run is very randomised, results are pretty consistent. ~200K of unique chunks per revision (roughly 2K per changed file) seems reasonably efficient to me.

@gchen

your definition of duplication inefficiency is kind of unusual

It’s the ratio between the amount of storage that duplicacy requires for incremental backups, and the theoretical minimum. This is probably the simplest, most sensible metric a person could use to measure the deduplication efficiency.

(Actually the theoretical minimum is slightly higher, because you’d have to store metadata, but it doesn’t make much of a difference)

Your test is basically constructed to generate a worst case scenario

My test is constructed to emulate the real-world application on a smaller scale. It does a pretty good job of that. I have replicated the same issue as I have in the real world application. It’s not a contrived test.

de-duplication efficiency is clearly impacted by -hash

It’s impacted by -hash because this is a test script which updates files so fast. I explained this previously. The -hash option is a red herring. It is necessary to supply -hash to this test to see how deduplication efficiency is affected. If you leave -hash out then not all files are backed up.

As @gchen said

First, you only needed -hash in your small test because the script ran so fast. If the backup takes less than a second and then you modify files immediately after, the next backup won’t be able to detect changes because the timestamps appear to be the same. This issue never occurs in real world if you give Duplicacy a reasonable workload

Yes, I said this earlier. That’s why -hash needs to be supplied, so that the test matches the real-world situation where the changes actually occur at different timestamps.

It would be useful to explain how duplicacy compares timestamps, because if it only takes minute-level resolution into account, this makes sense. If it looks at seconds, milliseconds, etc., then this explanation does not make sense, as the test does take almost a minute between each round of file changes.

In my application, changes happen more slowly and so do the backups. Each time, every change is recorded. Supplying -hash ensures that the test script records all changes, just like the real world application does.

Did you read my suggestion to try a more reasonable average like 4096?

I went above and beyond, and tried 512 bytes. That would increase storage efficiency further, at the cost of more time. I’m fine with that.

Some of your tests don’t appear to match the parameters you’ve set

The inputs/outputs I posted are correct. The duplicacy program does not let the user set parameters as they please. It presents an illusion of choice, then replaces my choice of 512 with 4194304. There’s nothing I can do about that. @gchen said

I also suspect that you didn’t start with a new storage every time

which is demonstrably false. The proof is in the source code of the script I provided. All the relevant subdirectories are created at the start of the script, and destroyed afterwards.

Here’s a test run using -hash -c 1k

Yep, this has better deduplication efficiency because the chunk size is smaller. As you can see above, I tried to set the chunk size smaller and duplicacy overwrote the average chunk size to 4M. That’s a bug in the program and the documentation, related to what I complained about in my initial post. So how am I supposed to apply this? Randomly guess at parameter settings until I find one which duplicacy doesn’t ignore?

This is what I asked for in my original post: let the user set the chunk sizes like the CLI and documentation tell us we can. I already know that small chunk sizes will help alleviate the issue I’m having. I followed the documentation and attempted to set the chunk size. My setting was ignored and replaced with something else. That’s one of the issues I raised in my original post, albeit with slightly different details.

A smarter solution to this problem is to have files larger than the -min chunk size always be contained in their own dedicated chunk.

Also, setting -c 1k is not very practical on a repository which contains a mix of large and small files. So, there is a -min and -max setting. Good. I set the -min low and the -max high. Duplicacy should ‘smartly’ use smaller chunks on smaller files, and larger chunks on larger ones. I don’t know what the ‘average’ chunk size means, because the documentation doesn’t say. I have no way of knowing what chunk size settings are optimal when the way those settings are used is opaque. This is an issue I mentioned in my original post.

ISTM you need to adjust your test script to flush/sync the file state and sleep at least 1s before the next backup. Using -hash is a hack here, since you simply won’t need it for most real world applications and it heavily distorts the outcome of the test.

Would also suggest setting a starting seed for all random calls so that the data set is deterministic and can be compared to subsequent runs, i.e. the same data, different parameters - you’ll never know what effect adjustments to the min/max chunk size might have when the backup size is wildly different each run.
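
For what it’s worth, bash’s $RANDOM can be seeded simply by assigning to it, so a sketch of that at the top of the script would be:

# Seed bash's $RANDOM so file names, sizes and the choice of changed files are repeatable.
# (File contents still come from /dev/urandom, so the bytes themselves remain non-deterministic.)
RANDOM=42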

Have you compared them? Because in my testing, efficiency is hardly improved and the runtime is significantly greater; that’s why I warned to choose a decent average based on your overall data rather than the smallest files you have.

If it was possible to achieve theoretical maximum efficiency by detecting 1 byte differences, the whole concept of rolling hashes and chunks would fly out the window. It’s unrealistic IMO.

But they’re clearly not. Dunno if it’s because you’re editing the wrong script, but I ran your test with varying chunk sizes and the input parameters properly match the output. At no point in my recent testing or usage has Duplicacy replaced or overwritten my choice with different numbers, so you should probably look into why it’s doing that in your environment. Perhaps adding some debugging to the script, and -v or -d to the duplicacy calls, might find the culprit?

I modified the script and ran this test:

===================== Start of script ======================
                     myrootdir:  "/tmp/tmp.JuUsptkglu"
           CONFIG_NUM_DATAFILE:  "1000"
             CONFIG_NUM_BACKUP:  "10"
 CONFIG_NUM_CHANGES_PER_BACKUP:  "100"
   CONFIG_DUP_INIT_EXTRA_FLAGS:  ( "-c" "512" "-min" "1" "-max" "10M" )
   CONFIG_DUP_BACK_EXTRA_FLAGS:  ( "-threads" "4" )
CONFIG_SECS_WAIT_BETWEEN_BACKUP:  "1.1"

===================== Initialising... ======================
Generating initial data...    1000 of    1000

Making directories...
Running 'duplicacy init' with configured flags...
Initialising duplicacy repository
/tmp/tmp.JuUsptkglu/ro will be backed up to /tmp/tmp.JuUsptkglu/storage with id sid
Compression level: 100
Average chunk size: 512
Maximum chunk size: 10485760
Minimum chunk size: 1
Chunk seed: 6475706c6963616379
Hash key: 6475706c6963616379
ID key: 6475706c6963616379

Making single-byte changes 100 files at random, then backing up. Repeating 10 times.
Backup   1 in progress... sleeping for 1.1 seconds... done.
Backup   2 in progress... sleeping for 1.1 seconds... done.
Backup   3 in progress... sleeping for 1.1 seconds... done.
Backup   4 in progress... sleeping for 1.1 seconds... done.
Backup   5 in progress... sleeping for 1.1 seconds... done.
Backup   6 in progress... sleeping for 1.1 seconds... done.
Backup   7 in progress... sleeping for 1.1 seconds... done.
Backup   8 in progress... sleeping for 1.1 seconds... done.
Backup   9 in progress... sleeping for 1.1 seconds... done.
Backup  10 in progress... done.

Running 'duplicacy check ...'

  snap | rev |                          | files |    bytes | chunks |    bytes | uniq |    bytes | new |    bytes |
   sid |   1 | @ 2022-01-15 10:34 -hash |  1000 | 104,889K |     14 | 105,392K |    3 |      91K |  14 | 105,392K |
   sid |   2 | @ 2022-01-15 10:34       |  1000 | 104,889K |     15 | 108,267K |    3 |      92K |   4 |   2,966K |
   sid |   3 | @ 2022-01-15 10:34       |  1000 | 104,889K |     16 | 117,712K |    3 |      92K |   4 |   9,537K |
   sid |   4 | @ 2022-01-15 10:34       |  1000 | 104,889K |     17 | 127,221K |    3 |      92K |   4 |   9,601K |
   sid |   5 | @ 2022-01-15 10:34       |  1000 | 104,889K |     19 | 138,170K |    3 |      92K |   5 |  11,041K |
   sid |   6 | @ 2022-01-15 10:34       |  1000 | 104,889K |     20 | 146,589K |    3 |      92K |   4 |   8,511K |
   sid |   7 | @ 2022-01-15 10:34       |  1000 | 104,889K |     22 | 160,081K |    3 |      92K |   5 |  13,584K |
   sid |   8 | @ 2022-01-15 10:35       |  1000 | 104,889K |     23 | 168,357K |    3 |      92K |   4 |   8,368K |
   sid |   9 | @ 2022-01-15 10:35       |  1000 | 104,889K |     24 | 175,514K |    3 |      92K |   4 |   7,249K |
   sid |  10 | @ 2022-01-15 10:35       |  1000 | 104,889K |     26 | 190,397K |    5 |  14,975K |   5 |  14,975K |
   sid | all |                          |       |          |     53 | 191,228K |   53 | 191,228K |     |          |

Aborting; cleaning up...
Aborted. Bye.

The deduplication inefficiency is still awful. The ratio is about 96. And even with these minute changes, the total backup is already nearly double the size of the repository. Imagine making regular backups like that.

Why not split chunks at file boundaries? I can see no sensible reason why you’d want a chunk to spread across multiple files. The whole advantage of chunking is that you can record changes which are smaller than the files themselves. If whole files changed, then you’d use more of a snapshot approach with references to different versions of a file.

This comes down to the design philosophy of the pack-and-split method which assumes a continuity between semantic changes and the raw byte changes in the lexicographically-ordered file list. The continuity assumption is related to the minimum chunk size.

If chunk boundaries were enforced between files, then the single-byte changes could be packed much more efficiently. Apparently a side effect of setting a constant chunk size -c X is to enforce chunk boundaries between files, but I haven’t tested that because it’s not appropriate for a repository with a wide range of file sizes, where the chunk size should be determined dynamically.

The -hash option is necessary to ensure that everything is backed up. I don’t backup data just to find out that it’s actually not all there when I want to restore, because some modification happened which didn’t touch the timestamp.

I notice that the -hash option seems to actually set the number of threads to 1 (based on it being bound by a single thread). Duplicacy accepts the -threads X option but silently ignores it.

With the -hash option:

===================== Start of script ======================
                     myrootdir:  "/tmp/tmp.k6UyTcToPn"
           CONFIG_NUM_DATAFILE:  "1000"
             CONFIG_NUM_BACKUP:  "10"
 CONFIG_NUM_CHANGES_PER_BACKUP:  "100"
   CONFIG_DUP_INIT_EXTRA_FLAGS:  ( "-c" "512" "-min" "1" "-max" "10M" )
   CONFIG_DUP_BACK_EXTRA_FLAGS:  ( "-hash" "-threads" "4" )
CONFIG_SECS_WAIT_BETWEEN_BACKUP:  "1.1"

===================== Initialising... ======================
Generating initial data...    1000 of    1000

Making directories...
Running 'duplicacy init' with configured flags...
Initialising duplicacy repository
/tmp/tmp.k6UyTcToPn/ro will be backed up to /tmp/tmp.k6UyTcToPn/storage with id sid
Compression level: 100
Average chunk size: 512
Maximum chunk size: 10485760
Minimum chunk size: 1
Chunk seed: 6475706c6963616379
Hash key: 6475706c6963616379
ID key: 6475706c6963616379

Making single-byte changes 100 files at random, then backing up. Repeating 10 times.
Backup   1 in progress... sleeping for 1.1 seconds... done.
Backup   2 in progress... sleeping for 1.1 seconds... done.
Backup   3 in progress... sleeping for 1.1 seconds... done.
Backup   4 in progress... sleeping for 1.1 seconds... done.
Backup   5 in progress... sleeping for 1.1 seconds... done.
Backup   6 in progress... sleeping for 1.1 seconds... done.
Backup   7 in progress... sleeping for 1.1 seconds... done.
Backup   8 in progress... sleeping for 1.1 seconds... done.
Backup   9 in progress... sleeping for 1.1 seconds... done.
Backup  10 in progress... done.

Running 'duplicacy check ...'

  snap | rev |                          | files |    bytes | chunks |     bytes | uniq |     bytes | new |    bytes |
   sid |   1 | @ 2022-01-15 10:47 -hash |  1000 | 107,084K |     14 |  107,595K |   13 |  107,595K |  14 | 107,595K |
   sid |   2 | @ 2022-01-15 10:48 -hash |  1000 | 107,084K |     14 |  107,595K |   13 |  107,595K |  13 | 107,595K |
   sid |   3 | @ 2022-01-15 10:48 -hash |  1000 | 107,084K |     14 |  107,595K |   12 |   97,314K |  13 | 107,595K |
   sid |   4 | @ 2022-01-15 10:49 -hash |  1000 | 107,084K |     14 |  107,595K |   12 |   97,314K |  12 |  97,314K |
   sid |   5 | @ 2022-01-15 10:50 -hash |  1000 | 107,084K |     14 |  107,595K |   12 |   97,314K |  12 |  97,314K |
   sid |   6 | @ 2022-01-15 10:50 -hash |  1000 | 107,084K |     14 |  107,594K |   12 |   97,314K |  12 |  97,314K |
   sid |   7 | @ 2022-01-15 10:51 -hash |  1000 | 107,084K |     14 |  107,594K |   12 |   97,314K |  12 |  97,314K |
   sid |   8 | @ 2022-01-15 10:52 -hash |  1000 | 107,084K |     14 |  107,594K |   11 |   87,033K |  13 | 107,594K |
   sid |   9 | @ 2022-01-15 10:52 -hash |  1000 | 107,084K |     14 |  107,594K |   11 |   87,033K |  11 |  87,033K |
   sid |  10 | @ 2022-01-15 10:53 -hash |  1000 | 107,084K |     14 |  107,594K |   13 |  107,594K |  13 | 107,594K |
   sid | all |                          |       |          |    125 | 1014,266K |  125 | 1014,266K |     |          |

Aborting; cleaning up...
Aborted. Bye.

Looks like it’s re-packing almost every chunk.

At no point in my recent testing or usage has Duplicacy replaced or overwritten my choice with different numbers, so you should probably look into why it’s doing that in your environment.

I have spent so much time debugging already. Even wrote scripts to reproduce the problems on realistic data. When will the developers debug their application?

I think I’m going to have to look at alternatives. There are so many problems with this application.

  • Flag syntax is incorrect (-repository should be --repository to adhere to standards)
  • Options are ignored or unimplemented (-threads X with -hash when using duplicacy backup)
  • Behaviour is inconsistent. (setting chunk sizes sometimes does, sometimes does not work)
  • Option -hash should be the default, to ensure that data is actually backed up (the raison d’être of a backup solution)
  • Deduplication inefficiency is terrible (>1000 with necessary option -hash even when minimum chunk size is 1 byte)
  • Undocumented behaviors and options
  • No commands to extract machine-parseable metadata about a backup repository (e.g. should be able to output a CSV containing backup times, deduplication information etc)
  • Data meant for human-consumption is not properly documented (tabular columns are not described well enough)
  • Ability to ‘tag’ backups but no way to tag existing backups.
  • duplicacy check -stats implies -all and all revisions (can’t check individual ‘snapshots’)
  • Separate backup branches are called ‘snapshots’ even though the word ‘snapshot’ is universally used to refer to a copy of data taken at a particular instant in time.
  • Specifying a revision range -r 1-123 fails if there do not exist revisions with ids 1, 2, … all the way up to 123. (And with pruning many of them won’t exist)
  • Specifying a revision range requires user scripting to figure out the exact list of revision numbers, then you need to supply them all as commandline options -r <id> (the kind of thing sketched after this list).
  • Commandline options are limited in number, so with a large number of backup revisions it is actually not possible to duplicacy check all revisions in one command. So the user must do further scripting. (commandline options for a program should not be unbounded in number if possible)
  • No manpage for duplicacy. Self-documentation of duplicacy is almost non-existent -- we have to come to the forum to see how things work. And even then, the behaviour of different options is not documented comprehensively in one thread.
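
For illustration, the kind of scripting I mean, assuming duplicacy list prints lines containing “revision <number>” (the -id flag and that output format are my assumptions here):

# Collect every existing revision number and pass each as a separate -r option
revs=$(duplicacy list -id sid | grep -oE 'revision [0-9]+' | awk '{printf "-r %s ", $2}')
# $revs is deliberately unquoted so the shell splits it into individual -r arguments
duplicacy check -id sid $revs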

Thanks for taking a look, but if there’s no intention of fixing problems with the application then you probably should say so from the outset, so we don’t waste our time.

I have explained and demonstrated thoroughly at this point the fundamental problem of the continuity assumption inherent in duplicacy’s pack-and-split method. It results in ludicrous duplication ratios. This is clear in theory, and I demonstrated in practice with reproducible scripts. Furthermore this has a real world effect on my actual application whose backup is occupying more storage space than it should. It’s up to you now whether you choose to do anything about these issues.

Goodbye.


In case anyone wants to test with the latest version of the script:

#!/usr/bin/env bash

############### Configuration ################
              CONFIG_NUM_DATAFILE=1000
                CONFIG_NUM_BACKUP=10
    CONFIG_NUM_CHANGES_PER_BACKUP=100
  CONFIG_SECS_WAIT_BETWEEN_BACKUP="1.1"  # Seconds to wait between backups. Decimals are OK. Helps ensure that data changes result in different timestamps.
# Doesn't work.
#               CONFIG_RANDOM_SEED=123   # Set to blank string to not use a seed for $RANDOM
      CONFIG_DUP_INIT_EXTRA_FLAGS=( 
        "-c" "512" "-min" "1" "-max" "10M" )
      CONFIG_DUP_BACK_EXTRA_FLAGS=( "-hash" "-threads" "4" )
##############################################

# We will need this math function for the (B)erkeley (C)alculator program
BC_POWFUN="
define pow(a, b) {
    if (scale(b) == 0) {
        return a ^ b;
    };
    return e(b*l(a));
};
"

myrootdir="$(mktemp -d)"
cd "$myrootdir"

cleanup() {
  echo 1>&2
  echo "Aborting; cleaning up..." 1>&2
  if cd "$myrootdir"; then
    chmod -R ugo+w ro 2>/dev/null
    rm -r ro          2>/dev/null
    rm -r repo        2>/dev/null
    rm -r storage     2>/dev/null
    # Safely remove current working directory myrootdir
    rmdir "$(readlink -f "$myrootdir")"
  fi
  echo "Aborted. Bye." 1>&2
  exit 2
}

trap cleanup SIGINT

# Function for printing pretty lines
center_line() {
	#width="$(tput cols)"
	local width=60
	local padding="$(printf '%0.1s' ={1..500})"
	printf '%*.*s %s %*.*s\n' 0 "$(((width-2-${#1})/2))" "$padding" "$1" 0 "$(((width-1-${#1})/2))" "$padding"
}

center_line "Start of script"

# Print my vars
arr_var_names=( 
  "myrootdir" 
  "CONFIG_NUM_DATAFILE" "CONFIG_NUM_BACKUP" 
  "CONFIG_NUM_CHANGES_PER_BACKUP" 
  #"CONFIG_RANDOM_SEED" 
  "CONFIG_DUP_INIT_EXTRA_FLAGS" 
  "CONFIG_DUP_BACK_EXTRA_FLAGS" 
  "CONFIG_SECS_WAIT_BETWEEN_BACKUP" )

for var_name in "${arr_var_names[@]}"; do
	# Check if array
	if declare -p "$var_name" 2> /dev/null | grep -q '^declare \-a'; then
		declare -n tmp_arr="$var_name"
		printf "%30s:  ( " "$var_name"
		printf '"%s" ' "${tmp_arr[@]}"
		printf ')'
		echo
		unset -n tmp_arr
	else
		printf "%30s:  %s\n" "$var_name" "\"${!var_name}\""
	fi
done
echo

center_line "Initialising..."

#echo "Setting random seed..."
## Initialise RANDOM according to seed
#if [ -z "$CONFIG_RANDOM_SEED" ]; then
#	RANDOM="$CONFIG_RANDOM_SEED"
#fi

# Make random data in ./ro/
mkdir ro
printf "Generating initial data... %7s of %7s" 0 $CONFIG_NUM_DATAFILE
for ((i = 0; i < CONFIG_NUM_DATAFILE; i++)); do
  if [[ "$((RANDOM%10))" -lt "9" ]]; then
    # 9/10 chance - small file - 1K max
    file_size=$(bc <(echo $BC_POWFUN) -l <<< "pow(10,($((RANDOM%300))/100))")
    # Round to integer number of bytes
    file_size=$(echo $file_size | python -c "print(int(float(input()) + 0.5))")
    dd if=/dev/urandom of="ro/$RANDOM$RANDOM.txt" bs=1 count=$file_size &> /dev/null
  else
    # 1/10 chance - large file - 10M max
    file_size=$(bc <(echo $BC_POWFUN) -l <<< "pow(10,($((RANDOM%400))/100))" )
    # Round to integer number of KB
    file_size=$(echo $file_size | python -c "print(int(float(input()) + 0.5))")
    dd if=/dev/urandom of="ro/$RANDOM$RANDOM.txt" bs=1K count=$file_size &> /dev/null
  fi
  # Print every nth file created.
  # n = 1
  if [ "$(( i % 1 ))" == "0" ] || [ "$i" == "$(( CONFIG_NUM_DATAFILE - 1 ))" ]; then
    printf "\rGenerating initial data... %7s of %7s" $((i+1)) $CONFIG_NUM_DATAFILE
  fi
done
echo
echo

# Make data read-only
chmod -R ugo-w ro

# Make directories
echo "Making directories..."
mkdir repo
mkdir storage

# cd
cd repo

echo "Running 'duplicacy init' with configured flags..."
# Initialise backup
echo "Initialising duplicacy repository"
#printf '"%s" ' "${CONFIG_DUP_INIT_EXTRA_FLAGS[@]}" && echo
duplicacy init "${CONFIG_DUP_INIT_EXTRA_FLAGS[@]}" -repository "$(readlink -f ../ro)" sid "$(readlink -f ../storage)"
duplicacy -v info -repository "$(readlink -f ../ro)" "$(readlink -f ../storage)"

# Function to change N files at random
make_random_changes() {
  trap cleanup SIGINT
  local num_changes=$1
  local file_index
  local file_path
  local i
  local RAND
  for ((i = 0; i < num_changes; i++)); do
    RAND=$RANDOM$RANDOM$RANDOM # Big in case we have a lot of files...
    file_index=$((RAND%num_files))
    file_path="../ro/${file_arr[$file_index]}"
    chmod u+w "$file_path"
    echo -ne \\xef | dd conv=notrunc bs=1 count=1 of="$file_path" &> /dev/null
    #echo "$RANDOM" > "$file_path"
    chmod u-w "$file_path"
  done
}
file_arr=($(ls -1 ../ro))
num_files=${#file_arr[@]}

echo
echo "Making single-byte changes $CONFIG_NUM_CHANGES_PER_BACKUP files at random, then backing up. Repeating $CONFIG_NUM_BACKUP times."
for ((i = 1; i <= CONFIG_NUM_BACKUP; i++)); do
  make_random_changes $CONFIG_NUM_CHANGES_PER_BACKUP
  printf "Backup %03s in progress..." "$i"
  duplicacy backup "${CONFIG_DUP_BACK_EXTRA_FLAGS[@]}" > /dev/null
  [ "$i" != "$CONFIG_NUM_BACKUP" ] && echo -n " sleeping for $CONFIG_SECS_WAIT_BETWEEN_BACKUP seconds..." && sleep "$CONFIG_SECS_WAIT_BETWEEN_BACKUP"
  echo " done."
done
echo
echo "Running 'duplicacy check ...'"
echo
duplicacy check -id "sid" -tabular | grep -E "^...... [|] "

# Cleanup
cleanup