Hi, there are two significant problems I have with duplicacy which I think should be corrected in the upcoming overhaul.
The first problem: duplicacy cannot back up a read-only directory. It's a bit crazy that this is even an issue. Anyway, there is a workaround for this, which is needlessly complicated. Assume that we want to back up `/ro/`, which is a read-only directory. This is what we have to do:

- Create a temporary directory `tmp`.
- `cd` into `tmp`. This will be the location we back up.
- For the directory you actually want to back up, make a symlink to each and every item inside that directory. Place those symlinks inside `tmp`.
- If the directory you want to back up contains symlinks at the top level, then for those you instead need to create symlinks that point to the same targets (duplicacy follows symlinks at the top level).
- Run the backup.
- Delete the temporary directory: `cd .. && rm -r tmp`.
Note that the `-pref-dir` option doesn't help. It still tries to write to the current directory, and you're only allowed to back up the current directory (for some reason?).
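For reference, this is roughly the sort of thing I tried (paths here are illustrative); the preferences can be placed elsewhere, but duplicacy still wants to write to, and back up, the current directory:

```bash
cd /ro/
duplicacy init -pref-dir /tmp/duplicacy-prefs project /wherever
duplicacy backup
```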
In bash, to transparently accomplish this goal of backing up a read-only directory and cleaning things up properly, we need to do something like:
```bash
TMP_DIR="$(mktemp -d)"
cd "$TMP_DIR"

# For each top-level item in /ro/: if it is itself a symlink, recreate a
# symlink pointing at the same target; otherwise symlink the item directly.
for item in /ro/*; do
    if [ -L "$item" ]; then
        ln -s "$(readlink "$item")" "$(basename "$item")"
    else
        ln -s "$item" "$(basename "$item")"
    fi
done

duplicacy init ...
duplicacy backup ...

cd - 1> /dev/null
rm -r "$TMP_DIR"
unset TMP_DIR
```
The CLI should explicitly support read-only directories (why would we assume that the directory we want to back up is writeable? Makes no sense to me), along with an option to select the location of the directory we are backing up. I.e. it should be as simple as one command: `duplicacy backup --source /ro/ --storage /wherever`. I would suggest following the design of duplicity for the CLI.
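For comparison, duplicity takes the source directory and a destination URL as plain positional arguments, and as far as I know it doesn't need to write anything into the source directory, so a read-only source is a non-issue (destination URL here is illustrative):

```bash
duplicity /ro/ file:///mnt/backups/project
```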
The second problem is the chunk size limitation. The software presents the illusion that we have control over the chunk size. I tried to set the minimum chunk size to 100 bytes, as many of my files are only about this size. Alas, duplicacy does not indicate any kind of error, but silently replaces my choice with 1 MB. This is totally unworkable when backing up a large number of small files. There is also an opaqueness about how the chunk sizes are determined. What is the "target" or "average" chunk size? If I set a minimum chunk size of 0, for instance, then I expect duplicacy to make a chunk no larger than a file, i.e. a chunk boundary would always exist between files; since there is no minimum chunk size, it can be as small as needed for optimum storage efficiency. A minimum chunk size is specified for performance reasons, right?
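For concreteness, this is the kind of thing I tried (the storage path is illustrative, and I'm quoting the flag name from memory):

```bash
# Request a 100-byte minimum chunk size. duplicacy accepts this without any
# error or warning, but silently uses a 1M minimum instead.
duplicacy init -min-chunk-size 100 project /wherever
```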
There is a comparison of chunk sizes across different backup software here: Tweaking the chunker block size targets · Issue #1071 · restic/restic · GitHub
My use case is this:
I want to back up a read-only directory which contains a lot of files whose sizes vary dramatically between 100 B and about 25 MiB. These files are named with random hashes, and modifications to them are effectively random. The way that duplicacy performs packing and splitting (according to a forum post whose link I lost) is to order the files lexicographically by name or path, then sequentially create chunks. Because the files in my project are randomly modified and randomly named, a small change to the data of the project corresponds to changes at essentially random locations within this long list of filenames. So, roughly speaking, every modified file results in a modified chunk, and each file change can result in almost 4 MiB of additional chunks being created, even when those files are minuscule, like 100 B.
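As a rough mental model of that packing behaviour (my own sketch of my understanding, not duplicacy's actual code; duplicacy uses variable-size, content-defined chunks rather than fixed 4 MiB splits, but the lexicographic ordering is the point here):

```bash
# Concatenate every file in lexicographic path order into one stream, then cut
# the stream into ~4 MiB pieces. A change to any one small file dirties the
# whole multi-megabyte piece it happens to fall into.
find /ro/ -type f -print0 | sort -z | xargs -0 cat | split -b 4M - chunk-
```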
Here is an example from the `-tabular` output of duplicacy, where small changes in the repository size (indeed, only small changes in the project) result in a massively inflated backup size (I initialised this repository with the default chunk size settings):
```
snap | rev | | files | bytes | chunks | bytes | uniq | bytes | new | bytes |
...
project | 363 | @ 2021-12-28 23:45 -hash | 1027 | 986,584K | 219 | 881,910K | 0 | 0 | 0 | 0 |
project | 364 | @ 2021-12-29 00:32 -hash | 1041 | 987,007K | 219 | 882,062K | 8 | 25,987K | 19 | 80,601K |
...
project | 400 | @ 2021-12-31 19:00 -hash | 1158 | 989,490K | 232 | 884,217K | 0 | 0 | 0 | 0 |
project | 401 | @ 2021-12-31 20:00 -hash | 1217 | 1000,750K | 231 | 894,744K | 37 | 204,203K | 59 | 300,956K |
project | 402 | @ 2021-12-31 21:00 -hash | 1319 | 1030,982K | 242 | 920,021K | 0 | 0 | 96 | 449,702K |
...
project | 423 | @ 2022-01-01 18:00 -hash | 1319 | 1030,982K | 242 | 920,021K | 0 | 0 | 0 | 0 |
project | 424 | @ 2022-01-01 19:00 -hash | 1327 | 1034,893K | 241 | 923,528K | 0 | 0 | 16 | 78,335K |
project | 425 | @ 2022-01-01 20:00 -hash | 1327 | 1034,893K | 241 | 923,528K | 0 | 0 | 0 | 0 |
project | 426 | @ 2022-01-01 21:00 -hash | 1367 | 1040,162K | 236 | 928,003K | 0 | 0 | 43 | 228,948K |
```
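For reference, the tables in this post come from the check command's tabular statistics mode; the invocation is something like:

```bash
duplicacy check -tabular
```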
I would also note at this point that the column headers of `-tabular` are not descriptive enough and I'm left guessing about what they mean. I think(?) that the third `bytes` column indicates the size of unique chunks added in the given revision, and adding this column up gives the total backup size including all revisions.
Note how from revision 363 --> 364 there was an addition of 14 files, adding up to about 500 KB, yet the additional chunks took up 26 MB, roughly a 50x overhead. Then from 400 --> 401 there was an addition of files adding up to about 10 MB, but this took up 200 MB. There is also some file renaming, moving, and editing going on, but mostly it is additions. When I look through the list of files in the project, it is clear that disparate files are being modified each time, i.e. it is visible from the last-modified dates in the output of `ls -lahF` (this particular directory was created on Dec 27):
```
-rw-r--r-- 1 root root 974K Dec 27 18:59 ac2caef427ebcd02079eb67f0c57b0888a18f3be3536d6992c8c9d5320f9a50e
-rw-r--r-- 1 root root 1.8M Dec 27 18:58 ac40a0e0cf75a7ff62834305373d36537cb1f5e409f78f778e3e518e23baa016
-rw-r--r-- 1 root root 1.1M Dec 27 18:59 ac473e8f176cf0ba245463d6c0fb6a02e08c974cb844031406754157283e7a9b
-rw-r--r-- 1 root root 1.8K Dec 27 18:59 ac6e2e5bbb943713f459b5a728462af7fdac6e9d5f5d48a2603903002e6bd243
-rw-r--r-- 1 root root 2.4K Dec 27 19:00 ac77a7fffd477f20c6fc20f2e93cb7d6f931a988db301e5d09c90a810019033f
-rw-r--r-- 1 root root 3.1M Dec 27 18:59 acce9b93f87ca6376ec1918f0e06cbe3539d0d717993264de39a992a7601edf2
-rw-r--r-- 1 root root 350K Dec 27 18:59 ad0df003c8cfd0edf85b0dcbfcd66635f006e75c78c087169ad435ffb0ed4a5d
-rw-r--r-- 1 root root 357 Dec 27 19:01 ad6e4aa7a0d45b974c8305d8f591ee537280ef86c0c64dc18f5ef453139e53a3
-rw-r--r-- 1 root root 4.0K Dec 27 19:01 ad95cb52efc18e9f7ae824b6aa3a2eed5b43d891dcd8301eb0c6838b6bf90409
-rw-r--r-- 1 root root 2.3K Dec 31 20:40 adc13edc09055c4d45475ef364bbf9289621aefb868cb3441a4efe9d4e6db024
-rw-r--r-- 1 root root 1.6M Dec 27 18:59 aded6ff3d9ce8ed6ed82fc409ba5351dcefa3ceabbab13a0a5d051cd63f3dbd9
-rw-r--r-- 1 root root 14K Dec 31 20:51 ae03b81f8389373dc23d599ccb700c7b189832861885ea28dcab31c1ede1b2df
-rw-r--r-- 1 root root 475 Dec 29 01:11 ae23817e0c2394b08a93383c9eced9bd280230365513be00f446bb152268579d
-rw-r--r-- 1 root root 1.6M Dec 27 19:00 ae4eb5cbf12f4a0ea16112a5e30e8ce4b85bde7921ff30dcdfecd5fef93cbbf5
-rw-r--r-- 1 root root 1.6M Dec 27 18:57 ae9166e3d5e66aefd21ccf18caa5de191aa05535181d5bd092a73d7265805f68
-rw-r--r-- 1 root root 361 Dec 27 18:57 ae9d02fbeb0df1a68c9c963cd79ba81170ac0d52fab4593c22e1d2b519111795
-rw-r--r-- 1 root root 1.6M Dec 27 18:59 aeb0370a0c641b90d693e2ebd45287bd992271bdc94208595c925de1864f41be
-rw-r--r-- 1 root root 4.9K Dec 27 19:00 aeb580669314297fd346b8542c108dadf460e059765e1097bd957114f8722648
-rw-r--r-- 1 root root 589 Dec 29 00:36 af8dbcbd3001600faab2b07cf7c21e3d557e379cce900ee12cf32cd5a414453f
-rw-r--r-- 1 root root 672 Dec 31 20:40 af8f2b245074311b42737a6600b72d9e5205a6f902dc87265e3b9f2ec75af23e
-rw-r--r-- 1 root root 13K Dec 29 01:18 afadf326fbeb3b7f3c84e243513aeafcd1e8dd02c8694d9a9e22402eeffeb8e1
-rw-r--r-- 1 root root 1.6M Dec 27 18:58 afec7fbc0175216d06ea75e8955acaa1b22d5f47bde119744587a1a0cba7fa22
-rw-r--r-- 1 root root 387K Dec 27 19:00 b0677d76ee7277ccd4cfbcc36e2e8a9452d423c37fc8c75350f60e8fa5661215
-rw-r--r-- 1 root root 1.5M Dec 27 18:59 b0dd4acb30e75b8622a1acc5b08f1ccd05a2c1acf62d44000675e9d85a9a9e08
-rw-r--r-- 1 root root 797K Jan 1 18:23 b13e0250436fc9212e1de885ca0859782e644d664ce3e0a4bb2232e171a0f760
-rw-r--r-- 1 root root 844K Dec 27 19:00 b145132d21af3d5d8d0df5d6591c09d93c54cec29d4556db26c7d655c00fab99
-rw-r--r-- 1 root root 794K Dec 27 18:57 b188266f05cd3a4497970530cd22957e3565ad15888b8480398e2da31a71119b
-rw-r--r-- 1 root root 2.4K Dec 31 20:32 b188cb066357ba476b4e9aa6e2e3e138626b82dbefeba5edf0ae2e934cc6896f
-rw-r--r-- 1 root root 358 Dec 27 19:01 b1e617f96fb8c581b4f4837de4f5a306a02f946f808ca3b28dd7cd46ec1aa622
-rw-r--r-- 1 root root 150K Dec 27 18:57 b26f5f3d8522186283193143c933a918748ae9b921ff101e8122d498070f1210
-rw-r--r-- 1 root root 1.8M Dec 27 18:58 b283f5bfb82a484cefce1923e5df5cb976c2bd7cbed9237db17a404e5e98de47
-rw-r--r-- 1 root root 1.6K Dec 27 19:00 b2c167c3548929ac53b4444c3fcf55fcfe21fef9a70e2cd4c42fc64805509458
-rw-r--r-- 1 root root 2.7M Dec 27 18:59 b2c26bf899feefaa4683f1be8da448a5950980549a79f6557a2b98e26560bcc6
-rw-r--r-- 1 root root 826 Dec 29 00:35 b2db27a98da1cfd118633995e0477ad96980ea14e0ee7e3019fdfe2f588062bd
```
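A quicker way to see just the recently touched files, instead of eyeballing the whole listing (this assumes GNU find; the cutoff date is only illustrative):

```bash
# List files modified after the initial import on Dec 27, with mtime, size, and name.
find . -type f -newermt '2021-12-28' -printf '%TY-%Tm-%Td %TH:%TM %10s %p\n' | sort
```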
Whenever I make just a small edit to the project, the new chunks are massive. It results in an inflated backup size:
```
snap | rev | | files | bytes | chunks | bytes | uniq | bytes | new | bytes |
...
project | 426 | @ 2022-01-01 21:00 -hash | 1367 | 1040,162K | 236 | 928,003K | 0 | 0 | 43 | 228,948K |
project | all | | | | 1024 | 4,341M | 1024 | 4,341M | | |
```
At a mathematical level, this problem would not exist if duplicacy didn't rely on the assumption that the map `f : <semantic changes to files> --> <mask of altered bits, ordered lexicographically by filename>` is continuous. That is, small changes to a project do not imply that a close cluster of bits changes; small changes can result in sparse changes to the data, and that is the case for me. The chunking method usually seems to handle this okay, but it assumes a certain 'lower bound' on continuity, i.e. that the smallest cluster of changed bits is at least the average chunk size. In my case, the lowest allowed chunk size of 1M is about ten thousand times larger than the more reasonable choice of 100 B.
Maybe there would be untenable performance implications if chunks were that small. That's fine, but then why not allow chunks to be as small as they need to be so that there is at most one complete file in a chunk? That's what I assumed the minimum chunk size setting was for, but it doesn't work (it silently replaces my choice with the 1 MB minimum). Most of the repository data would still be contained in large chunks, but the small chunks might outnumber the big ones.