Prune consumes 150+ GB of space locally

I have been very impressed with Duplicacy. While it does take a little while to get used to, since the documentation is a bit hit and miss (as versions and features change, the older documentation doesn’t get updated), I have managed to recover what I needed from backups without any errors.

My main beef is pruning. If I never prune, all is fine: backups are fast and life is good. But once I have pruned even once, the .duplicacy folder grows to an outrageous size (154GB) until the next prune runs. And I hate to prune very often because it takes hours upon hours upon hours to finish. Often I have to kill the process (I use the CLI) and start it again; only after multiple attempts will it finally finish and, in most cases, clean up the local .duplicacy folder.

I do backups to B2 and I am running the latest 2.2.0 version, and it behaves the same as previous versions did. What exactly is prune doing that takes so long? And why does the .duplicacy directory have to grow in size if you only run the prune command once?

Here is my prune command:

prune -keep 0:365 -keep 30:70 -keep 7:14 -keep 1:1

My /etc directory is 44M, but my /etc/.duplicacy folder is 15GB? Why does it need so much space to back up 44M of data?

To speed up pruning you can now use the -threads option; see the reference: Prune command details.
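For example, taking your command from above (the thread count here is just an illustration, pick whatever your connection can handle):

prune -keep 0:365 -keep 30:70 -keep 7:14 -keep 1:1 -threads 4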

After reading the docs (Cache usage details), I wouldn’t expect the directory to grow that big.

Can you show us the sizes of the folders inside?
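Something like this should do it (assuming the repository root is /etc as in your post, and the usual cache/<storage name>/ layout):

du -sh /etc/.duplicacy/*
du -sh /etc/.duplicacy/cache/*/*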

After digging around the forums more, I realized that I was storing everything in a single B2 bucket. So when I went to prune /etc, it was also factoring in some massive backups of website data, mail data, etc. My theory is that by creating a new bucket and using the new 2.2.0 feature of storing the content of /etc in a /etc directory inside the B2 bucket, I would limit pruning to only the /etc data and not have to deal with the rather large website and mail data.
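Roughly what I’m picturing, if my assumption about the 2.2.0 storage URL pointing at a directory inside the bucket is right (bucket and snapshot ID names are made up):

cd /etc
# new repository with its own bucket directory, so prune would only see /etc snapshots
duplicacy init etc-config b2://my-etc-bucket/etc
duplicacy backup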

Does this sound like logical duplicacy thinking?

I don’t know exactly how that works :-? , but what you could do is create a new repository in which you don’t store anything and just use it for regular pruning.

Would this work for you?

Why am I suggesting this instead of splitting the two repos? Because when you get around to pruning the “website data, mail data, etc.” you will run into the same trouble you’re having now.
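Roughly what I mean, sketched with made-up names (the storage URL would be the same bucket you already back up to):

mkdir -p ~/prune-only && cd ~/prune-only
# a repository that never backs anything up, used only to prune all snapshot IDs
duplicacy init prune-only b2://your-existing-bucket
duplicacy prune -a -keep 0:365 -keep 30:70 -keep 7:14 -keep 1:1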

(btw, did you try to use -threads?)

Well, my beef is not so much with the slowness; it’s that a week ago my /etc repository had a chunks directory of 154GB for a 44M backup. I also had 154GB in my websites repository. That’s a lot of space just for duplicating chunks in each area I want to back up.

I haven’t tried threads yet but will do so a little later today.

Wait, why is there a chunks folder in /etc?
Do you mean the chunks folder inside the cache folder?

Then I’ll just repeat the first request:

I deleted it to free up space and to use a different bucket for it. But when I went into the /etc/.duplicacy/cache folder, there was (I think) a logs directory that held 44M, a zombiechunks directory (or whatever it is called) that had basically nothing in it, and a chunks folder holding 99.99% of the 15GB. Last week, before I pruned, it was 154GB.

@gchen could you please add your input on why the /cache/chunks/ directory could grow so large?

I have no clue. But when I was digging around the forums after I posted, I found that if you store /etc, /root, /var/www/vhosts, /var/qmail/mailnames, and other separate backup directories and point them all at the same B2 bucket, then when you prune, it has to process all the chunks for all the different backups, including the massive vhosts and mailnames stores. So when you go to prune, it is basically downloading the entire set for each area being pruned, which is why vhosts and /etc each had 154GB chunks folders. That is also why it was taking hours upon hours (and so forth) to run and often had to be force-quit and relaunched numerous times.

It’s not downloading them, but it does have to list them (similar to dir or ls), so that it knows all the chunks in the storage and all the chunks needed by each snapshot, in order to decide what to prune.

IMO that should not occupy much space, though.
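If you want to see how much it has to walk through, I believe listing the snapshots from all IDs sharing that bucket works like this:

duplicacy list -a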

The entire bucket was only 800GB with everything combined, so something was messed up, which is why I am starting over and separating things out. My DB was in a separate bucket from everything else, so I will test the -threads option on it once this upload finishes, because it was just as slow.

If you run prune from /etc then it still needs to download all existing snapshots from the storage (including those from other repositories). The /cache/chunks/ directory only stores meta chunks that make up these snapshots, but if there are many revisions and if some repositories contain large numbers of files, then the /cache/chunks/ directory can grow to a surprisingly large size.

It is recommended to run prune from only one repository, but you need to add the -a option:

prune -a -keep 0:365 -keep 30:70 -keep 7:14 -keep 1:1

This is from Cache usage details:

At the end of a backup operation, Duplicacy will clean up the local cache in such a way that only chunks composing the snapshot file from the last backup will stay in the cache and all other chunks will be removed from it. However, if the prune command has been run before (which will leave a .duplicacy/collection folder in the repository), then the backup command won’t perform any cache cleanup and instead defer that to prune.

So if you only run prune once in a while, you can remove the .duplicacy/collection folder (if it is empty).
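A safe way to remove it is rmdir, which refuses to delete a directory that isn’t empty (path assuming the /etc repository from earlier):

rmdir /etc/.duplicacy/collection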