Prune command details

TheBestPessimist · 16 July 2018 17:37

The prune command has the task of deleting old/unwanted revisions and unused chunks from a storage.


Quick overview

NAME:
   duplicacy prune - Prune revisions by number, tag, or retention policy

USAGE:
   duplicacy prune [command options]

OPTIONS:
   -id <snapshot id>            delete revisions with the specified snapshot ID instead of the default one
   -all, -a                     match against all snapshot IDs
   -r <revision> [+]            delete the specified revisions
   -t <tag> [+]                 delete revisions with the specified tags
   -keep <n:m> [+]              keep 1 revision every n days for revisions older than m days
   -exhaustive                  remove all unreferenced chunks (not just those referenced by deleted snapshots)
   -exclusive                   assume exclusive access to the storage (disable two-step fossil collection)
   -dry-run, -d                 show what would have been deleted
   -delete-only                 delete fossils previously collected (if deletable) and don't collect fossils
   -collect-only                identify and collect fossils, but don't delete fossils previously collected
   -ignore <id> [+]             ignore revisions with the specified snapshot ID when deciding if fossils can be deleted
   -storage <storage name>      prune revisions from the specified storage
   -threads <n>                 number of threads used to prune unreferenced chunks

Usage

duplicacy prune [command options]

Options

Options marked with [+] can be passed more than once.

`-id <snapshot id>`

Delete revisions with the specified snapshot ID instead of the default one.

Example:

duplicacy prune -id computer-2

`-all, -a`

Run the prune command against all snapshot IDs in the selected storage.

Example:

duplicacy prune -all

`-r <revision> [+]`

Delete the specified revisions.

Examples:

duplicacy prune -r 6              # delete revision 6
duplicacy prune -r 344-350        # delete revisions 344 through 350 (inclusive)
duplicacy prune -r 310 -r 1322    # delete only the revisions 310 and 1322
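
The range form above is inclusive on both ends. As a minimal sketch of that reading, the following Python snippet expands `-r` style arguments into revision numbers. `expand_revisions` is a hypothetical helper written for illustration; it is not part of Duplicacy and does not reflect how Duplicacy parses its arguments internally.

```python
def expand_revisions(specs):
    """Expand -r style arguments such as "6" or "344-350" into revision numbers.

    Hypothetical helper for illustration only; not Duplicacy code.
    """
    revisions = []
    for spec in specs:
        if "-" in spec:
            start, end = (int(part) for part in spec.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {spec}")
            # The range is inclusive on both ends.
            revisions.extend(range(start, end + 1))
        else:
            revisions.append(int(spec))
    return revisions
```

For instance, `expand_revisions(["310", "1322"])` yields only those two revisions, while `expand_revisions(["344-350"])` yields seven.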

`-t <tag> [+]`

Delete revisions with the specified tags.

`-keep <n:m> [+]`

Keep 1 revision every n days for revisions older than m days.

The retention policies are specified by the -keep option, which accepts an argument in the form of two numbers n:m, where n indicates the number of days between two consecutive revisions to keep, and m means that the policy only applies to revisions at least m days old. If n is zero, any revisions older than m days will be removed.

Examples:

duplicacy prune -keep 1:7       # Keep a revision per (1) day for revisions older than 7 days
duplicacy prune -keep 7:30      # Keep a revision every 7 days for revisions older than 30 days
duplicacy prune -keep 30:180    # Keep a revision every 30 days for revisions older than 180 days
duplicacy prune -keep 0:360     # Keep no revisions older than 360 days

Multiple -keep options must be sorted by their m values in decreasing order.

For example, to combine the above policies into one line, it would become:

duplicacy prune -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7
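
To get a feel for how combined policies interact, here is a simplified Python model. It is a sketch under stated assumptions, not Duplicacy's actual implementation: revision ages are given in whole days, revisions are visited oldest first, and a revision is dropped when it is less than n days newer than the last kept one. The names `validate_policies` and `simulate_keep` are made up for this example.

```python
def validate_policies(policies):
    """Require (n, m) pairs to be sorted by m in decreasing order."""
    ages = [m for _, m in policies]
    if any(a < b for a, b in zip(ages, ages[1:])):
        raise ValueError("-keep options must be sorted by m in decreasing order")


def simulate_keep(ages_in_days, policies):
    """Return the ages (in days) of the revisions that survive.

    Simplified model for illustration; not Duplicacy code.
    policies is a list of (n, m) pairs, as in -keep n:m.
    """
    validate_policies(policies)
    kept = []
    last_kept_age = None
    for age in sorted(ages_in_days, reverse=True):  # oldest revision first
        # The first policy whose m the revision has reached applies.
        policy = next(((n, m) for n, m in policies if age >= m), None)
        if policy is None:
            kept.append(age)  # younger than every m: not touched
            continue
        n, _ = policy
        if n == 0:
            continue  # n == 0 removes everything at least m days old
        if last_kept_age is not None and last_kept_age - age < n:
            continue  # too close to the previously kept revision
        kept.append(age)
        last_kept_age = age
    return kept
```

With the combined policy above, written as `[(0, 360), (30, 180), (7, 30), (1, 7)]`, revisions aged 400, 200, 190, 100, 95, 93, 10, 10, 9, 3, 3 and 2 days come out of this model as 200, 100, 93, 10, 9, 3, 3 and 2.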

`-exhaustive`

Remove all unreferenced chunks (not just those referenced by deleted revisions).

The -exhaustive option will scan the list of all chunks in the storage, therefore it will find not only unreferenced chunks from deleted revisions, but also chunks that become unreferenced for other reasons, such as those from an incomplete backup.

It will also find any file that does not look like a chunk file.

In contrast, a normal prune command will only identify chunks referenced by deleted revisions but not any other revisions.

Example:

duplicacy prune -exhaustive
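
The difference between a normal and an -exhaustive prune can be sketched as set arithmetic. The Python below is an illustrative model only: revisions are represented as a dict mapping a revision number to the set of chunk names it references, and the function names are assumptions made for this example, not Duplicacy internals.

```python
def referenced(revisions, names):
    """Union of the chunk sets of the given revisions."""
    return set().union(*(revisions[name] for name in names))


def normal_candidates(revisions, deleted):
    """Chunks referenced by deleted revisions but by no remaining revision."""
    remaining = revisions.keys() - deleted
    return referenced(revisions, deleted) - referenced(revisions, remaining)


def exhaustive_candidates(storage_chunks, revisions, deleted):
    """Every chunk in the storage that no remaining revision references."""
    remaining = revisions.keys() - deleted
    return set(storage_chunks) - referenced(revisions, remaining)
```

In this model a chunk left behind by an incomplete backup appears in the storage listing but in no revision, so only `exhaustive_candidates` finds it.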

`-exclusive`

Assume exclusive access to the storage (disable two-step fossil collection).

The -exclusive option will assume that no other clients are accessing the storage, effectively disabling the two-step fossil collection algorithm.

With this option, the prune command will immediately remove unreferenced chunks.

WARNING: Only run -exclusive when you are sure that no other backup is running, on any other device or repository.

Example:

duplicacy prune -exclusive

`-dry-run, -d`

This option shows what changes the prune command would make without making them. It is guaranteed not to make any changes on the storage, not even creating the local fossil collection file.

Example:

After running this nothing will be modified in the storage, but duplicacy will show all output just like a normal run:

duplicacy prune -dry-run -all -exhaustive -exclusive

`-delete-only`

Delete fossils previously collected (if deletable) and don’t collect fossils.

Example:

duplicacy prune -delete-only

`-collect-only`

Identify and collect fossils, but don’t delete fossils previously collected.

Example:

duplicacy prune -collect-only

The -delete-only option will skip the fossil collection step, while the -collect-only option will skip the fossil deletion step.

`-ignore <id> [+]`

Ignore revisions with the specified snapshot ID when deciding if fossils can be deleted.

`-storage <storage name>`

Prune revisions from the specified storage instead of the default one.

Example:

duplicacy prune -storage google-drive

`-threads <n>`

This option specifies the number of threads used to prune chunks. Using more than one thread generally increases pruning speed.

You should test to find the best number of threads for your connection and storage provider, but using more than 30 threads is not advised as it will not improve speed significantly.

Example:

duplicacy prune -keep 1:7 -threads 10 # use 10 threads for the pruning process

Notes

Revisions to be deleted can be specified by numbers, by a tag, by retention policies, or by any combination of these categories.

Only one repository should run prune

Since Duplicacy encourages multiple repositories backing up to the same storage (so that deduplication will be efficient), users might want to run prune from each repository.

The design of Duplicacy, however, was based on the assumption that only one instance would run the prune command (using -all). This greatly simplifies the implementation.

It is also somewhat wasteful to have a prune command work on one repository ID only, since it still needs to download all backups for all other repository IDs in order to decide which chunks are to be deleted.

Finally, race conditions can in theory happen when two instances try to operate on the same chunk at the same time. In practice this may never happen, especially if the prune command runs after the backup, so that the instances start at random times.

Cache folder is is extremely big!

Please read Cache folder is is extremely big! 😱.

Pruning is logged

All prune actions are logged by default locally, on the machine where the prune command is executed, under .duplicacy/logs. The prune logs are named similarly to prune-log-20171230-142510.

In the same folder you will also find log files which are empty. There is no need to worry if the files are empty as this means that in that particular prune operation, nothing was pruned from the storage.
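
If you want to sort or filter these logs by time, the name can be parsed. The Python sketch below assumes that the suffix follows a YYYYMMDD-HHMMSS layout and that the file has no extension, as the example name above suggests; `prune_log_time` is a hypothetical helper, not something Duplicacy provides.

```python
from datetime import datetime

LOG_PREFIX = "prune-log-"


def prune_log_time(filename):
    """Parse the timestamp out of a name like prune-log-20171230-142510.

    Assumes a YYYYMMDD-HHMMSS suffix; illustration only.
    """
    if not filename.startswith(LOG_PREFIX):
        raise ValueError(f"not a prune log name: {filename}")
    return datetime.strptime(filename[len(LOG_PREFIX):], "%Y%m%d-%H%M%S")
```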

`-exhaustive` should be used sparingly

The -exhaustive option is only needed when there are known unreferenced chunks in the storage, for example, when a backup is interrupted by the user or terminated due to an error, and the files in the repository change afterwards.

It is not recommended to run the prune command regularly with this option without a recent incomplete backup, mainly because if there is an ongoing backup from a different computer, the prune command will mark as fossils all new chunks uploaded by that backup.

Although in the fossil deletion step the prune command can correctly identify that these chunks are actually referenced and thus turn them back into chunks, the cost of extra API calls can be excessive.

The last revision can only be deleted in `-exclusive` mode

The latest revision from each repository can’t be deleted in non-exclusive mode. In theory, a backup for that repository may be in progress and using the latest revision as its base, so removing that revision could remove chunks that the backup in progress still needs.

Corner cases when prune may delete too much

There are two corner cases in which a fossil that is still needed may be mistakenly deleted. The first occurs when a backup takes more than 7 days and started before the chunk was marked as a fossil: the prune command will think the repository has become inactive and will exclude it from the criteria for determining which fossils are safe to delete.

The other case occurs with an initial backup from a newly created repository that also started before the chunk was marked as a fossil. Since the prune command doesn’t know of the existence of such a repository at fossil deletion time, it may think the fossil isn’t needed any more by any backup and thus delete it permanently.

Therefore, a check command must be used if a backup is an initial backup or takes more than 7 days. Once a backup passes the check command, it is guaranteed that it won’t be affected by any future prune operations.
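
The rule above can be written down as a small predicate, for example in a wrapper script that decides whether to follow a backup with a check. This is a sketch only: `should_run_check` and `INACTIVITY_WINDOW` are hypothetical names, and how you measure the backup duration is left to your own tooling.

```python
from datetime import timedelta

# Repositories with no new backup in this window are treated as inactive.
INACTIVITY_WINDOW = timedelta(days=7)


def should_run_check(is_initial_backup, backup_duration):
    """Return True when a check is needed after the backup.

    Illustration of the rule in the text; not Duplicacy code.
    """
    return is_initial_backup or backup_duration > INACTIVITY_WINDOW
```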

Individual files cannot be pruned

Note that duplicacy always prunes entire revisions of entire snapshots, not of individual files. In other words: it is not possible to remove backups of specific files from the storage. This means, for example, if you realize after a couple of months, that you have accidentally been backing up some huge useless files, the only way to remove them from the storage to free up space is to prune each and every revision in which they are included.

Two-step fossil collection algorithm

The prune command implements the two-step fossil collection algorithm. It will first find fossil collection files from previous runs and check if contained fossils are eligible for permanent deletion (the fossil deletion step). Then it will search for snapshots to be deleted, mark unreferenced chunks as fossils (by renaming) and save them in a new fossil collection file stored locally (the fossil collection step).

For fossils collected in the fossil collection step to be eligible for safe deletion in the fossil deletion step, at least one new snapshot from each snapshot id must be created between two runs of the prune command. However, some repositories may not be set up to back up on a regular schedule, and would thus block other repositories from deleting any fossils. Duplicacy by default will ignore repositories that have no new backup in the past 7 days, and you can also use the -ignore option to skip certain repositories when deciding the deletion criteria.
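
The two steps can be illustrated with a small in-memory model in Python. This is a sketch under several assumptions, not Duplicacy's implementation: the storage is a pair of sets rather than renamed files, the latest revision number per snapshot id stands in for "a new snapshot was created", and the 7-day inactivity rule is left out. `Storage`, `collect_fossils` and `delete_fossils` are names invented for this example.

```python
class Storage:
    """Toy storage: live chunks and fossils are kept as plain sets."""

    def __init__(self, chunks):
        self.chunks = set(chunks)
        self.fossils = set()


def collect_fossils(storage, unreferenced, latest_revisions):
    """Collection step: turn unreferenced chunks into fossils.

    Returns a fossil collection that remembers the latest revision
    seen for every snapshot id at collection time.
    """
    collected = storage.chunks & set(unreferenced)
    storage.chunks -= collected
    storage.fossils |= collected
    return {"fossils": collected, "seen": dict(latest_revisions)}


def delete_fossils(storage, collection, latest_revisions, referenced_by_new, ignore=()):
    """Deletion step: delete fossils only if every snapshot id has a newer revision.

    Fossils referenced by the new revisions are turned back into chunks.
    Returns False (and changes nothing) if the collection is not deletable yet.
    """
    for snapshot_id, seen in collection["seen"].items():
        if snapshot_id in ignore:
            continue  # models the -ignore option
        if latest_revisions.get(snapshot_id, 0) <= seen:
            return False
    resurrected = collection["fossils"] & set(referenced_by_new)
    storage.fossils -= collection["fossils"]
    storage.chunks |= resurrected
    return True
```

In this model, running the deletion step before every snapshot id has a new revision simply returns False and leaves the fossils in place.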

Christoph · 24 August 2018 19:16

manne01 · 2 January 2020 18:16

-keep <n:m> [+]              keep 1 snapshot every n days for snapshots older than m days

If I understand nomenclature correctly, prune is deleting or “keeping” revisions, not snapshots!? So the description of the parameter “keep” should be

keep 1 revision every n days for revisions older than m days

Is that right? Can you update the command description? The whole nomenclature thing is still a bit confusing, I think this would help.

Same should be done in the command line version (and logs etc.)

towerbr · 2 January 2020 20:15

You are completely right.

I just updated the -keep option description and examples, as well as reviewed all other snapshot references here on this page, and updated them for revisions where applicable.

Thank you for pointing that out.

manne01 · 3 January 2020 07:47

Great, what about the wording in the CLI version and log outputs and so on?

towerbr · 3 January 2020 12:32

Just placed a PR about the help texts:

Now it is up to @gchen to evaluate.

I think the log messages are OK, aren’t they?

TRACE SNAPSHOT_LIST_IDS Listing all snapshot ids
TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot 11111
TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot 22222
TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot 33333
TRACE LIST_FILES Listing chunks/
INFO CHUNK_DELETE The chunk 00031e493bf0a5c0eb3286656e39b5d686fe57f78c572d947b9b33369539f828 has been permanently removed
INFO CHUNK_DELETE The chunk 001948f5dfb1793f243133c8f40f8ae0c839c98d04733edc4e1807892a1673c1 has been permanently removed

tangofan · 17 March 2020 08:03

I am reading this to mean that, if I run prune twice in a row (without any snapshot/revision created in between), those collected fossils just won’t be deleted (but otherwise nothing bad will happen). Is my understanding correct? Or does running prune twice in a row do bad things?

Is it possible to change (increase) that ignore time, e.g. set it to 10 days or 30 days?

gchen · 18 March 2020 01:45

This is correct.

No, currently it is fixed at 7 days but this can be changed easily.

tangofan · 18 March 2020 02:36

Thanks so much for your response.

I’m wondering, if leaving it at 7 days is a bit dangerous. I’m thinking of a situation where someone shuts down their computer(s), goes on a 2 week vacation, comes back and just boots the computer(s) up, when coming back. If they have a situation of two backup ids backing up into the same storage and they run a backup+prune job for backup id 1 before the backup job for backup id 2 runs, then it sounds like chunks could get deleted, even though they are still needed by backup id 2.

BTW is this 7 day ignore time the reason why the recommendation for B2 is to set “Keep prior versions for this many days” to a value >7?

gchen · 18 March 2020 15:20

Two backups done more than 7 days apart would not be a problem. Instead, a backup taking more than 7 days would be a problem, but you can always run a check after such a backup to make sure that the prune job didn’t accidentally delete a chunk that this backup is dependent on.

carlos · 16 May 2020 21:35

@TheBestPessimist
I need to delete everything that is older than 90 days on my backups. What commands should I be using? This is what I’m currently pruning with, but it has not freed space.
[-keep 0:90 -keep 7:30 -keep 3:10 -a]

saspus · 16 May 2020 21:57

Your command would remove snapshots older than 90 days. Whether this leads to freeing space depends on how many chunks there are that are only used by snapshots older than 90 days. Maybe none. This is highly correlated with your data turnover rate.

Imagine if your data set does not change; it’s a static set of unchanging files, like a photo library. Then pruning old snapshots will not free any space (and making new backups won’t take any new space either).

If by “everything” you refer to files, then it is not possible. Snapshots are immutable, as they should be.

Droolio · 17 May 2020 15:33

Note also, you won’t free the space those old snapshots take up until the second run of prune, since pruning is done in two steps: collection, then deletion. A subsequent backup on all your repositories needs to happen before the second run of prune.

carlos · 18 May 2020 20:06

Got it. I have a folder that I store an entire photo session, and after 90 days the selected files are already in another location, this entire photo session folder just serves as a backup if the client want something I haven’t delivered. So 90 days is a good amount of time for me.

carlos · 18 May 2020 20:06

If I understand right, every time I delete these folders I need to set a prune for, e.g., 7 days max, and then after 7 days these folders will be gone after 2 prunes and a backup?

Droolio · 18 May 2020 23:01

I think the 7 day thing is if you don’t run any backups for that long. Otherwise, the first prune will delete the snapshots and ‘collect’ all the relevant chunks - ready for their deletion in the second run. But according to your prune command, those files - if you delete them - won’t be gone for at least 90 days.

simonrozet · 14 October 2020 13:50

What is the behaviour of the prune command when there are more than one backup on a given day? (e.g. hourly backup)

Droolio · 15 October 2020 14:16

Pruning works at the resolution of a day, so it doesn’t make sense to run it more than once a day. Otherwise, it works fine with hourly backups.

Usually, you’d set up the retention period to keep hourly backups for at least a day, perhaps a week or two, or more, so recent backups don’t get touched until they’ve aged a while.

leerspace · 16 October 2020 01:41

If I understand the question correctly… if you run the prune command to only keep 1 backup per day and you have hourly backups, then it’ll keep the oldest revision from the day (based on the thread linked below).

simonrozet · 20 October 2020 08:18

Yep, that’s what I was after. Thank you!
