Prune does not delete snapshots

dreamflasher · 2 July 2023 09:53

Please describe what you are doing to trigger the bug:
I run duplicacy.exe prune -id laptop -keep 365:365 -keep 30:90 -keep 7:10 -keep 1:2 -threads 64

Please describe what you expect to happen (but doesn’t):
Delete old snapshot according to the criteria.

Please describe what actually happens (the wrong behaviour):
It writes a long list of Deleting snapshot laptop at revision 1038. But when I run the command again it writes the same output. Also when I run the list command, it still shows these revisions.

saspus · 2 July 2023 21:12

What is the storage backend?

When duplicacy says “deleting snapshots” it does not actually delete them at that time; it adds them to a list to delete at the end, so a better message would have been “collecting snapshot”. If the actual deletion process then fails, the snapshots remain.

I have seen such failures with google drive if the account is shared with another account in any way but admin.

dreamflasher · 2 July 2023 21:31

The storage backend is b2, and pruning worked fine in the past.
I just noticed that the prune command finishes with Chunk 207ec450361dc466ffa210b1d1d71ce0a3e9cae0f6863514323cef5aa5b05419 does not exist in the storage. Which is not indicated as an error, but I guess it is?
How do I recover from this error?

dreamflasher · 2 July 2023 22:20

I’ll follow the advice here: Fix missing chunks

saspus · 3 July 2023 02:22

No, don’t.

I think what happened was that one of the previous prunes was interrupted after it deleted chunks but before it deleted snapshots. Now those snapshots are missing chunks, but in reality they should have been deleted and should not exist in the first place. Repairing them makes no sense: they will be pruned again anyway. This bug has been reported many times and solutions were suggested (e.g. here). It has not been fixed yet.

I would run check with -persist, take note of the failed snapshots, delete them manually from the storage (see the snapshots folder, in the subfolder named after the snapshot ID), and then clean up orphans by running prune with the -exhaustive flag.

dreamflasher · 4 July 2023 18:45

added a feature request for it: Feature Request: -persist flag for 'prune' that deletes snapshots no matter if there are missing chunks

a.alberti82 · 16 November 2024 16:33

Is there a simple way to remove snapshots when chunks are missing? I cannot believe this cannot be done. I read the page Fix missing chunks but it does not help because there is no way I can recover those missing chunks. What is the alternative? Throw away the entire storage after several years of backups?

saspus · 16 November 2024 17:02

Go to snapshots directory on the storage and delete the specific revision files. Then run exhaustive prune to clean up orphaned chunks.

But first elaborate on your config. Chunks don’t just vanish. If they do — change your storage provider.

a.alberti82 · 16 November 2024 18:14

Thank you for the super fast reply. Your suggestion makes a lot of sense. I use backblaze and removed from the snapshot folders those snapshots for which one or two (typically no more than 3) chunks were missing. There were quite a few compromised snapshot revisions scattered through several backup ids.

I believe the origin of the problem is what has been described in other posts. A prune command was started and chunks were deleted, but then the prune command was stopped before the snapshot revision was deleted, leaving behind a snapshot revision that cannot be purged anymore. Do you think this is a possible explanation? I don’t think chunks disappear just like that from backblaze.

My problem now is that when I run:

❯ duplicacy check -all -threads 200
Repository set to /Users/my_home
Storage set to b2://my-backup
Download URL is: https://f002.backblazeb2.com
Listing all chunks
17 snapshots and 1046 revisions
Total chunk size is 506,311M in 130956 chunks
All chunks referenced by snapshot xyz1 at revision 18 exist
All chunks referenced by snapshot xyz1 at revision 20 exist
All chunks referenced by snapshot xyz2 at revision 121 exist
All chunks referenced by snapshot xyz2 at revision 157 exist
All chunks referenced by snapshot xyz2 at revision 193 exist
All chunks referenced by snapshot xyz2 at revision 229 exist
All chunks referenced by snapshot xyz2 at revision 262 exist
All chunks referenced by snapshot xyz2 at revision 284 exist
All chunks referenced by snapshot xyz3 at revision 1 exist
All chunks referenced by snapshot xyz3 at revision 1030 exist
All chunks referenced by snapshot xyz3 at revision 1375 exist
All chunks referenced by snapshot xyz3 at revision 2814 exist
All chunks referenced by snapshot xyz3 at revision 2815 exist
All chunks referenced by snapshot xyz3 at revision 2816 exist
All chunks referenced by snapshot xyz3 at revision 2818 exist
Chunk 49305662730665976bd5b4c1454905b3a2d0c5a8953400d4a0e7e796c89df0b1 can't be found

it breaks and does not show where the chunk is missing. Perhaps in revision 2818?

Do you have any idea what I could try? Thanks a lot!

saspus · 16 November 2024 19:21

Yes, this makes sense.

Add -persist to check, it will then not give up on first failure.

Droolio · 16 November 2024 21:32

Probably the revision after that (2819 or whatever incremental revision number exists in that snapshot directory immediately after 2818). Adding the -v ‘global’ option may give more info, but it’s probably talking about the revision after.

Actually, -persist may not always work, but it depends if it’s a metadata chunk or not. They may get lucky as it doesn’t say “Failed to load chunks for snapshot…”.

Made a feature request about this very thing.

a.alberti82 · 17 November 2024 11:30

This was great advice, and I was able to resolve the problems with my storage. I wrote a note for my future self, which I would like to share here; let me know if you think something is not correctly represented. I hope this can be useful to other people:

1. Web resources

Forum discussion: link about the problem described in this note.
Fix missing chunks: that guide assumes that we are able to recover the missing chunks. Sometimes, however, we do not want to recover them because they were correctly removed.
Feature Request for a flag option to prune snapshots regardless of whether some chunks are missing. This feature request has not been implemented.

2. Description of the problem

When carrying out backups with duplicacy, it may happen that some of the chunks are missing for a given snapshot. This can occur:

if the storage is defective and files disappear (see also how to fix missing chunks). One could try to run a check with the -fossils and -resurrect options.
if prune operations are carried out with the -exclusive flag from one computer while backups or other operations on the same storage are carried out from another computer,
if a prune operation was started but then interrupted before completion. In that case chunks were deleted, but the prune command stopped before deleting the snapshot revisions that reference them. This leaves behind snapshot revisions that can no longer be pruned: an error is shown saying that chunks referenced by a given snapshot revision do not exist. Unfortunately, there is no flag that allows duplicacy to delete these snapshot revisions regardless of whether chunks are missing; a feature request was made for this, but it has not been considered yet.

3. Forcible deletion of snapshot revisions

Sometimes, there is no way to recover missing chunks. This could be, for example, because of an interrupted prune operation (see section 2), so there can be good reasons to forcibly remove snapshot revisions.

3.1. Content of a storage

In a storage, one should find three elements in the root: a config file, a snapshots folder, and a chunks folder. We are mainly interested in the snapshots folder, but we briefly describe both folders:

3.1.1. Chunks

The chunks folder contains 256 subfolders, each named with a hexadecimal number. For a given chunk (e.g., 49305662730665976bd5b4c1454905b3a2d0c5a8953400d4a0e7e796c89df0b1), the first two characters (in the example, 49) denote the subfolder under chunks. The remaining characters (305662730665976bd5b4c1454905b3a2d0c5a8953400d4a0e7e796c89df0b1) denote the name of the file for the given chunk.
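The mapping from a chunk ID to its location can be sketched in a few lines of Python. This is only an illustration that assumes the one-level layout described above (two characters for the subfolder, the rest for the file name); the function name chunk_storage_path is hypothetical and not part of duplicacy.

```python
def chunk_storage_path(chunk_id: str) -> str:
    """Return the path of a chunk relative to the storage root.

    Assumption: the layout described in this note, i.e. the first two
    characters name the subfolder and the remainder names the file.
    """
    if len(chunk_id) <= 2:
        raise ValueError("chunk ID is too short to split into folder and file")
    return f"chunks/{chunk_id[:2]}/{chunk_id[2:]}"


# Example with the chunk ID used in this section
print(chunk_storage_path(
    "49305662730665976bd5b4c1454905b3a2d0c5a8953400d4a0e7e796c89df0b1"
))
```

This can help to check by hand, in the storage's file browser, whether a chunk reported as missing is really absent.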

Typically, we do not need to operate manually on the chunk files. Once we have removed incomplete or corrupted snapshots, it is sufficient to run a prune operation on all snapshots (-all option) and with -exhaustive option. This will remove all orphaned chunks.

3.1.2. Snapshots

The snapshots folder contains a series of subfolders, one for each snapshot (backup set) that is backed up. Inside each snapshot folder, we find a series of files, each named after its revision number. The files are binary and cannot simply be opened in a text editor. If we want to figure out what they contain (usually not necessary), we can run:

duplicacy cat -id <snapshot_name> -r <revision_number>

3.2. Manual deletion of snapshot revisions

We can find the incomplete/corrupt snapshot revisions by running:

duplicacy check -all -persist

As an alternative to -all, we can choose a particular snapshot with the -id option if we know where the problems are and we want to speed up the process.

It is important to select -persist, or else duplicacy will give up on the first error reported.
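To keep the output of the check for later processing, it can be written to a file. The following is a minimal sketch, assuming that the duplicacy executable is on the PATH and that the script is started from the repository directory; the function names build_check_command and run_check and the default log file name are only illustrative choices.

```python
import subprocess
from typing import List, Optional


def build_check_command(snapshot_id: Optional[str] = None) -> List[str]:
    """Build the check command: -all by default, or -id for one snapshot."""
    cmd = ["duplicacy", "check"]
    if snapshot_id is None:
        cmd.append("-all")
    else:
        cmd += ["-id", snapshot_id]
    # -persist keeps the check going after the first missing chunk
    cmd.append("-persist")
    return cmd


def run_check(log_path: str = "duplicacy_check_log.txt",
              snapshot_id: Optional[str] = None) -> int:
    """Run the check and write everything it prints to log_path.

    The return code of duplicacy is passed back unchanged.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        result = subprocess.run(
            build_check_command(snapshot_id),
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return result.returncode
```

Calling run_check() from the repository directory checks all snapshots; run_check(snapshot_id="xyz2") restricts the check to one snapshot.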

Now, the output of the check command given above will look similar to:

Repository set to /Users/my_home_dir
Storage set to b2://my-backup
Download URL is: https://f002.backblazeb2.com
Listing all chunks
17 snapshots and 1046 revisions
Total chunk size is 506,311M in 130956 chunks
All chunks referenced by snapshot xyz1 at revision 18 exist
All chunks referenced by snapshot xyz1 at revision 20 exist
All chunks referenced by snapshot xyz2 at revision 121 exist
All chunks referenced by snapshot xyz2 at revision 157 exist
All chunks referenced by snapshot xyz2 at revision 193 exist
Chunk 6168db045e48a09d4a02d0e4d8cf0a5a031631285d92a1b418ce33c8cc709b00 referenced by snapshot xyz2 at revision 210 does not exist
Some chunks referenced by snapshot xyz2 at revision 210 are missing
All chunks referenced by snapshot xyz2 at revision 229 exist
All chunks referenced by snapshot xyz2 at revision 262 exist
All chunks referenced by snapshot xyz2 at revision 284 exist
All chunks referenced by snapshot xyz3 at revision 1 exist
All chunks referenced by snapshot xyz3 at revision 1030 exist
All chunks referenced by snapshot xyz3 at revision 1375 exist
All chunks referenced by snapshot xyz3 at revision 2814 exist
Chunk 04457b7d84b3278a23f5c6de4a87069368c9c3fa0ac18e7792a907e134516745 referenced by snapshot xyz3 at revision 2815 does not exist
Some chunks referenced by snapshot xyz3 at revision 2815 are missing
All chunks referenced by snapshot xyz3 at revision 2816 exist
All chunks referenced by snapshot xyz3 at revision 2818 exist
Some chunks referenced by some snapshots do not exist in the storage

In the example, two snapshot revisions have missing chunks: xyz2 at revision 210 and xyz3 at revision 2815. There is currently no way to remove these snapshot revisions with the duplicacy CLI. However, we can go to the storage and manually delete the revision files of the incomplete/corrupt snapshots.
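Before deleting anything, it can be useful to extract the list of affected revisions from the check output and review it. The sketch below only parses text and does not touch the storage; it assumes the message format shown in the example output above, and the function name find_broken_revisions is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as:
# "Chunk <hash> referenced by snapshot xyz2 at revision 210 does not exist"
MISSING_CHUNK = re.compile(
    r"Chunk \S+ referenced by snapshot (.+?) at revision (\d+) does not exist"
)


def find_broken_revisions(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return the sorted, de-duplicated (snapshot, revision) pairs with missing chunks."""
    found = set()
    for line in lines:
        match = MISSING_CHUNK.search(line)
        if match:
            found.add((match.group(1), int(match.group(2))))
    return sorted(found)


sample = [
    "All chunks referenced by snapshot xyz2 at revision 193 exist",
    "Chunk 6168db045e48a09d4a02d0e4d8cf0a5a031631285d92a1b418ce33c8cc709b00 "
    "referenced by snapshot xyz2 at revision 210 does not exist",
    "Some chunks referenced by snapshot xyz2 at revision 210 are missing",
    "Chunk 04457b7d84b3278a23f5c6de4a87069368c9c3fa0ac18e7792a907e134516745 "
    "referenced by snapshot xyz3 at revision 2815 does not exist",
]
print(find_broken_revisions(sample))  # [('xyz2', 210), ('xyz3', 2815)]
```

A revision with several missing chunks appears only once in the result, because the pairs are collected in a set.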

If the number of files to be deleted is very large, it is worth using a script, such as the Python script I wrote for backblaze storage. The script assumes that the output of the check operation is stored in a local file named duplicacy_check_log.txt

import re
from b2sdk.v2 import AuthInfoCache, B2Api, InMemoryAccountInfo

ACCOUNT_ID = '123123132'
APPLICATION_KEY = '012340123401234012340123401234'

info = InMemoryAccountInfo()

b2_api = B2Api(info, cache=AuthInfoCache(info))

b2_api.authorize_account("production", ACCOUNT_ID, APPLICATION_KEY)

bucket_name = 'my-backup-name'

bucket = b2_api.get_bucket_by_name(bucket_name)

# Define the regex pattern
pattern = r'Chunk .*? referenced by snapshot (.*?) at revision (\d+) does not exist'

# Initialize a set to keep track of already deleted files
deleted_files = set()

# Open and read the log file
with open('duplicacy_check_log.txt', 'r') as file:
    for line in file:
        # Search for the pattern anywhere in the line (tolerates prefixes such as timestamps)
        match = re.search(pattern, line)
        if match:
            # Extract the snapshot name and revision number
            snapshot = match.group(1)
            revision = match.group(2)

            # Define the file name or path in the B2 bucket
            b2_file = f"snapshots/{snapshot}/{revision}"

            # Check if the file has already been deleted
            if b2_file in deleted_files:
                continue

            # Add the file to the set of deleted files
            deleted_files.add(b2_file)

            try:
                # Fetch the file information and delete it
                file_info = bucket.get_file_info_by_name(b2_file)
                bucket.delete_file_version(file_info.id_, b2_file)
                print(f"Revision {revision} deleted from snapshot {snapshot}.")
            except Exception as e:
                # Report the actual error; a missing file is only one possible cause
                print(f"Could not delete revision {revision} of snapshot {snapshot}: {e}")

Note that if you use a different storage, you would have to adapt the script.
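As an illustration of such an adaptation, here is a minimal sketch for a storage that lives on a local or mounted disk. It assumes the snapshots/<snapshot_name>/<revision_number> layout described in section 3.1.2 and takes the (snapshot, revision) pairs as input; the function name delete_local_revisions is hypothetical, and by default it only reports what it would delete.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def delete_local_revisions(storage_root: str,
                           revisions: Iterable[Tuple[str, int]],
                           dry_run: bool = True) -> List[Path]:
    """Delete (or, with dry_run=True, only list) revision files on a local storage.

    Returns the paths of the revision files that were found.
    """
    root = Path(storage_root)
    handled = []
    for snapshot, revision in sorted(set(revisions)):
        path = root / "snapshots" / snapshot / str(revision)
        if not path.is_file():
            print(f"Revision {revision} of snapshot {snapshot} not found at {path}.")
            continue
        if dry_run:
            print(f"Would delete {path}")
        else:
            path.unlink()
            print(f"Deleted {path}")
        handled.append(path)
    return handled
```

Running it first with the default dry_run=True and comparing the printed paths with the check output reduces the risk of deleting a healthy revision; passing dry_run=False then performs the deletion.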

3.3. Final check

After the incomplete/corrupt snapshot revisions have been deleted, it is important to run:

duplicacy prune -all -exhaustive

Add -exclusive only if you are sure that no other backup or other operation is currently running on the chosen storage.

Finally, you can run a check to verify that the storage is again in good shape:

duplicacy check -all -persist