Why does it take so long to delete a single revision?

wikify
prune

#1

I want to delete revision 20 of my snapshot because it contains a number of large files that were not meant to be backed up. So I do:

duplicacy prune -r 20

It takes ages. (I didn’t check when exactly it finished, but probably longer than 30 minutes.)

Why does it take so long? I would imagine that it’s a matter of deleting the snapshot file. Done. The rest will be taken care of by subsequent prunings, no?

Looking at the Task Manager, I also see that during that long deletion time, duplicacy is using significant amount of CPU (between 20 and 40%) as well as quite some bandwith. It almost looks like it is downloading all the snapshots in the entire revision in order to delete them. :woozy_face:

What is Duplicacy doing?


#2

I imagine prune is making a collection of chunks for future deletion. It’ll create the fossil collection file and rename the chunks to .fsl - hence, why you might be seeing that sort of bandwidth. Then it deletes the snapshot file.

If you haven’t run prune before in that repository, it should also download snapshot chunks (not file chunks)… i.e. meta data associated with every(?) revision and every(?) repository stored in that storage, and places it in the cache. But I can’t imagine that would occupy too much space, unless you have a lot of snapshots?


#3

I wonder if this is a general problem with prune if you don’t use multi-threading? I’ll try duplicacy prune -r 20 -threads 6 and see if that changes anything.

That could explain something. I don’t think I did run prune on that machine before and while that machine/repo only has 20 revisions there are quite a few (maybe 2-300?) for all repositories combined.

But the question is: why does it do that? Why does it need a local copy of all snapshot metadata when running prune, especially when it has a very simple task: delete revision 20?

I’m not complaining about space use (though I didn’t check it), just network bandwidth.


#4

This reference should be updated with my response eventually: Snapshot file format .

In order to do its job, prune needs to know all the snapshots for all the repositories.

2018-08-06 07:25:32.405 TRACE SNAPSHOT_LIST_IDS Listing all snapshot ids
2018-08-06 07:25:32.406 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot asus
2018-08-06 07:25:32.409 DEBUG DOWNLOAD_FILE Downloaded file snapshots/asus/1
[...]
2018-08-06 07:25:32.432 DEBUG DOWNLOAD_FILE Downloaded file snapshots/asus/17
2018-08-06 07:25:32.433 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot tbp-pc
2018-08-06 07:25:32.437 DEBUG DOWNLOAD_FILE Downloaded file snapshots/tbp-pc/1
2018-08-06 07:25:32.440 DEBUG DOWNLOAD_FILE Downloaded file snapshots/tbp-pc/1491
[...]
2018-08-06 07:25:32.552 DEBUG DOWNLOAD_FILE Downloaded file snapshots/tbp-pc/2958
2018-08-06 07:25:32.552 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot tbp-v
2018-08-06 07:25:32.558 DEBUG DOWNLOAD_FILE Downloaded file snapshots/tbp-v/1
[...]
2018-08-06 07:25:32.663 DEBUG DOWNLOAD_FILE Downloaded file snapshots/tbp-v/521
2018-08-06 07:25:32.664 INFO FOSSIL_COLLECT Fossil collection 10 found

This is needed because prune needs to know firstly which chunks belong to your desired snapshot, and secondly which chunks are needed for the other snapshots which won’t be deleted.

All the snapshot chunk list have to be downloaded

But this is not enough (only having downloaded snapshots). As it is explained in Snapshot file format, the chunk list referenced in each snapshot is actually kept separately and split, because it can get gigantic for large backups (like yours @tophee :P). Because of this, many more chunks are downloaded again.

:question: @gchen: i also need an explanation here: i believe that ALL chunk list chunks need to be downloaded when pruning, in order to have the complete chunk per snapshot utilisation, right? (not only those for that particular revision)

2018-08-06 07:25:34.357 DEBUG CHUNK_DOWNLOAD Chunk 92b4fffdd881516634b7bd001883300d9baaeed4fd6a9e49f9129f826dd0b9c6 has been downloaded
2018-08-06 07:25:34.778 DEBUG DOWNLOAD_FETCH Fetching chunk 6be28a04d0ea7f4a9fbdb4014dd57bf03fb7d2f6f67c6bea0a30ad50d34a6c38
2018-08-06 07:25:34.881 DEBUG CHUNK_DOWNLOAD Chunk 6be28a04d0ea7f4a9fbdb4014dd57bf03fb7d2f6f67c6bea0a30ad50d34a6c38 has been downloaded
[... about 150 chunks later]
2018-08-06 07:25:51.422 DEBUG DOWNLOAD_FETCH Fetching chunk c58406b9613b3987811febed35040b4535cd496cd3ca026d8cde00a3d1d9a9ea
2018-08-06 07:25:51.427 DEBUG CHUNK_DOWNLOAD Chunk c58406b9613b3987811febed35040b4535cd496cd3ca026d8cde00a3d1d9a9ea has been downloaded
2018-08-06 07:25:51.432 DEBUG DOWNLOAD_FETCH Fetching chunk a11aebee65ae323119eda614ca1968db6a026c3a55618734ae28cf61a373c98c
2018-08-06 07:25:51.436 DEBUG CHUNK_DOWNLOAD Chunk a11aebee65ae323119eda614ca1968db6a026c3a55618734ae28cf61a373c98c has been downloaded

List all the chunks in the storage

Okay: so far duplicacy knows all the chunks which belong to which revision, so it can chose what to actually delete. Now what i see is that duplicacy lists all the chunks in the storage.

I do believe that this may be needed to check which chunks are fossils (but for that there’s already the fossil collection s), possibly for restoring those that need to be restored.

:question: I have no idea why duplicacy does this and thus i ask @gchen again to explain (as detailed as possible -> w/o pasting code, even though that is the one true truth :D).

2018-08-06 07:25:51.442 TRACE LIST_FILES Listing chunks/
2018-08-06 07:25:51.446 TRACE LIST_FILES Listing chunks/ff/
2018-08-06 07:25:51.789 TRACE LIST_FILES Listing chunks/fc/
[... about 260 folders]
2018-08-06 07:25:51.900 TRACE LIST_FILES Listing chunks/fb/
2018-08-06 07:26:17.356 TRACE LIST_FILES Listing chunks/01/
2018-08-06 07:26:17.460 TRACE LIST_FILES Listing chunks/00/

Finally do the pruning

And finally, after downloading all those chunks and figuring out which are actually good to delete (read fossilize!) duplicacy starts doing the actual pruning: take each of the chunks, make it a fossil (add “.fsl” as the extension) and add it to a new fossil collection (which will be deleted in another prune run).


From my observation, the longest part of pruning is the actual making of fossils since at least 1 API call is needed for each chunk (directly rename the chunk, or copy the chunk with new name + delete old chunk). There possibly may be multiple calls, depending on the folder hierarchy – but i’ve got no idea about that.


:question: I suppose there are other steps, like cleaning old fossil collections, restoring fossilized chunks which are needed now, and other, but i don’t know what, so i’ll let @gchen improve on my answer.


#6

Okay, I did it and it went a lot faster now, but it is difficult to say whether that was because it was no longer the first time I ran prune (though none of the previous runs was successful) or because of the multi-thread. I also used -log which I failed to do earlier.

And being able to look at the log, a few more questions arise:

Duplicacy should be more talkative

PS C:\> duplicacy -log prune -r 20 -threads 6 -storage pcloudwebdav
2018-09-19 09:35:41.529 INFO STORAGE_SET Storage set to webdav://<myusername>@webdav.pcloud.com/Backup/Duplicacy
2018-09-19 09:43:01.884 INFO SNAPSHOT_DELETE Deleting snapshot ALPHA_C at revision 20
2018-09-19 09:54:45.291 WARN CHUNK_FOSSILIZE Chunk 3d58ffa679ede7257e6581da4c040dd9b7969cb158e45dc04d27f353a3e37c52 is already a fossil
2018-09-19 09:54:45.448 WARN CHUNK_FOSSILIZE Chunk 5e530ad112d412eee874d244b1e574c3bb4eaa58d0c5d44dc68759d55e8ea796 is already a fossil

What was duplicacy doing between 09:35:41.529 and 09:43:01.884? And then between 09:43:01.884 and 09:54:45.291?

In the case of the second empty time period, I’m guessing that it was fossilizing chunks. But it doesn’t seem plausible to me that Chunk 5e530ad112d412eee874d244b1e574c3bb4eaa58d0c5d44dc68759d55e8ea796 is already a fossil would be classed as a warning, while the actual act of fossilizing a chunk that was not a fossil previously is not communicated at all. If one should be ranked more critical than the other, then, surely, fossilizing would be more critical than finding out that something has already been fossilized, right? But I don’t think either is worth a warning at all. Both should be classed as INFO.

At the very least, both acts should be logged. Just like in the backupcommand where both uploaded and skipped junks are logged as INFO.

More generally, whatever duplicacy was doing during those silent periods, I think it should talk about it. In fact, I’d like to suggest that 7 minutes of silence is bad UX in any command, not just prune. It would be great if duplicacy would default to printing an update at least once per minute, if -log is used, of course. If -log` isn’t used, I would nevertheless deem it good practice that it at least responds with an acknowledgement of the command and a hint that this may take some time. Something like

PS C:\> duplicacy prune -r 20
Storage set to webdav://<myusername>@webdav.pcloud.com/Backup/Duplicacy
Starting pruning. This may take a while...

Currently it only confirms the storage, not the start of the operation would like to raise

Check for latest revision immediately

Another issue I would like to raise (and which is more on topic here) relates to this:

PS C:\> duplicacy prune -r 20 -storage pcloudwebdav
Storage set to webdav://<myusername>@webdav.pcloud.com/Backup/Duplicacy
The latest snapshot ALPHA_C at revision 20 can't be deleted in non-exclusive mode

This was my very first try at deleting revision 20. Unfortunately, I forgot the -log option so we don’t have the times, but I can tell you that it took at least 30 minutes, possibly several hours, before duplicacy concluded that it cannot delete the revision.

If I understand the message correctly, the reason why it could not be deleted was simply that it was the latest revision. Fine. But why does it not tell me so immediately? It seems to me that the order of things needs to switched around: First check if revision is latest and abort if so, then start downloading snapshot meta data. Currently it seems to be the other way around.


#7

when doing any sort of debugging you should run duplicacy -log -d <command>


#8

It was doing 2 things: listing all chunks in the storage and downloading all snapshots to construct the chunk reference map. Considering it is pcloud and the protocol is WebDAV, it is not a surprise to me why it is so slow.

The -threads option won’t help here, since that is the number of threads used to rename chunks. Duplicacy only uses one thread to list chunks.


#9

Okay, thanks.

What about:

and

and

@gchen?