Migrating from CrashPlan to Duplicacy

I have this idea about how I might be able to preserve some of my backup archive from CrashPlan when changing to Duplicacy, and I wonder what the pitfalls might be, or whether there is a better strategy.

So, to start with, I think I can live with losing the different versions/revisions of my backed up files that I can currently still access in CrashPlan. My main aim in this whole exercise is to copy over the latest version of all files I ever backed up, including (and in particular) the ones that have in the meantime been deleted (I have CrashPlan set to keep at least one version of every file, no matter what). Simply because chances are that at least one of them has been deleted by mistake. Maybe recently, or maybe two years ago.

Anyway, the basic idea is that I restore the latest version of all files in my CrashPlan archive into a separate folder (not to the original location) and then I’ll simply back up that folder.

First I thought I’d create a separate repository for that folder, but then I figured that is not necessary (or perhaps even disadvantageous). I’ll just include the folder in my existing repository and delete it once it has been backed up.

I’ll probably tag that snapshot with “crashplan” or something so that I know where to find those crashplan files, should I ever need them.
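
In practice I imagine it would be something like this (the tag name is just an example):

# back up the existing repository, now containing the restored CrashPlan folder,
# and tag the resulting snapshot so it can be found later:
duplicacy backup -t crashplan

# later, to locate that snapshot again:
duplicacy list -t crashplan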

Now here is a crucial question: how do I make sure that that snapshot never gets deleted (since it’s the only one with those (deleted) crashplan files)? Is there any way to mark a snapshot as “untouchable”, i.e. to tell the prune command to simply leave it alone? The only thing I can find is the -ignore option but I don’t quite understand what it does. The wiki says:

It also provide an -ignore option that can be used to skip certain repositories when deciding the deletion criteria.
Does that mean that the -ignore option makes the snapshot immune to deletion, or the opposite? There also seems to be some confusion between the terms “repository” and “snapshot”. The above quote talks about repositories, while the official help text talks about snapshots:

-ignore <id> [+]         ignore the specified snapshot id when deciding if fossils can be deleted

And then there seems to be a nuanced distinction between “snapshot” and “snapshot id” that adds to the confusion:

For fossils collected in the fossil collection step to be eligible for safe deletion in the fossil deletion step, at least one new snapshot from each snapshot id must be created between two runs of the prune command.

Anyway, if it is possible to be off-topic in the topic-defining post, then this was off-topic. So back to the real stuff: how do I prevent a specific snapshot from being deleted? If it’s not possible, here are some specs for a feature request:

  • In order to prevent deletion, one should not have to specify the protected snapshot every time the prune command is run. For obvious reasons.
  • This means that the information about protected snapshots needs to be stored in the backend.
  • One option would be to write it to the config file.
  • But the easier option is probably to hardcode a specific tag (e.g. protected) as marking an undeletable snapshot. Done. It’s a bit less flexible and users are forced to use that specific tag, but it’s so much easier to implement (and simpler to use)…

I would think the best option is to change the repository id after your Crashplan files are backed up to the storage. This way, the snapshot you create from the initial backup will be the only snapshot with its repository id. The only way you can delete that snapshot is to specify that repository id and the -exclusive option, since without the -exclusive option you can’t delete the last snapshot of a repository id.
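
In command form, roughly (the id is hypothetical):

# a routine prune of your active repository id never touches snapshots under other ids;
# removing the single CrashPlan snapshot would require naming it explicitly, e.g.:
duplicacy prune -id crashplan-archive -r 1 -exclusive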


I don’t think that’s a good solution because it means that I have to run two separate searches whenever I look for a file, right?

Wouldn’t it be enough to prune with parameters like prune -keep 1:36000 and never use a -keep 0:whateverNumberOfDays and be fine for the next century?

Also for safety you could combine both: first do a full backup with its own snapshot, then create a new, active one, do the full backup again (which should not waste any storage due to deduplication) and then use that one as your active archive with a prune setting like, for instance:

duplicacy prune -keep 36000:180 -keep 30:180 -keep 7:30 -keep 1:7
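
Spelled out, the sequence might look roughly like this (the ids are made up, and changing the id by editing .duplicacy/preferences is my assumption):

# 1. back up everything, including the restored CrashPlan files, under a one-off id:
duplicacy init crashplan-archive <storage url>
duplicacy backup -t crashplan

# 2. switch the repository to the id you will keep using (e.g. by changing the "id"
#    field in .duplicacy/preferences) and back up again; thanks to deduplication the
#    second backup re-uses the chunks that are already in the storage:
duplicacy backup

# 3. from then on, prune only the active id with the schedule above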

Wouldn’t it be enough to prune with parameters like prune -keep 1:36000 and never use a -keep 0:whateverNumberOfDays

As I understood it (which may well be wrong), that will not guarantee that my crashplan snapshot survives the more short-term prunings. In other words, even if there will always be one snapshot left, even after years, chances are that it will not be my crashplan snapshot.

This, on the other hand, looks like a very smart idea:

Also for safety you could combine both: first do a full backup with its own snapshot, then create a new, active one, do the full backup again (which should not waste any storage due to deduplication) and then use that one as your active archive

My only worry here would be: since the other repository will not be active after the first backup, is there not a risk that its chunks will gradually be pruned out by pruning actions in the other repositories? But I may be misinterpreting the meaning of the 7-day period in the prune command…

I don’t think that pruning will remove blocks referenced by snapshots from other snapshot ids; this would make the whole system very fragile. But I’m not sure anymore that my other idea with regard to the prune settings would work as expected (especially after reading your other related posting).

So better to wait for Gilbert to jump in…

Right, if an existing snapshot isn’t included in the snapshots to be deleted, all chunks that it references are guaranteed to remain in the storage. The -ignore option only means it will ignore an ongoing snapshot from that repository id, if there is one, that started before the fossil collection step and takes more than 7 days to finish.

Right, if an existing snapshot isn’t included in the snapshots to be deleted, all chunks that it references are guaranteed to remain in the storage.

Excellent! So that’s that solved! Thanks a lot, Nelvin!

The -ignore option only means it will ignore an ongoing snapshot from that repository id, if there is one, that started before the fossil collection step and takes more than 7 days to finish.

Sorry if this sounds like a dumb question, but it’s important to avoid misunderstandings here: so ignoring that ongoing snapshot implies what? That it will delete chunks that might be used by that ongoing snapshot, or that it will not do that?

so ignoring that ongoing snapshot implies what? That it will delete chunks that might be used by that ongoing snapshot or that it will not do that?

It implies that a backup taking longer than 7 days may have some of its chunks uploaded 7 days ago deleted by a recent prune operation. So you should run a check command right after such a backup (see the notes in Prune command details).
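
For example (run from the repository right after such a long backup finishes):

# verify that all chunks referenced by this repository's snapshots still exist in the storage:
duplicacy check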

It implies that a backup taking longer than 7 days may have some of its chunks uploaded 7 days ago deleted by a recent prune operation.

I see. But why would you want to do that?

see the notes in Prune command details

The notes in the wiki only say this:

It also provide an -ignore option that can be used to skip certain repositories when deciding the deletion criteria.

So it is not clear there either (which is why I asked the question).

The fossil deletion step can only proceed if there is one new snapshot from each snapshot id. But an inactive repository that has not done any backup for a long time would prevent any fossils from being deleted. Therefore we need to set a period of 7 days. The -ignore option can also be used to force the fossil deletion step even without a new snapshot from a certain snapshot id (for instance, in case you know for sure there wasn’t any ongoing backup in the fossil collection step from the snapshot id).
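
As an illustration (the snapshot id is made up):

# let fossil deletion proceed even though "retired-laptop" has not backed up recently,
# because you know no backup from it was running during the fossil collection step:
duplicacy prune -keep 0:360 -ignore retired-laptop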

Okay, so basically, you only want to use the -ignore option if you have such an inactive repository and you don’t mind the backups from that repo potentially being destroyed, right?

But that also means that we’re basically back to square one (i.e. the OP), because Nelvin’s ingenious solution won’t really work, unless I am fine with fossils piling up in my storage forever:

first do a full backup with its own snapshot, then create a new, active one, do the full backup again (which should not waste any storage due to deduplication) and then use that one as your active archive

In this scenario, there is, by definition, always going to be an inactive repository, which means no fossils will ever be deleted.

No wait, the seven-day period will eventually apply:

Duplicacy by default will ignore repositories that have no new backup in the past 7 days. Source

which means that after seven days the Crashplan snapshot will be ignored. And unless that ignoring is a different kind of ignoring than the one triggered by the -ignore option, that would mean (as I understood it) that the snapshot risks losing some of its chunks.

But at the same time, you confirmed above that

if an existing snapshot isn’t included in the snapshots to be deleted, all chunks that it references are guaranteed to remain in the storage.

I’m confused.

I’m confused.

Me too!

… the one triggered by the -ignore option, that would mean (as I understood it) that the snapshot risks losing some of its chunks.

This is incorrect. When a snapshot id is ignored, it doesn’t mean that its snapshots will not be counted when determining which chunks are not referenced. Rather, it means that any new snapshots that could have started from this snapshot id while fossils were being identified will be ignored. For any existing known snapshot, if you don’t specify it to be deleted, its chunks will never be marked as fossils.

For any existing known snapshot, if you don’t specify it to be deleted, its chunks will never be marked as fossils.

Now that you say it, it all sounds so simple! Thanks for explaining. Nevertheless, I dare assume that I am not the only one who finds it challenging to wrap their head around how duplicacy works. So I think it might be useful to have a flowchart for how the prune command works.

You made a start in visualizing the logic of the Lock free deduplication algorithm, but it’s very basic and didn’t help me solve the puzzle above.

Regarding the need to keep a copy of every deleted file, here is another idea for a workaround: what if I create a separate snapshot ID for my recycling bin? That should cover at least those files that are deleted via the recycle bin, right? Of course, there is always the possibility that files are deleted without going through the recycle bin, but I can’t think of much more than:

  • the user deliberately deletes directly (Shift + delete)
  • an uninstall script deletes files
  • Oh, and network shares. Hm, that is actually a bit of a bugger.

Anyway, what do you think? Are there any pitfalls with this?
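
Roughly what I have in mind (the path, id and storage URL are placeholders):

# give the recycle bin its own repository and snapshot id:
cd 'C:\$RECYCLE.BIN'
duplicacy init recycle-bin <storage url>
duplicacy backup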

That should work. I would even suggest backing up the recycling bin to a cheaper storage as the need to recover a deleted file is less likely than the need to recover an older version of an existing file.
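
For the cheaper-storage variant, something along these lines should work, since a repository can have more than one storage attached (the storage name and bucket URL are placeholders):

# attach a second, cheaper storage to the recycle-bin repository and back up to it:
duplicacy add cheap recycle-bin b2://my-cheap-bucket
duplicacy backup -storage cheap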

It looks like this doesn’t work. The recycle bin is something of a special folder. The files are not stored under their original names:

PS C:\$RECYCLE.BIN> cd .\S-1-5-21-3423957463-4063420650-2348731123-1003\
PS C:\$RECYCLE.BIN\S-1-5-21-3423957463-4063420650-2348731123-1003> dir


    Directory: C:\$RECYCLE.BIN\S-1-5-21-3423957463-4063420650-2348731123-1003


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-r---       2016-04-11      6:48                $R8GRQCT
d-r---       2016-01-01      1:19                $RR8P96S
-a----       2018-03-01     22:08            102 $I3WV0DD.jpg
-a----       2018-03-01     22:09             98 $IJFBTCK.log
-a----       2018-03-01     22:08            102 $IPWDI29.jpg
-a----       2018-03-01     22:54            118 $IVZBAZO.log
-a----       2018-03-01     22:09             70 $IZ2PQOO.lnk
-a----       2018-02-27     20:59         254908 $R3WV0DD.jpg
-a----       2016-05-02      1:27            366 $RJFBTCK.log
-a----       2018-02-27     20:58         239032 $RPWDI29.jpg
-a----       2018-03-01     22:16            114 $RVZBAZO.log
-a----       2017-12-31      2:01           1028 $RZ2PQOO.lnk


PS C:\$RECYCLE.BIN\S-1-5-21-3423957463-4063420650-2348731123-1003>

So even if it is possible to back up those files, it wouldn’t be possible to actually find any file.