Perpetual, Versioned Backups (Deletes)

I’m looking for something that will provide a workflow akin to tape backups, like:

  1. I take a full backup of the filesystem (call this Backup 1) to an S3-compatible online storage
  2. I take incremental backups every hour for the next year (call these Backups 2 - 8,760)
  3. During the time of the incremental backups, there will be files deleted from the source
  4. Ability to restore all files up to a specific revision #, meaning all files, even if they were not backed up specifically in that revision.

Setup: I have a single backup job configured in the Web GUI, running hourly now that the initial large backup has completed. There is no purge configured – just the single backup job. All of the hourly backups run quickly and appear to be updating the destination correctly.

Q1: Is there any note of those files being deleted in the Duplicacy repository?
Q2: If I restore to some incremental revision, like 3,500, will all of the files backed up from revisions 1 through 3,500 be restored (knowing, of course, that for any files overwritten along the way, only the newest version as of revision 3,500 would be restored)?
Q3: If I did that restore, to revision 3,500, would all of the files that have been backed up between revision 1 and 3,500 be restored, or only those in that specific revision (which could be just a single file)?
Q4 (important): If I restore all files at revision 3,500 (in hopes that the answer to Q3 is that it will restore all files backed up… up to revision 3,500) to a new location, will files that were previously deleted also be restored, even though they had been deleted from the source?
Q5: What role does Purge play in this? In my circumstance, I’m thinking it plays no role, because I do not want any data removed from the destination – only that data won’t be restored if it has been deleted locally.

Ultimately, the goal here is a backup that holds onto deleted files but won’t restore them if they’ve been deleted. This lets me quickly restore a file that someone deleted without realizing it, but it also means deleted files won’t come back if there is a catastrophic event and the whole backup needs to be restored to a revision.

Thank you for this fantastic community. I have done a lot of searching, and it doesn’t seem like these specific questions have been answered yet – or I was not able to find them. I did find this:

– Brian

Hi Brian,

Welcome to the community!

Q1: Duplicacy creates a versioned history of the state of your repository. Each revision therefore reflects the state of the filesystem at the time of that backup: files that existed at that moment are present, in the state they were in at the time; files that did not exist are not. So a deleted file isn’t specially flagged anywhere – it simply stops appearing in revisions created after the deletion.
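To see that history concretely, the CLI’s list command (the Web GUI drives the same engine) enumerates the revisions of a repository; a rough sketch, with placeholder paths:

    # Run from an initialized repository directory (path is a placeholder).
    cd /path/to/repository
    duplicacy list                   # one line per revision: its number and creation time
    duplicacy list -r 3500 -files    # list the files recorded in revision 3500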

As an aside: on Windows and macOS, Duplicacy can create a local filesystem snapshot at the start of a backup and back that up instead of the live filesystem, so that even if the backup takes 20 hours and some files change while it runs, the revision will contain the files exactly as they were at the start. This may be important for ensuring the consistency of your dataset. See the -vss flag of the backup command.
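From the CLI that could look like the minimal sketch below (in the Web GUI you would add -vss to the backup job’s options); creating the snapshot requires administrator/root privileges:

    # Back up a point-in-time filesystem snapshot instead of the live filesystem.
    cd /path/to/repository
    duplicacy backup -stats -vss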

Q2: No. Only files that were present when revision 3,500 was taken will be available for restore, in the state they were in when revision 3,500 was created. (You can choose to restore a subset of those files, of course – you don’t have to restore the whole dataset, just FYI.)
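As a rough sketch of what that looks like from the CLI (the revision number is taken from your example; paths are placeholders):

    # Restore into a separate location so the live data is untouched. The directory
    # must first be initialized against the same storage with the same snapshot ID.
    cd /path/to/restore/location
    duplicacy restore -r 3500 -stats    # restores exactly the files recorded in revision 3500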

Q3: Only those in that specific revision. Remember, each revision reflects the state of the filesystem at the time of backup.

Q4: Skipping this, as the premise is wrong – only files present in the revision will be restored.

Q5: Prune deletes specific revisions matching certain conditions. For example, people often don’t want to keep hourly backups going back years; for data older than a few months, perhaps weekly backups are enough. It simply makes things easier to manage – you probably don’t care about such fine granularity for something that happened years ago. Reducing the frequency of backups going back in time can also free some space if there was a very short-lived set of files in the past, and it may improve restore performance by not having to enumerate a massive number of snapshot revisions.

A sensible choice, for example, is to keep hourly backups for the past 24 hours, daily backups for the past month, and weekly backups for all previous months. This happens to be the default in other backup tools, such as macOS’s Time Machine.
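If you ever do want that kind of retention, the CLI’s prune command expresses it with -keep n:m options, meaning “keep one revision every n days for revisions older than m days” (it works in days, not hours, and it really does delete data from the storage, which per your Q5 you don’t want right now). A rough equivalent of the schedule above:

    # Keep everything from the last day, one revision per day after that,
    # and one revision per week once revisions are older than 30 days.
    # -keep options are listed from the largest age threshold down.
    duplicacy prune -keep 7:30 -keep 1:1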

Not only deleted files, but also corrupted ones – whether from bit rot or user error. Imagine a photo you haven’t looked at for 10 years, only to realize it rotted 5 years ago. You should be able to go back in time and restore an uncorrupted version.

This is pretty much the definition of what backup software should be able to do :slight_smile: Duplicacy, like many other tools, is designed to behave exactly this way.

Duplicacy takes it a step further: each revision behaves like an independent backup, yet it is still incremental, deduplication works across machines, and no locking database is used, which dramatically improves reliability. If you want the nitty-gritty details, here is a paper describing how it works: https://github.com/gilbertchen/duplicacy/blob/master/duplicacy_paper.pdf