Full backup vs revision/snapshot

I just recently started looking into Duplicacy, and I think it could be a very viable option. I was just trying to make sure I understand how the software works compared to something I was using before: Acronis True Image Home.

In Acronis, I had set up a backup policy that would create a full backup, do 6 incremental backups, then start a new chain (full + 6 increments). I could then specify how many “chains” of backups I wanted to retain. On the other hand, if I used their cloud storage, it would do an initial full backup, then every subsequent backup would just be an incremental one.

Does Duplicacy work similarly to the Acronis cloud backup policy I mentioned above (one full backup, with every subsequent backup an increment)? While efficient, I was always concerned that if any backup (link in the chain) ended up corrupted, then all further backups down that chain would become useless. Is that a possibility with the way Duplicacy does backups, or am I misunderstanding something?

If that is indeed both how it functions and a possible risk (even if a remote one), is there a way to schedule a “restart” of the chain, i.e. a full backup? I set up the default pruning retention policy on the web page (-keep 0:1800 -keep 7:30 -keep 1:7 -a), but does that apply just to the incremental portions? I guess my difficulty/confusion is in seeing how the concepts of full/incremental/chains translate to Duplicacy terminology. I’m also not sure whether revisions and snapshots are the same thing.

Not at all :slight_smile:

In Duplicacy, backups are essentially “incremental forever”; however, every backup is also a “full” backup…

Backups (i.e. revisions in Duplicacy terminology) are composed of the unique chunks that make up that particular point in time, but these chunks are shared with all other backups too, which is how deduplication works: a particular “pattern” (which might be a file or part of a file) is stored only once in the storage, and may then be referenced by any or all other revisions of that snapshot, or even by revisions of other snapshots that share the same storage.

This does introduce some level of risk: if you lose chunks due to corruption or some other failure, many files in many revisions may be affected. So it’s important to run regular duplicacy check commands to ensure all chunks are present, and also to keep a second copy somewhere on different storage via duplicacy copy, so that corruption or loss of chunks on one storage doesn’t affect your ability to restore data.
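As a rough sketch of what that looks like in practice (the storage name “offsite” is just a placeholder for a second storage added to the same repository):

    # Verify that every chunk referenced by every revision is present in the storage
    duplicacy check -all

    # Keep an independent second copy of the backups on a different storage
    duplicacy copy -from default -to offsite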

The risk here is nothing Duplicacy-specific, though: it’s something all de-duplicating backup/storage solutions are subject to. It can be mitigated by the steps above, and the new “erasure coding” feature (New Feature: Erasure Coding) can protect against chunk corruption too.
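If I’m reading the feature post right, the option is given when the storage is created; something along these lines (the 5:2 data/parity split and the storage URL are only illustrative, see the linked post for the actual details):

    # Initialize a storage with erasure coding: 5 data shards + 2 parity shards per chunk
    duplicacy init -erasure-coding 5:2 my-repo b2://my-duplicacy-bucket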


Thanks for the info, I think I have a much better idea of how it works now. I just need to sit and think about how I want to set it up. I’ll also need to look into Erasure Coding, but it doesn’t seem to be available in duplicacy-web yet, and I’ve been using a Docker image to get everything set up.

My current thinking is to do a daily backup, with -keep 0:365 -keep 30:30 -keep 7:7 -keep 1:1 -a, so as to get daily snapshots for the last week, weekly snapshots for the last month, and monthly snapshots for the last year. I’d back up, prune, then check every day around midnight. Erasure Coding might add some resiliency to this when it becomes available. The copy could be a mitigation similar to a “full backup” in the chain scheme Acronis was using: I’d do a copy every week, so if I need to reset the “chain” of revisions due to a check failure, I can do so from the copy.
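For reference, my understanding of the -keep n:m syntax is “for revisions older than m days, keep one every n days” (with n = 0 meaning delete), so as a prune command that policy would look something like:

    # Daily for a week, weekly for a month, monthly for a year, nothing kept beyond a year
    duplicacy prune -a -keep 0:365 -keep 30:30 -keep 7:7 -keep 1:1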

My particular process/schedule is as follows:

  1. Hourly backups to my local NAS over SFTP
  2. An immediate duplicacy copy from the SFTP storage to cloud storage (OneDrive in my case), executed via a post-backup script (see here for info: Pre Command and Post Command Scripts; a sketch of such a script is below the list)
  3. duplicacy check every evening
  4. duplicacy prune every morning

I’ve also got some scheduling logic in there so backups won’t overlap (I’m using the CLI version).
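A minimal sketch of what such a post-backup script can look like (the repository path is a placeholder, and “onedrive” is whatever name the cloud storage was added under):

    #!/bin/sh
    # Typically placed at .duplicacy/scripts/post-backup inside the repository
    # (see the linked scripts documentation); Duplicacy runs it after each backup.
    cd /path/to/repository || exit 1

    # Push the newly created revision(s) from the default storage to the cloud copy
    duplicacy copy -from default -to onedrive -threads 4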

My storages pre-date the new erasure coding feature, so creating new storages with it enabled and copying my existing backups over (so the chunks are protected from corruption) is on my to-do list.
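Roughly what I have in mind, assuming the add command accepts the same erasure-coding option as init (the names and URL here are placeholders; -copy makes the new storage copy-compatible with the existing one):

    # Create a new, copy-compatible storage with erasure coding enabled
    duplicacy add -copy default -erasure-coding 5:2 nas-ec my-repo sftp://user@nas/backups-ec

    # Migrate the existing revisions into it
    duplicacy copy -from default -to nas-ec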

Just a few more questions to finish my setup.

  1. What’s the difference between doing, say, two separate backups versus doing a backup and a copy? If I do a backup and a copy, and the backup ends up corrupt, doesn’t that mean the copy will also be corrupt, unlike two independent backups?

  2. If I understand correctly, erasure coding requires a fresh initialization of the storage, so if I am using something like Backblaze, should I wait for this feature to arrive in the Web UI rather than wasting the initial upload?

  3. Is there any benefit to backing up the backup? I still have some time left on my Acronis subscription, and I was unsure whether I’d gain anything by pointing it at a Duplicacy backup so it could version the chunks in its cloud storage. I assume this is probably overkill.

EDIT:
4. My overall planned structure would be something like this: a backup to a local HDD, a backup to a GDrive mounted locally, and a backup to B2. Does this look fairly resilient? I just want to set up backups and not worry too much about them… but I’ve been burned before.
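To make that concrete, the three storages would be declared against the same repository roughly like so (the snapshot ID, paths and bucket name are made up for illustration):

    # Primary storage on the local HDD (becomes the "default" storage)
    duplicacy init my-pc /mnt/backup-hdd/duplicacy

    # Additional storages for the same repository
    duplicacy add gdrive my-pc /mnt/gdrive/duplicacy
    duplicacy add b2 my-pc b2://my-duplicacy-bucket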

Sorry for double posting, but I figured it was better to post here rather than creating a new topic. I spent the last few days researching, so I now understand why copy is better than a second backup, and I’ve got my local drives set up with Erasure Coding (but not B2, since I read there’s no point for that). I just wanted to verify that my job looks solid now.

  1. If CloudDrive and Local are considered spinning disks, there’s no point in doing more than 1 thread, correct?
  2. Is there a particular order of operations that is optimal here? I couldn’t figure out whether it made sense to do all the copies, then prune, then check, or to go through each storage sequentially (a sketch of one possible ordering is below the list).
  3. I have a Threadripper 2950X, so I’ve got CPU power. Should I bump the thread counts up to 30 for checks/prunes, or is there no point? Also, should I run those tasks in parallel?
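For context, this is the kind of sequential nightly job I’m picturing; the ordering (backup, then copies, then check, then prune) and the thread counts are only my current guesses rather than answers to the questions above, and the storage names reuse the placeholders from the earlier sketch:

    #!/bin/sh
    # Nightly job: back up to the local storage, replicate, then verify and prune.
    # "default" = local HDD, "gdrive" = locally mounted GDrive, "b2" = Backblaze B2.
    cd /path/to/repository || exit 1

    duplicacy backup -storage default -threads 1      # local spinning disk, so a single thread
    duplicacy copy -from default -to gdrive           # GDrive mounted locally
    duplicacy copy -from default -to b2 -threads 8    # remote object storage tolerates more threads

    # Apply the same check/prune policy to every storage, one at a time
    for st in default gdrive b2; do
        duplicacy check -storage "$st" -a
        duplicacy prune -storage "$st" -a -keep 0:365 -keep 30:30 -keep 7:7 -keep 1:1
    done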